How to Scrape Well with BeautifulSoup - ドラあり！ (The Story of an Ant Taking on a Dragon): Days of a Python User

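When scraping with BeautifulSoup, a single find call usually gets you what you want, as long as the tag structure is sensible.

However, pulling out several tags that share the same name takes a rather more hands-on approach.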

Sample document

Let's try scraping the following sample document.

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    this title
#   </title>
#  </head>
#  <body>
#   <p id="test_id" align="center">
#    This is test paragraph
#    <b>
#     test
#    </b>
#    .
#   </p>
#   <p id="test2_id" align="blah">
#    This is test2 paragraph
#    <b>
#     test2
#    </b>
#    .
#   </p>
#  </body>
# </html>
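
The post never shows how the soup object above is constructed. Here is a minimal sketch of one way to build it, assuming the bs4 package (BeautifulSoup 4) on Python 3 and the built-in html.parser. The html string is reconstructed from the prettify() output above, so its exact whitespace is an assumption.

```python
from bs4 import BeautifulSoup

# Assumed source markup, reconstructed from the prettify() output above.
html = (
    "<html><head><title>this title</title></head><body>"
    '<p id="test_id" align="center">This is test paragraph <b>test</b>.</p>'
    '<p id="test2_id" align="blah">This is test2 paragraph <b>test2</b>.</p>'
    "</body></html>"
)

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string             # <title> has a single text child
paragraph_count = len(soup.find_all("p"))  # both <p> tags are found
print(title_text, paragraph_count)
```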

Simple examples of find

Get only the p tag with align="center"

soup.find('p', align="center")
# <p id="test_id" align="center">This is test paragraph <b>test</b>. </p>

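Get only the p tag with align="center" (text only). Note that .string returns None here, because this p tag has more than one child (text, a b tag, then more text). get_text(), available in bs4, joins all of the text instead.

soup.find('p', align="center").get_text()
# This is test paragraph test.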
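
The difference between .string and get_text() can be checked with a small sketch. It assumes bs4 on Python 3, and the one-line html string is an assumed reconstruction of the first paragraph of the sample.

```python
from bs4 import BeautifulSoup

# Assumed markup for the first paragraph of the sample document.
html = '<p id="test_id" align="center">This is test paragraph <b>test</b>.</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p", align="center")

# .string only works when a tag has exactly one child;
# this <p> has three (text, <b>, text), so the result is None.
only_string = p.string

# get_text() concatenates every text node found under the tag.
full_text = p.get_text()

print(repr(only_string))
print(full_text)
```

If the markup tags themselves are wanted along with the text, str(p) returns the whole element as a string.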

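Get the id of the p tag with align="center". Calling soup(...) is shorthand for find_all and returns a list of every match, so note that the index [0] is what picks out only the first one.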

soup('p', align="center")[0]['id']
# u'test_id'
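
Indexing with [0] raises IndexError when nothing matches, while find returns None instead. The sketch below collects every id in one pass and looks one up defensively. It assumes bs4 on Python 3 and an assumed reconstruction of the sample markup. first_id is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Assumed markup for the two paragraphs of the sample document.
html = (
    '<p id="test_id" align="center">This is test paragraph <b>test</b>.</p>'
    '<p id="test2_id" align="blah">This is test2 paragraph <b>test2</b>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all returns every match, so all ids can be collected in one pass.
all_ids = [p["id"] for p in soup.find_all("p")]


def first_id(soup, align):
    """Hypothetical helper: id of the first <p> with this align, or None."""
    tag = soup.find("p", align=align)  # find returns None when nothing matches
    return tag.get("id") if tag is not None else None


center_id = first_id(soup, "center")
missing_id = first_id(soup, "right")  # no such paragraph in the sample
print(all_ids, center_id, missing_id)
```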

Get the text of the bold tag inside a p tag (the first one)

soup.find('p').b.string
# u'test'

Get the text of the bold tag inside a p tag (the second one).

Note that the index [1] is what picks out only the second match.

soup('p')[1].b.string
# u'test2'
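
A hard-coded position such as [1] breaks as soon as the page gains or loses a paragraph. As a hedged alternative, the sketch below loops over every p tag and also selects one by its id attribute. It assumes bs4 on Python 3 and an assumed reconstruction of the sample markup.

```python
from bs4 import BeautifulSoup

# Assumed markup for the two paragraphs of the sample document.
html = (
    '<p id="test_id" align="center">This is test paragraph <b>test</b>.</p>'
    '<p id="test2_id" align="blah">This is test2 paragraph <b>test2</b>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Collect the bold text of every paragraph, skipping any <p> without a <b>.
bold_texts = [p.b.get_text() for p in soup.find_all("p") if p.b is not None]

# Select one paragraph by its id rather than by its position in the page.
second_bold = soup.find("p", id="test2_id").b.string

print(bold_texts, second_bold)
```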

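Closing thoughts

Depending on the target XML or HTML, there will be cases where you have to fall back on ad hoc, document-specific rules like the one in the last example.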