Parsing HTML with BeautifulSoup

(1) Basic link extraction
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

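from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
with open("index.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# Page title
print(soup.title.string)

# Print the href of every <a> tag that has one (href=True skips anchors without it)
for a in soup.find_all("a", href=True):
    print(a["href"])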
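
The href values printed above are often relative paths. A minimal sketch of turning them into absolute URLs with the standard library's urljoin; the inline HTML string and the base URL are made up for illustration and stand in for index.html:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup and base URL, used only for illustration.
HTML = """
<html><head><title>Sample</title></head>
<body>
  <a href="/about.html">About</a>
  <a href="http://example.com/x">External</a>
  <a name="top">No href here</a>
</body></html>
"""
BASE_URL = "http://example.org/docs/index.html"


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that carries an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


links = extract_links(HTML, BASE_URL)
print(links)
```

The anchor without an href is skipped, and the already absolute URL passes through urljoin unchanged.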


(2) Extracting a more deeply nested structure

 <div class="xxx"><h2>xxxxx<span></span></h2></div>

Extract the contents of the element above.

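# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

with open("index.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# Find the first <div class="xxx">, then print the <span> text of each <h2> inside it
v = soup.find("div", class_="xxx")
for i in v.find_all("h2"):
    print(i.span.text)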
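
find() returns None when nothing matches, and an <h2> without a <span> makes i.span None, so the loop above raises AttributeError on pages that deviate from the expected layout. A minimal sketch of the same extraction with those cases guarded; the sample markup and the helper name span_texts are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the class name "xxx" follows the snippet above.
HTML = """
<div class="xxx">
  <h2>Item A<span>1.0万円</span></h2>
  <h2>Item B<span>5000円</span></h2>
  <h2>Item C</h2>
</div>
"""


def span_texts(html, css_class):
    """Collect the <span> text of each <h2> under the first matching <div>."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None:  # find() returns None when nothing matches
        return []
    texts = []
    for h2 in div.find_all("h2"):
        if h2.span is not None:  # skip headings that have no <span>
            texts.append(h2.span.text)
    return texts


result = span_texts(HTML, "xxx")
print(result)
```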
  • Convert amounts such as 1.0万円 and 5000円 to numbers (万 = 10,000, 千 = 1,000, 円 = yen)
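# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

with open("index.html") as f:
    soup = BeautifulSoup(f, "html.parser")

v = soup.find("div", class_="xxx")
for i in v.find_all("h2"):
    str1 = i.span.text
    # print(str1)
    # Drop 円, turn 万 into *10000 and 千 into *1000, then evaluate the arithmetic.
    # Caution: eval() runs whatever text the page contains; use it only on trusted input.
    print(int(eval(str1.replace(u"円", "").replace(u"万", "*10000").replace(u"千", "*1000"))))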
  • With a list comprehension, the string-processing part looks like this:
v=soup.find("div",class_="xxx").find_all("h2")
v2=[int(eval(i.span.text.replace(u"円","").replace(u"万","*10000").replace(u"千","*1000"))) for i in v]
print(v2)
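
The eval-based conversion above is compact, but it executes scraped text and mis-handles combined amounts such as 1万5千円 (the replacements produce "1*100005*1000"). As an alternative, here is a rough sketch of a regex-based converter that avoids eval. It is not the method used above; the name yen_to_int and the rule that number-plus-unit chunks are summed are assumptions made for illustration:

```python
import re

# Multipliers for the two kanji units handled in the snippets above.
UNITS = {"万": 10000, "千": 1000}

# One "number + optional unit" chunk, e.g. "1.0万", "5千", "5000".
TOKEN = re.compile(r"(\d+(?:\.\d+)?)([万千]?)")


def yen_to_int(text):
    """Convert strings such as "1.0万円" or "5000円" to an integer amount.

    Assumption: chunks add up, so "1万5千円" means 10000 + 5000.
    """
    body = text.strip()
    if body.endswith("円"):
        body = body[:-1]
    total = 0.0
    pos = 0
    for m in TOKEN.finditer(body):
        if m.start() != pos:  # unexpected characters between chunks
            raise ValueError("cannot parse amount: %r" % text)
        total += float(m.group(1)) * UNITS.get(m.group(2), 1)
        pos = m.end()
    if pos == 0 or pos != len(body):  # nothing parsed, or trailing junk
        raise ValueError("cannot parse amount: %r" % text)
    # round() guards against float products that land just below an integer
    return int(round(total))


amounts = [yen_to_int(s) for s in ["1.0万円", "5000円", "1万5千円"]]
print(amounts)
```

Text that does not look like an amount raises ValueError and is never executed.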