(1) Basic link extraction
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
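from bs4 import BeautifulSoup

# Parse the local file; naming the parser avoids bs4's "no parser specified" warning
with open("index.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# title
print(soup.title.string)

# Print the href of every <a> tag; href=True skips anchors that have none
for i in soup.find_all("a", href=True):
    print(i["href"])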
(2) Extracting a more deeply nested structure
Example: extract the contents of
<div class="xxx"><h2>xxxxx<span></span></h2></div>
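# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

with open("index.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# Find the first <div class="xxx">, then print the <span> text of each <h2> inside it
v = soup.find("div", class_="xxx")
for i in v.find_all("h2"):
    print(i.span.text)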
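The same nested lookup can also be sketched with a CSS selector. This is only one possible variant, not a replacement for the code above: the sample markup and the prices in it are made up for illustration, and the selector assumes the div/h2/span nesting shown in the example. One practical difference is that select() returns an empty list when nothing matches, whereas find() returns None and i.span is None for an h2 without a span.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the example above.
html = """
<div class="xxx">
  <h2>first<span>1.0万円</span></h2>
  <h2>second<span>5000円</span></h2>
  <h2>no span here</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.xxx h2 span" matches every <span> inside an <h2> inside <div class="xxx">.
# An <h2> without a <span> is simply not matched, so no None check is needed.
texts = [span.get_text(strip=True) for span in soup.select("div.xxx h2 span")]
print(texts)
```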
- Converting strings such as 1.0万円 and 5000円 to numbers
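# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

with open("index.html") as f:
    soup = BeautifulSoup(f, "html.parser")

v = soup.find("div", class_="xxx")
for i in v.find_all("h2"):
    str1 = i.span.text
    # print(str1)
    # Turn "1.0万円" into the expression "1.0*10000" and evaluate it.
    # Note: eval() executes arbitrary code, so use this only on pages you trust.
    print(int(eval(str1.replace(u"円", "").replace(u"万", "*10000").replace(u"千", "*1000"))))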
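Since the replace-and-eval trick runs whatever text the page contains, and a combined value such as 1万5千円 would turn into the expression 1*100005*1000, here is a rough sketch of the same conversion without eval. The function name parse_yen and the UNITS table are made up for this note, and it assumes only the units 万 and 千 occur. It also rounds rather than truncates, which differs slightly from int() above.

```python
import re

# Multipliers for the unit characters handled here
# (assumption: only 万 and 千 appear, as in the examples above).
UNITS = {"万": 10000, "千": 1000}


def parse_yen(text):
    """Convert a price string such as '1.0万円' or '5000円' to an int."""
    cleaned = text.strip().replace(",", "").replace("円", "")
    # Each match is a number plus an optional unit, e.g. ('1.0', '万').
    matches = re.findall(r"(\d+(?:\.\d+)?)([万千]?)", cleaned)
    if not matches:
        raise ValueError("no number found in %r" % text)
    total = 0.0
    for number, unit in matches:
        total += float(number) * UNITS.get(unit, 1)
    return int(round(total))


for s in ["1.0万円", "5000円", "1万5千円"]:
    print(s, parse_yen(s))
```

In the loop above, parse_yen(i.span.text) would take the place of the int(eval(...)) expression.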
- With a list comprehension, the string-processing part looks like this:
v = soup.find("div", class_="xxx").find_all("h2")
v2 = [int(eval(i.span.text.replace(u"円", "").replace(u"万", "*10000").replace(u"千", "*1000"))) for i in v]
print(v2)