BeautifulSoup - 阿倫的秘密基地

BeautifulSoup is a powerful parsing module for pulling the content you want out of a web page.

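Commonly used parsers:

BeautifulSoup(html1.text, "html.parser")
- Built into Python; moderate speed, tolerant of malformed HTML
BeautifulSoup(html1.text, "lxml")
- Fast and tolerant of malformed HTML; the lxml package has to be installed separately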
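
Since lxml is a separate install, one option is to fall back to the built-in parser when it is missing. The following is only a minimal sketch: make_soup is a hypothetical helper name, not part of BeautifulSoup, and the markup is made up to show that an unclosed tag still parses.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


# Made-up markup with an unclosed <p>; both parsers tolerate it
soup = make_soup("<p>Unclosed paragraph")
print(soup.find("p").text)
```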

The text attribute returns the content with all HTML tags stripped out.
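
As a quick illustration of tag stripping, here is a small sketch on a made-up HTML string (not taken from any real site). It also shows get_text(), the method form of the same idea, which accepts a separator and can trim whitespace.

```python
from bs4 import BeautifulSoup

# Made-up HTML, inlined so the example runs without network access
html = "<div><h1>Title</h1><p>Hello <b>world</b>!</p></div>"
soup = BeautifulSoup(html, "html.parser")

# .text joins every text node with nothing in between
plain = soup.text
print(plain)

# get_text() can put a separator between text nodes and strip each one
spaced = soup.get_text(separator=" ", strip=True)
print(spaced)
```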

■ find(), find_all()

find(): returns the first matching tag as a Tag object, or None if nothing matches

find_all(): returns all matching tags as a list

import requests
from bs4 import BeautifulSoup
url1 = "https://mimigd.com"
html1 = requests.get(url1)
html1.encoding = "utf-8"
# Parse with the lxml parser
sp = BeautifulSoup(html1.text, "lxml")
# Print the <title> tag itself (<title>阿倫的秘密基地 - 學習記錄與教學</title>)
print(sp.title)
# Print only the text, with the HTML tags stripped (阿倫的秘密基地 - 學習記錄與教學)
print(sp.title.text)
# Find the first <p> tag only
print(sp.find("p"))
# Find every <p> tag; the result is a list
print(sp.find_all("p"))
# Index into the list to pick one match (raises IndexError if the page has fewer than six <p> tags)
print(sp.find_all("p")[5])
# class is a reserved word in Python, so the keyword argument is class_ with a trailing underscore
print(sp.find_all("img", class_="attachment-medium_large size-medium_large wp-post-image"))
# The same filter can also be written as a dictionary
print(sp.find_all("img", {"class": "attachment-medium_large size-medium_large wp-post-image"}))
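
The return types are easy to mix up, so here is a small offline sketch of them. The HTML snippet and its class names are made up for illustration; the point is that find() gives back a Tag or None, while find_all() always gives back a list, which is empty when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML, inlined so the example runs without network access
html = """
<html><body>
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
  <img class="thumb wp-post-image" src="/a.png">
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a Tag object, not a string
first = soup.find("p")
print(type(first).__name__, first.text)

# find() returns None when nothing matches, so check before using the result
missing = soup.find("table")
if missing is None:
    print("no <table> on this page")

# find_all() returns a list of Tag objects (empty if nothing matches)
paragraphs = soup.find_all("p")
print([p.text for p in paragraphs])

# Passing a single class name to class_ matches any element that has that class
thumbs = soup.find_all("img", class_="wp-post-image")
print(thumbs[0]["src"])
```

Checking for None before touching .text avoids the AttributeError you would otherwise get on pages where the tag is absent.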

■ select()

select(): finds the elements that match a CSS selector, such as an id or a class, and returns them as a list, even when only one element matches

import requests
from bs4 import BeautifulSoup
url1 = "https://mimigd.com"
html1 = requests.get(url1)
html1.encoding = "utf-8"
# Parse with the lxml parser
sp = BeautifulSoup(html1.text, "lxml")
# Select by tag name
print(sp.select("title"))
# An id selector needs a leading #. Ids should be unique within a page, so this is the most precise selector; here it selects the element with id="post-255"
print(sp.select("#post-255"))
# A class selector needs a leading "."; here it selects elements with class="title"
print(sp.select(".title"))
# Tags, ids, and classes can be chained to walk down through nested levels
print(sp.select("html body div main div section div article a img"))
# Use get(attribute) to read the value of an attribute
print(sp.select("img")[1].get("src"))
print(sp.select("a")[1].get("href"))
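
The same selectors can be tried offline. This is a sketch on made-up HTML (the ids, classes, and paths below are invented for the example). It also uses select_one(), which returns only the first match, or None, instead of a list.

```python
from bs4 import BeautifulSoup

# Made-up HTML, inlined so the example runs without network access
html = """
<main>
  <article id="post-255">
    <h2 class="title">Hello</h2>
    <a href="/post/255"><img src="/img/255.png"></a>
  </article>
  <article id="post-256">
    <h2 class="title">World</h2>
  </article>
</main>
"""
soup = BeautifulSoup(html, "html.parser")

# Class selector: every element with class="title", always as a list
titles = [h.text for h in soup.select(".title")]
print(titles)

# Id selector with select_one(): a single Tag instead of a one-item list
post = soup.select_one("#post-255")
print(post.get("id"))

# Descendant selector: walk down through nested tags
imgs = soup.select("main article a img")
src = imgs[0].get("src")
print(src)

# select_one() returns None when nothing matches
link = soup.select_one("#post-256 a")
print(link)

# get() returns None for an attribute the tag does not have
alt = imgs[0].get("alt")
print(alt)
```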
