Four Common Ways to Locate Elements in Python Scrapers Compared: Which Do You Prefer?

When collecting data with a Python scraper, a key task is extracting data from the fetched page, and correctly locating the data you want is the first step.

This article compares several commonly used ways to locate page elements in Python scrapers:

Traditional BeautifulSoup operations
CSS selectors based on BeautifulSoup (similar to PyQuery)
XPath
Regular expressions

The reference page is the Dangdang overall book bestseller list:

http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1

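We take fetching the titles of the 20 books on the first page as the example. First, confirm that the site has no anti-scraping measures in place and returns the content to be parsed directly:

    import requests

    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
    response = requests.get(url).text
    print(response)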

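On close inspection, all the data we need is present in the returned content, so no special anti-scraping handling is required.

Inspecting the page elements shows that each book entry sits in an li, under the ul whose class is bang_list clearfix bang_list_mode.

Further inspection also shows where the book title sits within each entry. This is the foundation for every parsing method that follows.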
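The manual check described above can also be scripted. The following is a minimal sketch, assuming the class names mentioned in this article are reliable markers of the data; the `missing_markers` helper and the `sample` string are hypothetical names introduced for illustration, with `sample` standing in for the text returned by requests.

```python
def missing_markers(page_text, markers):
    """Return the markers that do not appear in the fetched page text."""
    return [m for m in markers if m not in page_text]


# Hypothetical stand-in for requests.get(url).text
sample = (
    '<ul class="bang_list clearfix bang_list_mode">'
    '<li><div class="name"></div></li></ul>'
)

# An empty list suggests the data is present in the raw HTML
# rather than filled in later by JavaScript.
print(missing_markers(sample, ['bang_list_mode', 'class="name"']))
```

If the list is not empty, the page may be rendered client-side or the request may have been blocked, and the methods below would have nothing to parse.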

1. Traditional BeautifulSoup operations

The classic BeautifulSoup approach starts with from bs4 import BeautifulSoup, converts the text into a structured tree with soup = BeautifulSoup(html, "lxml"), and then parses it with the find family of methods. The code is as follows:

    import requests
    from bs4 import BeautifulSoup

    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
    response = requests.get(url).text


    def bs_for_parse(response):
        soup = BeautifulSoup(response, "lxml")
        # Lock onto the ul, then collect its 20 li elements
        li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li')
        for li in li_list:
            # Parse each entry to get the book title
            title = li.find('div', class_='name').find('a')['title']
            print(title)


    if __name__ == '__main__':
        bs_for_parse(response)

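This successfully retrieves the 20 book titles. Some titles are rather long; they can be trimmed with regular expressions or other string methods, which this article does not cover in detail.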
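As one possible way to trim such titles, the sketch below keeps only the text before the first opening bracket. The splitting rule, the `shorten_title` name, and the sample strings are all assumptions made for illustration; real titles may well need a different rule.

```python
import re


def shorten_title(title):
    """Keep the part of a title that precedes the first opening bracket."""
    cleaned = title.strip()
    # Split once at the first ASCII or full-width opening bracket.
    head = re.split(r'[(（\[【]', cleaned, maxsplit=1)[0].strip()
    # Fall back to the full title if the cut would leave nothing.
    return head or cleaned


print(shorten_title('Example Book (2020 revised edition, hardcover)'))
```

The fallback matters for titles that begin with a bracketed tag, where cutting at the first bracket would otherwise return an empty string.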

2. CSS selectors based on BeautifulSoup

This approach is essentially PyQuery's CSS selectors carried over to another module, and the usage is similar. For the detailed CSS selector syntax, see: http://www.w3school.com.cn/cssref/css_selectors.asp

Because it is built on BeautifulSoup, the imports and the text-to-tree conversion are the same:

    import requests
    from bs4 import BeautifulSoup

    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
    response = requests.get(url).text


    def css_for_parse(response):
        soup = BeautifulSoup(response, "lxml")
        print(soup)


    if __name__ == '__main__':
        css_for_parse(response)

Next, use soup.select with the appropriate CSS syntax to pull out specific content. As before, this rests on a careful inspection of the page elements:

    import requests
    from bs4 import BeautifulSoup

    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
    response = requests.get(url).text


    def css_for_parse(response):
        soup = BeautifulSoup(response, "lxml")
        # Select the li elements that are direct children of the target ul
        li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')
        for li in li_list:
            # select returns a list, so take the first matching link
            title = li.select('div.name > a')[0]['title']
            print(title)


    if __name__ == '__main__':
        css_for_parse(response)
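One practical difference between the two BeautifulSoup styles is how a multi-valued class attribute is matched. The sketch below uses hypothetical sample markup that mimics the structure described in this article, and the built-in html.parser so that it does not depend on lxml. It illustrates that find with a multi-class string compares against the exact attribute value, while a CSS selector checks each class independently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure described in this article.
sample = '''
<ul class="bang_list clearfix bang_list_mode">
  <li><div class="name"><a href="#" title="Book A">Book A</a></div></li>
  <li><div class="name"><a href="#" title="Book B">Book B</a></div></li>
</ul>
'''

soup = BeautifulSoup(sample, 'html.parser')

# A multi-class string passed to class_ must equal the attribute value exactly,
# so the same classes in a different order do not match.
exact = soup.find('ul', class_='bang_list clearfix bang_list_mode')
reordered = soup.find('ul', class_='bang_list_mode bang_list clearfix')

# A CSS selector treats each class independently, so order is irrelevant.
titles = [a['title'] for a in soup.select('ul.bang_list_mode.bang_list > li div.name > a')]

print(exact is not None, reordered is None, titles)
```

If a site reorders or adds classes on the ul, the selector form is therefore the less brittle of the two.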

3. XPath

XPath, the XML Path Language, is a language for addressing parts of an XML document. If you use Chrome, installing the XPath Helper extension is recommended, as it makes writing XPath much faster.

Earlier scraping articles were mostly based on XPath, so readers should be fairly familiar with it. The code is given directly:

    import requests
    from lxml import html

    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
    response = requests.get(url).text


    def xpath_for_parse(response):
        selector = html.fromstring(response)
        # Select every li directly under the target ul
        books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
        for book in books:
            # Relative path from each li to the title attribute of its link
            title = book.xpath('div[@class="name"]/a/@title')[0]
            print(title)


    if __name__ == '__main__':
        xpath_for_parse(response)
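To experiment with XPath without hitting the live site, the expressions can be tried on a small offline snippet. This is a sketch under stated assumptions: the `sample` markup is hypothetical and only mimics the structure described in this article. It also shows a looser variant using contains(), which matches on a single class token instead of the full class string.

```python
from lxml import html

# Hypothetical markup mimicking the structure described in this article.
sample = '''
<ul class="bang_list clearfix bang_list_mode">
  <li><div class="name"><a href="#" title="Book A">Book A</a></div></li>
  <li><div class="name"><a href="#" title="Book B">Book B</a></div></li>
</ul>
'''

tree = html.fromstring(sample)

# One expression can return every title attribute at once.
titles = tree.xpath(
    '//ul[@class="bang_list clearfix bang_list_mode"]/li/div[@class="name"]/a/@title'
)

# contains() tolerates extra or reordered classes on the ul.
loose = tree.xpath('//ul[contains(@class, "bang_list_mode")]//div[@class="name"]/a/@title')

print(titles)
print(loose)
```

Note that contains() is a substring test, so it would also match a class such as bang_list_mode_old; whether that matters depends on the page.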

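4. Regular expressions

If you are not familiar with HTML, the previous parsing methods can be hard going. Here is a catch-all alternative: regular expressions. You only need to notice what distinctive patterns surround the text itself, then write a rule that extracts the matching content. The module required is re.

First, look again at the raw response to see what is distinctive about the text around the data we need:

    import requests
    import re

    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
    response = requests.get(url).text
    print(response)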

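Look at a few entries and the answer should be apparent:

The book title is hidden in the string above, and the number at the end of the embedded URL changes from book to book.

With that analysis done, the regular expression can be written:

    import requests
    import re

    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
    response = requests.get(url).text


    def re_for_parse(response):
        # Assumed pattern: the expression itself is not preserved in this text, so
        # this one captures the title attribute of the link in each <div class="name">.
        reg = r'<div class="name">\s*<a\s[^>]*?title="(.*?)"'
        for title in re.findall(reg, response):
            print(title)


    if __name__ == '__main__':
        re_for_parse(response)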

As you can see, the regex version is the shortest to write, but it demands real fluency with regex rules. Regex for the win!
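As an illustration of the kind of fluency involved, the sketch below contrasts a lazy and a greedy capture group. The `sample` markup and its example.com URLs are hypothetical and do not reproduce the real page; `\d+` stands for the number that changes from book to book.

```python
import re

# Hypothetical markup; the real page's attributes may differ.
sample = (
    '<div class="name"><a href="http://example.com/1001.html" title="Book A">Book A</a></div>'
    '<div class="name"><a href="http://example.com/1002.html" title="Book B">Book B</a></div>'
)

# \d+ absorbs the per-book number; (.*?) lazily stops at the first closing quote.
lazy = re.compile(r'<a href="http://example\.com/\d+\.html" title="(.*?)"')
# A greedy group runs on to the last closing quote it can find.
greedy = re.compile(r'<a href="http://example\.com/\d+\.html" title="(.*)"')

print(lazy.findall(sample))
print(len(greedy.findall(sample)))
```

The lazy pattern yields one clean title per book, while the greedy one swallows everything between the first and last title into a single match.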

Of course, each method has scenarios it suits, and in practice we still need to analyze the page structure to decide how to locate elements efficiently. Finally, here is the complete code for the four methods covered in this article; run it yourself to get a better feel for them.

    import requests
    from bs4 import BeautifulSoup
    from lxml import html
    import re

    url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
    response = requests.get(url).text


    def bs_for_parse(response):
        soup = BeautifulSoup(response, "lxml")
        li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li')
        for li in li_list:
            title = li.find('div', class_='name').find('a')['title']
            print(title)


    def css_for_parse(response):
        soup = BeautifulSoup(response, "lxml")
        li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')
        for li in li_list:
            title = li.select('div.name > a')[0]['title']
            print(title)


    def xpath_for_parse(response):
        selector = html.fromstring(response)
        books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
        for book in books:
            title = book.xpath('div[@class="name"]/a/@title')[0]
            print(title)


    def re_for_parse(response):
        # Assumed pattern: the expression itself is not preserved in this text, so
        # this one captures the title attribute of the link in each <div class="name">.
        reg = r'<div class="name">\s*<a\s[^>]*?title="(.*?)"'
        for title in re.findall(reg, response):
            print(title)


    if __name__ == '__main__':
        # bs_for_parse(response)
        # css_for_parse(response)
        # xpath_for_parse(response)
        re_for_parse(response)
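To compare the four approaches without depending on the live site, one option is to run them against the same offline snippet and confirm they agree. The following is a sketch under stated assumptions: `SAMPLE` is hypothetical markup, the `by_*` functions are illustrative variants that return lists instead of printing, and the regex is an assumed pattern written for this sample only.

```python
import re

from bs4 import BeautifulSoup
from lxml import html

# Hypothetical markup mimicking the structure described in this article.
SAMPLE = '''
<ul class="bang_list clearfix bang_list_mode">
  <li><div class="name"><a href="#1" title="Book A">Book A</a></div></li>
  <li><div class="name"><a href="#2" title="Book B">Book B</a></div></li>
</ul>
'''


def by_find(text):
    """Titles via the BeautifulSoup find family."""
    soup = BeautifulSoup(text, 'lxml')
    ul = soup.find('ul', class_='bang_list clearfix bang_list_mode')
    return [li.find('div', class_='name').find('a')['title'] for li in ul.find_all('li')]


def by_css(text):
    """Titles via BeautifulSoup CSS selectors."""
    soup = BeautifulSoup(text, 'lxml')
    return [a['title'] for a in soup.select('ul.bang_list_mode > li div.name > a')]


def by_xpath(text):
    """Titles via lxml XPath."""
    tree = html.fromstring(text)
    return tree.xpath(
        '//ul[@class="bang_list clearfix bang_list_mode"]/li/div[@class="name"]/a/@title'
    )


def by_regex(text):
    """Titles via a regular expression (pattern assumed for this sample)."""
    return re.findall(r'<div class="name"><a href="[^"]*" title="(.*?)"', text)


results = {f.__name__: list(f(SAMPLE)) for f in (by_find, by_css, by_xpath, by_regex)}
print(results)
```

Returning lists rather than printing makes the functions easy to compare or time against each other on saved copies of a page.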
