Python Scrapy Framework Principles Explained

Date: 2022-06-30 14:19:23

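Writing a web scraper in Python involves two important skills: regular expressions and the Scrapy framework. Regular expressions work in much the same way across programming languages, and plenty of learning resources for them are available online.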
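As a small illustration of the regular-expression side, the sketch below pulls one value out of an HTML fragment with Python's standard `re` module. The HTML fragment and the population figure are hypothetical; only the row id and cell class are borrowed from the spider shown later in this article.

```python
import re

# Hypothetical HTML fragment; the row id and cell class mirror those
# used by the spider later in this article.
html = (
    '<tr id="places_population__row">'
    '<td class="w2p_fl">Population: </td>'
    '<td class="w2p_fw">1,330,044,000</td>'
    '</tr>'
)

# Capture the text of the w2p_fw cell inside the population row.
pattern = re.compile(
    r'<tr id="places_population__row">.*?<td class="w2p_fw">(.*?)</td>',
    re.DOTALL,
)

match = pattern.search(html)
population = match.group(1) if match else None
print(population)
```

A pattern like this is tied to the exact markup, so it tends to break when the page layout changes; that fragility is one reason to prefer the CSS and XPath selectors that Scrapy provides.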

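The hand-drawn diagram below shows how the Scrapy framework works and is meant to aid understanding.

[Figure: hand-drawn diagram of the Scrapy framework]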

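Below is a spider created with Scrapy. It is generated from the built-in crawl template so that it can take advantage of the CrawlSpider class in the Scrapy library. Compared with a simple scraping spider, CrawlSpider provides some special attributes and methods that are useful when crawling a website: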

    $ scrapy genspider country_or_district example.python-scraping.com --template=crawl

After the genspider command has run, the following code is generated automatically in example/spiders/country_or_district.py.

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from example.items import CountryOrDistrictItem


    class CountryOrDistrictSpider(CrawlSpider):
        name = 'country_or_district'
        allowed_domains = ['example.python-scraping.com']
        start_urls = ['http://example.python-scraping.com/']

        rules = (
            # Follow index pages without scraping them.
            Rule(LinkExtractor(allow=r'/index/', deny=r'/user/'),
                 follow=True),
            # Hand detail pages to parse_item.
            Rule(LinkExtractor(allow=r'/view/', deny=r'/user/'),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            item = CountryOrDistrictItem()
            # Select the name cell with a CSS selector.
            name_css = 'tr#places_country_or_district__row td.w2p_fw::text'
            item['name'] = response.css(name_css).extract()
            # Select the population cell with an XPath expression.
            pop_xpath = '//tr[@id="places_population__row"]/td[@class="w2p_fw"]/text()'
            item['population'] = response.xpath(pop_xpath).extract()
            return item
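To make it easier to see what the two selectors in parse_item pick out, here is a minimal sketch that runs equivalent queries against a small sample table using only the standard library. This is a stand-in, not what Scrapy does internally: Scrapy's selectors are built on parsel and lxml, whereas `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup. The sample table, its values, and the `cell_texts` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of the table the spider targets.
sample = """
<table>
  <tr id="places_country_or_district__row">
    <td class="w2p_fl">Country or District: </td>
    <td class="w2p_fw">Example Land</td>
  </tr>
  <tr id="places_population__row">
    <td class="w2p_fl">Population: </td>
    <td class="w2p_fw">12,345</td>
  </tr>
</table>
"""

root = ET.fromstring(sample)


def cell_texts(row_id):
    """Return the text of every w2p_fw cell in the row with the given id."""
    path = f".//tr[@id='{row_id}']/td[@class='w2p_fw']"
    return [td.text for td in root.findall(path)]


# Lists are used because extract() in the spider also returns a list of strings.
item = {
    "name": cell_texts("places_country_or_district__row"),
    "population": cell_texts("places_population__row"),
}
print(item)
```

Each field holds a list rather than a single string, which matches the shape that `extract()` gives the item fields in the spider.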

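The spider class includes the following attributes:

name: a string that identifies the spider.

allowed_domains: a list of the domains that may be crawled. If this attribute is not set, any domain may be crawled.

start_urls: a list of the URLs at which the spider starts crawling.

rules: a tuple of Rule objects, defined with regular expressions, that tell the spider which links to follow and which links lead to pages containing useful content to scrape.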
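The way allowed_domains and rules interact can be sketched in plain Python. The code below is a simplified model, not Scrapy's implementation: it assumes a link is accepted by a rule when the URL matches the allow pattern and does not match the deny pattern, and that the first matching rule wins. The `classify` helper, the action labels, and the sample URLs are hypothetical.

```python
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["example.python-scraping.com"]

# (allow, deny, action) triples mirroring the two Rule objects in the spider.
RULES = [
    (re.compile(r"/index/"), re.compile(r"/user/"), "follow"),
    (re.compile(r"/view/"), re.compile(r"/user/"), "parse_item"),
]


def classify(url):
    """Return the action of the first rule that accepts url, or None."""
    host = urlparse(url).hostname or ""
    in_domain = any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
    if not in_domain:
        return None  # outside allowed_domains
    for allow, deny, action in RULES:
        if allow.search(url) and not deny.search(url):
            return action
    return None


# Hypothetical URLs used only to exercise the sketch.
for url in [
    "http://example.python-scraping.com/index/1",
    "http://example.python-scraping.com/view/Some-Place-1",
    "http://example.python-scraping.com/user/login?_next=/index/1",
    "http://other.example.org/view/1",
]:
    print(url, "->", classify(url))
```

In this model the login URL is rejected even though it contains /index/, because the deny pattern takes precedence over allow. Note also that in the spider the second rule sets a callback without follow, so links found on the detail pages are not followed further.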
