
python - Problem scraping the CSDN back-end article list after a simulated crawler login

Views: 135 | Date: 2022-07-17 10:39:13

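Problem description

The crawler has definitely logged in, because I can scrape my personal information, but the following URL cannot be scraped:

The URL is http://write.blog.csdn.net/postlist, that is, your CSDN back end.

Here is my code, written for Python 2.7: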

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup
    import requests


    class CSDN(object):
        def __init__(self, headers):
            self.session = requests.Session()
            self.headers = headers

        def get_webflow(self):
            # Fetch the login form and read its hidden 'lt' and 'execution' fields.
            url = 'http://passport.csdn.net/account/login'
            response = self.session.get(url=url, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            lt = soup.find('input', {'name': 'lt'})['value']
            execution = soup.find('input', {'name': 'execution'})['value']
            soup.clear()
            return (lt, execution)

        def login(self, account, password):
            self.username = account
            self.password = password
            lt, execution = self.get_webflow()
            data = {
                'username': account,
                'password': password,
                'lt': lt,
                'execution': execution,
                '_eventId': 'submit'
            }
            url = 'http://passport.csdn.net/account/login'
            response = self.session.post(url=url, headers=self.headers, data=data)
            if response.status_code == 200:
                print('正常')  # "OK"
            else:
                print('異常')  # "Error"

        def func(self):
            headers1 = {
                'Host': 'write.blog.csdn.net',
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'
            }
            # Redirects are explicitly disabled for this request.
            response = self.session.get(url='http://write.blog.csdn.net/postlist',
                                        headers=headers1, allow_redirects=False)
            print(response.text)


    if __name__ == '__main__':
        headers = {
            'Host': 'passport.csdn.net',
            'Origin': 'http://passport.csdn.net',
            'Referer': 'http://passport.csdn.net/account/login',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36',
        }
        csdn = CSDN(headers=headers)
        account = ''
        password = ''
        csdn.login(account=account, password=password)
        csdn.func()

The code above outputs:

    正常
    <html><head><title>Object moved</title></head><body><h2>Object moved to <a >here</a>.</h2></body></html>

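Answers

Answer 1:

This URL returns a 302 redirect. You need to read the Location header of the response, request that address, then analyze what comes back and continue from there. A browser follows these 302 redirects and executes any returned JavaScript for you; when you scrape by hand, you have to handle them yourself.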
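A minimal sketch of following the redirects by hand, as this answer describes. The helper name `follow_redirects`, the `max_hops` limit, and the choice to drop the hard-coded `Host` header are assumptions for illustration, not part of the original code. `session` is expected to behave like a `requests.Session`. Note that the question's `func` passes `allow_redirects=False` explicitly; `requests` follows redirects on GET by default, so removing that argument may already be enough, while a manual loop like this one lets you inspect each hop.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7, as in the question

REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow_redirects(session, url, headers=None, max_hops=10):
    """Request `url` and follow Location headers by hand, one hop at a time.

    Hypothetical helper: `session` should behave like requests.Session.
    Returns (final_response, list_of_urls_requested).
    """
    # A fixed Host header would be wrong as soon as a redirect points at
    # another host, so it is dropped here (an assumption of this sketch).
    headers = dict((k, v) for k, v in (headers or {}).items()
                   if k.lower() != 'host')
    visited = []
    for _ in range(max_hops):
        visited.append(url)
        response = session.get(url, headers=headers, allow_redirects=False)
        location = response.headers.get('Location')
        if response.status_code not in REDIRECT_CODES or not location:
            return response, visited
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError('gave up after %d redirects' % max_hops)
```

Inside the question's class, `follow_redirects(self.session, 'http://write.blog.csdn.net/postlist', headers1)` could stand in for the single `self.session.get(...)` call in `func`. If a hop leads back to the login page, the session is probably not authenticated for that host. If the page relies on JavaScript to continue, that part still has to be handled separately.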

Answer 2:

Just use the cookie directly.
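One way to read this answer: copy the `Cookie` request header from a logged-in browser session and send it with the request, skipping the scripted login entirely. The sketch below assumes that interpretation. The helper names `parse_cookie_header` and `fetch_with_cookie` are hypothetical, and the cookie names in the usage note are made up; use whatever the browser actually sends. `session` is expected to behave like a `requests.Session`, whose `get` accepts a `cookies` dict.

```python
def parse_cookie_header(raw):
    """Turn a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        # Split on the first '=' only; cookie values may contain '='.
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


def fetch_with_cookie(session, url, raw_cookie, headers=None):
    """GET `url`, sending the cookies copied from the browser."""
    return session.get(url, headers=headers,
                       cookies=parse_cookie_header(raw_cookie))
```

For example, `fetch_with_cookie(requests.Session(), 'http://write.blog.csdn.net/postlist', 'UserName=alice; UserToken=abc')` would send both (made-up) cookies. Copied cookies expire, so this suits one-off scraping better than a long-running job.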
