Approaches and Solutions for Recognizing CAPTCHAs with Python

Date: 2022-07-11 13:58:14

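1. Introduction

Web scrapers frequently run into CAPTCHAs. Today's CAPTCHAs mostly fall into four kinds: arithmetic CAPTCHAs, slider CAPTCHAs, image CAPTCHAs, and audio CAPTCHAs. This article deals with image CAPTCHAs, and only with simple ones. Getting a higher and more reliable recognition rate takes a lot of effort spent training your own font library.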

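Recognizing a CAPTCHA usually involves these steps:

(1) Grayscale conversion

(2) Binarization

(3) Border removal (if there is a border)

(4) Denoising

(5) Character segmentation or skew correction

(6) Training a font library

(7) Recognition

Of these seven steps, the first three are the basics; steps 4 and 5 can be applied or skipped depending on the CAPTCHA at hand.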
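The first two steps of the list above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: `to_gray` and `binarize` are hypothetical helper names, the image is assumed to be a list of rows of (r, g, b) tuples, and the default threshold of 135 is borrowed from the complete code later in this article.

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values.

    Uses the ITU-R 601 luma weights (299, 587, 114 per thousand), the same
    weights Pillow documents for converting an RGB image to mode 'L'.
    """
    return [
        [(r * 299 + g * 587 + b * 114) // 1000 for (r, g, b) in row]
        for row in rgb_rows
    ]


def binarize(gray_rows, threshold=135):
    """Map every pixel brighter than the threshold to 255, the rest to 0."""
    return [[255 if v > threshold else 0 for v in row] for row in gray_rows]


# A 1x3 image: white, mid-gray, black (assumed sample data)
sample = [[(255, 255, 255), (100, 100, 100), (0, 0, 0)]]
gray = to_gray(sample)
black_and_white = binarize(gray)
```

With Pillow, the same effect comes from `Image.convert('L')` followed by a per-pixel threshold, which is what the `clearNoise` method in the complete code does.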

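Commonly used libraries include pytesseract (OCR), OpenCV (advanced image processing), imagehash (image hashing), numpy (an open-source, high-performance numerical computing library for Python), and PIL's Image, ImageDraw, and ImageFile modules.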

2. Example

Take the login CAPTCHA of a certain website as an example. The actual procedure differs slightly from the steps listed above.

[Image: the sample CAPTCHA]

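Start with some analysis. The CAPTCHA consists of four digits, each drawn from the ten digits 0 to 9, and each digit can appear only in the first, second, third, or fourth position. That gives 10 × 4 = 40 digit-position combinations, as follows: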

[Image: the 40 digit template images, one per digit and position]

The next task is to denoise and split CAPTCHA images to obtain the digit images shown above. These 40 images serve as the template set.
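To make the count concrete, the following sketch enumerates the 40 templates and the digit each one stands for. The file-name pattern `code<digit>_<position>.png` is an assumption modeled on the names built in the complete code below, and `template_names` is a hypothetical helper.

```python
def template_names(digits=range(10), positions=range(1, 5)):
    """Map each template file name to the digit it represents.

    One template exists per (digit, position) pair, so ten digits in
    four positions give 40 entries.
    """
    return {
        "code%d_%d.png" % (d, p): str(d)
        for d in digits
        for p in positions
    }


templates = template_names()
```

Keeping one template per position, rather than one per digit, allows the same digit to look slightly different depending on where it is cropped from.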

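A CAPTCHA to be recognized is denoised and split in the same way, producing four digit images like the ones above. Comparing them with the templates reveals what the CAPTCHA says.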

Take the CAPTCHA 2837 above as an example:

1. Image denoising

[Image: the CAPTCHA after denoising]
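The complete code in this article denoises purely by thresholding. When stray dots survive the threshold, a common complementary technique is to drop black pixels that have too few black neighbours. The sketch below illustrates that idea and is not the author's method; `remove_isolated` and `min_neighbors` are hypothetical names, and the image is assumed to be a list of rows holding 0 (black) or 255 (white).

```python
def remove_isolated(binary_rows, min_neighbors=1):
    """Whiten black pixels that have fewer than min_neighbors black neighbours.

    The eight surrounding pixels are checked; pixels outside the image
    count as white. The input is left unmodified.
    """
    height, width = len(binary_rows), len(binary_rows[0])
    out = [row[:] for row in binary_rows]
    for y in range(height):
        for x in range(width):
            if binary_rows[y][x] != 0:
                continue
            black = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        if binary_rows[ny][nx] == 0:
                            black += 1
            if black < min_neighbors:
                out[y][x] = 255
    return out


noisy = [
    [0, 0, 255],
    [255, 255, 255],
    [255, 255, 0],  # the lone pixel in the corner is noise
]
cleaned = remove_isolated(noisy)
```

Raising `min_neighbors` removes more noise but also starts to erode thin strokes, so the value has to be tuned per CAPTCHA style.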

2. Image splitting

[Image: the four digit images after splitting]
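Splitting amounts to computing one crop box per digit. A minimal sketch, assuming the digits are evenly spaced: `column_boxes` is a hypothetical helper that returns boxes in the (left, upper, right, lower) order that Pillow's `Image.crop` expects.

```python
def column_boxes(width, height, parts=4):
    """Return one (left, upper, right, lower) box per equal-width column."""
    col = width // parts
    return [(i * col, 0, (i + 1) * col, height) for i in range(parts)]


# An 80x20 CAPTCHA (assumed size) yields four 20-pixel-wide boxes
boxes = column_boxes(80, 20)
```

Real CAPTCHAs are rarely spaced this evenly. The `splitimage` method in the complete code widens the first two columns by fixed offsets that were tuned by hand for one particular site.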

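3. Image comparison

Compare the hash of each of the four digit images obtained from denoising and splitting against the hashes of the 40 template images above. A tolerance, max_dif, sets the maximum allowed hash difference: the smaller the value, the stricter the match, and the minimum is 0.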


Each of the four digit images is thus matched to its corresponding digit, and the digits joined together give the CAPTCHA.
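The comparison can be illustrated without any third-party library. The sketch below is a simplified stand-in for the `imagehash.average_hash` call used in the complete code: it skips the downscaling that a real average hash performs first, so it assumes all images already share the same size. `average_hash`, `hash_distance`, and `match_digit` are hypothetical helpers operating on rows of grayscale values.

```python
def average_hash(gray_rows):
    """Return a tuple of bits: 1 where a pixel is brighter than the mean."""
    pixels = [v for row in gray_rows for v in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if v > mean else 0 for v in pixels)


def hash_distance(h1, h2):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def match_digit(candidate, templates, max_dif=0):
    """Return the digit of the first template within max_dif bits, else None.

    templates maps a template hash to the digit it represents.
    """
    for template_hash, digit in templates.items():
        if hash_distance(candidate, template_hash) <= max_dif:
            return digit
    return None


template = average_hash([[0, 255], [255, 0]])
same_shape = average_hash([[10, 250], [250, 10]])   # same pattern, other contrast
different = average_hash([[255, 0], [255, 0]])
library = {template: "7"}
```

With `max_dif=0` only an identical bit pattern matches. Raising it tolerates small rendering differences at the cost of more false matches between similar digits.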

The complete code is as follows:

# coding=utf-8
# Written against the Selenium 3 API; the find_element_by_* helpers used
# below were removed in Selenium 4.
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.action_chains import ActionChains
import collections
import mongoDbBase
import numpy
import imagehash
from PIL import Image, ImageFile
import datetime


class finalNews_IE:
    def __init__(self, strdate, logonUrl, firstUrl, keyword_list, exportPath, codepath, codedir):
        self.iniDriver()
        self.db = mongoDbBase.mongoDbBase()
        self.date = strdate
        self.firstUrl = firstUrl
        self.logonUrl = logonUrl
        self.keyword_list = keyword_list
        self.exportPath = exportPath
        self.codedir = codedir
        # Map the hash of each of the 40 template images to its digit
        self.hash_code_dict = {}
        for f in range(0, 10):
            for l in range(1, 5):
                file = os.path.join(codedir, 'codeLibrarycode' + str(f) + '_' + str(l) + '.png')
                # print(file)
                hash = self.get_ImageHash(file)
                self.hash_code_dict[hash] = str(f)

    def iniDriver(self):
        # Path to IEDriverServer.exe, obtained from the configuration file
        IEDriverServer = r'C:\Program Files\Internet Explorer\IEDriverServer.exe'
        os.environ['webdriver.ie.driver'] = IEDriverServer
        self.driver = webdriver.Ie(IEDriverServer)

    def WriteData(self, message, fileName):
        fileName = os.path.join(os.getcwd(), self.exportPath + '/' + fileName)
        with open(fileName, 'a') as f:
            f.write(message)

    # Get the hash of an image file
    def get_ImageHash(self, imagefile):
        hash = None
        if os.path.exists(imagefile):
            with open(imagefile, 'rb') as fp:
                hash = imagehash.average_hash(Image.open(fp))
        return hash

    # Point denoising: convert to grayscale, then binarize at threshold 135
    def clearNoise(self, imageFile, x=0, y=0):
        if os.path.exists(imageFile):
            image = Image.open(imageFile)
            image = image.convert('L')
            image = numpy.asarray(image)
            # Cast to uint8 so that Image.fromarray receives a supported dtype
            image = ((image > 135) * 255).astype(numpy.uint8)
            image = Image.fromarray(image).convert('RGB')
            # save_name = 'D:workpython36_crawlVeriycodemode_5590.png'
            # image.save(save_name)
            image.save(imageFile)
            return image

    # Split the CAPTCHA image
    # rownum: number of rows; colnum: number of columns;
    # imagePath: output directory; imageFile: the image file to split
    def splitimage(self, imagePath, imageFile, rownum=1, colnum=4):
        img = Image.open(imageFile)
        w, h = img.size
        file_list = []
        if rownum <= h and colnum <= w:
            print('Original image info: %sx%s, %s, %s' % (w, h, img.format, img.mode))
            print('Splitting the image, please wait...')
            s = os.path.split(imageFile)
            if imagePath == '':
                # Default to the directory of the source image
                imagePath = s[0]
            fn = s[1].split('.')
            basename = fn[0]
            ext = fn[-1]
            num = 1
            rowheight = h // rownum
            colwidth = w // colnum
            for r in range(rownum):
                index = 0
                for c in range(colnum):
                    # (left, upper, right, lower)
                    # box = (c * colwidth, r * rowheight, (c + 1) * colwidth, (r + 1) * rowheight)
                    # Column widths are hand-tuned for this particular CAPTCHA
                    if index < 1:
                        colwid = colwidth + 6
                    elif index < 2:
                        colwid = colwidth + 1
                    else:
                        colwid = colwidth
                    box = (c * colwid, r * rowheight, (c + 1) * colwid, (r + 1) * rowheight)
                    newfile = os.path.join(imagePath, basename + '_' + str(num) + '.' + ext)
                    file_list.append(newfile)
                    img.crop(box).save(newfile, ext)
                    num = num + 1
                    index += 1
        return file_list

    def compare_image_with_hash(self, image_hash1, image_hash2, max_dif=0):
        '''
        max_dif: maximum allowed hash difference; the smaller, the stricter.
        The minimum is 0, which is the recommended setting.
        '''
        dif = image_hash1 - image_hash2
        # print(dif)
        if dif < 0:
            dif = -dif
        return dif <= max_dif

    # Capture the CAPTCHA image, recognize it, and return the recognized code
    def savePicture(self):
        self.driver.get(self.logonUrl)
        self.driver.maximize_window()
        time.sleep(1)
        self.driver.save_screenshot(self.codedir + 'Temp.png')
        checkcode = self.driver.find_element_by_id('checkcode')
        location = checkcode.location  # x and y coordinates of the CAPTCHA
        size = checkcode.size  # width and height of the CAPTCHA
        # The region to crop from the screenshot
        rangle = (int(location['x']), int(location['y']),
                  int(location['x'] + size['width']),
                  int(location['y'] + size['height']))
        i = Image.open(self.codedir + 'Temp.png')  # open the screenshot
        result = i.crop(rangle)  # crop the CAPTCHA region out of the screenshot
        filename = self.codedir + 'Temp_code.png'
        result.save(filename)
        self.clearNoise(filename)
        file_list = self.splitimage(self.codedir, filename)
        verycode = ''
        for f in file_list:
            imageHash = self.get_ImageHash(f)
            for h, code in self.hash_code_dict.items():
                flag = self.compare_image_with_hash(imageHash, h, 0)
                if flag:
                    # print(code)
                    verycode += code
                    break
        print(verycode)
        # The driver stays open because longon() still needs the page
        return verycode

    def longon(self):
        self.driver.get(self.logonUrl)
        self.driver.maximize_window()
        time.sleep(1)
        code = self.savePicture()
        accname = self.driver.find_element_by_id('username')
        # accname = self.driver.find_element_by_id("//input[@id='username']")
        accname.send_keys('ctrchina')
        accpwd = self.driver.find_element_by_id('password')
        # accpwd.send_keys('123456')
        checkcode = self.driver.find_element_by_name('checkcode')
        checkcode.send_keys(code)
        submit = self.driver.find_element_by_name('button')
        submit.click()

Additional example:

# -*- coding: utf-8 -*-
import re
import requests
import io
import os
import json
from PIL import Image
from PIL import ImageEnhance
from bs4 import BeautifulSoup
import mdata  # the author's own module; judging from its use below, mdata.data maps each character to a 260-element 0/1 template


class Student:
    def __init__(self, user, password):
        self.user = str(user)
        self.password = str(password)
        self.s = requests.Session()

    def login(self):
        url = 'http://202.118.31.197/ACTIONLOGON.APPPROCESS?mode=4'
        res = self.s.get(url).text
        imageUrl = 'http://202.118.31.197/' + re.findall('<img src="(.+?)" width="55"', res)[0]
        im = Image.open(io.BytesIO(self.s.get(imageUrl).content))
        enhancer = ImageEnhance.Contrast(im)
        im = enhancer.enhance(7)
        # Turn every pixel that is not pure black into white
        x, y = im.size
        for i in range(y):
            for j in range(x):
                if im.getpixel((j, i)) != (0, 0, 0):
                    im.putpixel((j, i), (255, 255, 255))
        num = [6, 19, 32, 45]  # left edge of each of the four characters
        verifyCode = ''
        for n in range(4):
            # Each character occupies a 13x20 region, i.e. 260 pixels
            a = im.crop((num[n], 0, num[n] + 13, 20))
            l = []
            x, y = a.size
            for i in range(y):
                for j in range(x):
                    if a.getpixel((j, i)) == (0, 0, 0):
                        l.append(1)
                    else:
                        l.append(0)
            # Pick the template that agrees with the most pixels
            his = 0
            chrr = ''
            for ch in mdata.data:
                r = 0
                for j in range(260):
                    if l[j] == mdata.data[ch][j]:
                        r += 1
                if r > his:
                    his = r
                    chrr = ch
            verifyCode += chrr
        # print('CAPTCHA filled in automatically:', verifyCode)
        data = {
            'WebUserNO': str(self.user),
            'Password': str(self.password),
            'Agnomen': verifyCode,
        }
        url = 'http://202.118.31.197/ACTIONLOGON.APPPROCESS?mode=4'
        t = self.s.post(url, data=data).text
        if re.findall('images/Logout2', t) == []:
            l = '[0,"' + re.findall('alert((.+?));', t)[1][1][2:-2] + '"]' + ' ' + self.user + ' ' + self.password + '\n'
            # print(l)
            # return '[0,"' + re.findall('alert((.+?));', t)[1][1][2:-2] + '"]'
            return [False, l]
        else:
            l = 'Login succeeded ' + re.findall('! (.+?) ', t)[0] + ' ' + self.user + ' ' + self.password + '\n'
            # print(l)
            return [True, l]

    def getInfo(self):
        imageUrl = 'http://202.118.31.197/ACTIONDSPUSERPHOTO.APPPROCESS'
        data = self.s.get('http://202.118.31.197/ACTIONQUERYBASESTUDENTINFO.APPPROCESS?mode=3').text  # student record
        data = BeautifulSoup(data, 'lxml')
        q = data.find_all('table', attrs={'align': 'left'})
        a = []
        for i in q[0]:
            if type(i) == type(q[0]):
                for j in i:
                    if type(j) == type(i):
                        a.append(j.text)
        for i in q[1]:
            if type(i) == type(q[1]):
                for j in i:
                    if type(j) == type(i):
                        a.append(j.text)
        # Consecutive cells form key/value pairs
        data = {}
        for i in range(1, len(a), 2):
            data[a[i - 1]] = a[i]
        # data['photo'] = io.BytesIO(self.s.get(imageUrl).content)
        return json.dumps(data)

    def getPic(self):
        imageUrl = 'http://202.118.31.197/ACTIONDSPUSERPHOTO.APPPROCESS'
        pic = Image.open(io.BytesIO(self.s.get(imageUrl).content))
        return pic

    def getScore(self):
        score = self.s.get('http://202.118.31.197/ACTIONQUERYSTUDENTSCORE.APPPROCESS').text  # transcript
        score = BeautifulSoup(score, 'lxml')
        q = score.find_all(attrs={'height': '36'})[0]
        point = q.text
        # Print the page text from "平均学分绩点" (average GPA) onward
        print(point[point.find('平均学分绩点'):])
        table = score.html.body.table
        people = table.find_all(attrs={'height': '36'})[0].string
        r = table.find_all('table', attrs={'align': 'left'})[0].find_all('tr')
        subject = []
        lesson = []
        for i in r[0]:
            if type(r[0]) == type(i):
                subject.append(i.string)
        for i in r:
            k = 0
            temp = {}
            for j in i:
                if type(r[0]) == type(j):
                    temp[subject[k]] = j.string
                    k += 1
            lesson.append(temp)
        lesson.pop()
        lesson.pop(0)
        return json.dumps(lesson)

    def logoff(self):
        return self.s.get('http://202.118.31.197/ACTIONLOGOUT.APPPROCESS').text


if __name__ == '__main__':
    a = Student(20150000, 20150000)
    r = a.login()
    print(r[1])
    if r[0]:
        r = json.loads(a.getScore())
        for i in r:
            for j in i:
                print(i[j], end=' ')
            print()
        q = json.loads(a.getInfo())
        for i in q:
            print(i, q[i])
        a.getPic().show()
        a.logoff()
