While playing with OpenCV, an idea quickly came to my mind: can OpenCV help to bypass captcha engines? The answer is mixed.
Old captcha engines can be bypassed easily, but it is not an exact science, and if you expect this article to explain how to bypass
the Google reCAPTCHA engine, I'd rather tell you right away that I didn't even try! reCAPTCHA is, from my point of view, the best. Anyway,
let's see what we can do with some basic engines. As the OCR engine I used the open source Tesseract together with my pytesser module
(my own pytesser implementation).
The project is hosted on Github.
Basic functions
In order to process real captchas I have written some functions that use OpenCV. These functions are basically the same as the OpenCV smooth, dilate, erode and so on, except that you can specify the number of rounds, so it is easy, for example, to apply 10 or more smooths in a row. An interesting function is getIndividualContoursRectangles, which returns a list of rectangle coordinates for the detected contours.
Note: I have also written a small example in the main that shows how to use these functions.
#-*- coding:utf-8 -*-
import cv2.cv as cv
import pytesser

def smoothImage(im, nbiter=0, filter=cv.CV_GAUSSIAN):
    for i in range(nbiter):
        cv.Smooth(im, im, filter)

def openCloseImage(im, nbiter=0):
    for i in range(nbiter):
        cv.MorphologyEx(im, im, None, None, cv.CV_MOP_OPEN)  # Open and close to make contours appear
        cv.MorphologyEx(im, im, None, None, cv.CV_MOP_CLOSE)

def dilateImage(im, nbiter=0):
    for i in range(nbiter):
        cv.Dilate(im, im)

def erodeImage(im, nbiter=0):
    for i in range(nbiter):
        cv.Erode(im, im)

def thresholdImage(im, value, filter=cv.CV_THRESH_BINARY_INV):
    cv.Threshold(im, im, value, 255, filter)

def resizeImage(im, (width, height)):
    # It appears to me that resizing an image can be significant for the OCR engine to detect characters
    res = cv.CreateImage((width, height), im.depth, im.channels)
    cv.Resize(im, res)
    return res

def getContours(im, approx_value=1):
    # Return approximated contours
    storage = cv.CreateMemStorage(0)
    contours = cv.FindContours(cv.CloneImage(im), storage, cv.CV_RETR_CCOMP, cv.CV_CHAIN_APPROX_SIMPLE)
    contourLow = cv.ApproxPoly(contours, storage, cv.CV_POLY_APPROX_DP, approx_value, approx_value)
    return contourLow

def getIndividualContoursRectangles(contours):
    # Return the bounding rect for every contour
    contourscopy = contours
    rectangleList = []
    while contourscopy:
        x, y, w, h = cv.BoundingRect(contourscopy)
        rectangleList.append((x, y, w, h))
        contourscopy = contourscopy.h_next()
    return rectangleList

if __name__ == "__main__":
    orig = cv.LoadImage("robin2.png")
    # Convert to black and white
    res = cv.CreateImage(cv.GetSize(orig), 8, 1)
    cv.CvtColor(orig, res, cv.CV_BGR2GRAY)

    # Operations on the image
    openCloseImage(res)
    dilateImage(res, 2)
    erodeImage(res, 2)
    smoothImage(res, 5)
    thresholdImage(res, 150, cv.CV_THRESH_BINARY_INV)

    # Get approximated contours
    contourLow = getContours(res, 3)

    # Draw them on an empty image
    final = cv.CreateImage(cv.GetSize(res), 8, 1)
    cv.Zero(final)
    cv.DrawContours(final, contourLow, cv.Scalar(255), cv.Scalar(255), 2, cv.CV_FILLED)

    cv.ShowImage("orig", orig)
    cv.ShowImage("image", res)
    cv.SaveImage("modified.png", res)
    cv.ShowImage("contour", final)
    cv.SaveImage("contour.png", final)
    cv.WaitKey(0)
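The main above does not exercise getIndividualContoursRectangles, so here is a minimal, hypothetical sketch (reusing the orig image and contourLow variable from the listing above; the output file name is my own choice) of how the bounding rectangles could be drawn on the original image:

# Hypothetical usage sketch: draw the bounding boxes of the detected contours
# (reuses orig, contourLow and the helper functions from the listing above)
boxes = getIndividualContoursRectangles(contourLow)
for (x, y, w, h) in boxes:
    cv.Rectangle(orig, (x, y), (x + w, y + h), cv.Scalar(0, 0, 255), 1)  # Red box around each contour
cv.SaveImage("boxes.png", orig)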
Examples:
Original image:
After processing:
After contour approximation:
Captcha downloader
To run tests at a larger scale I needed to be able to download many captcha images, because doing it by hand is too boring. That's why I have written a class that downloads captcha images in an automated manner. Basically the class takes two arguments: the url of the page where the captcha is located, and the pattern to search for in the src attribute of the img tags. The module also provides a function called "setup_Benchtest" that sets up the environment automatically: it creates a folder and downloads twenty images from the given url into the newly created directory. If the directory already exists, all the images in it are deleted and new ones are downloaded. You will see a practical usage of the class with the eBay captchas.
import urllib
import re
from HTMLParser import HTMLParser
from StringIO import StringIO
from PIL import Image
import cv2.cv as cv
import os

class Captcha_Downloader():

    class MyHTMLParser(HTMLParser):
        # This parser will try to find the given pattern and return the captcha url
        def __init__(self, pattern):
            HTMLParser.__init__(self)
            self.image_url = None
            self.pattern = pattern

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                for attr in attrs:
                    if attr[0] == "src":
                        if re.search(self.pattern, attr[1]):
                            self.image_url = attr[1]

        def getLink(self):
            return self.image_url

    def __init__(self, url, pattern, encoding=None):
        self.url = url
        self.encoding = encoding
        self.parser = self.MyHTMLParser(pattern)
        self.image_url = None
        self.imagestr = None
        self.image = None

    def run(self):
        f = urllib.urlopen(self.url)  # Open registration form
        if self.encoding is None:
            txt = f.read()  # Get page
        else:
            txt = f.read().decode(self.encoding)
        self.parser.feed(txt)  # Parse HTML to get image url
        self.image_url = self.parser.getLink()
        f = urllib.urlopen(self.image_url)  # Open image url
        self.imagestr = f.read()  # Read it
        self.string_to_iplimage(self.imagestr)  # Convert image

    def string_to_iplimage(self, im):
        # Convert the image returned by urllib into an OpenCV image
        pilim = StringIO(im)
        source = Image.open(pilim).convert("RGB")
        self.image = cv.CreateImageHeader(source.size, cv.IPL_DEPTH_8U, 3)
        cv.SetData(self.image, source.tostring())
        cv.CvtColor(self.image, self.image, cv.CV_RGB2BGR)

    def getImage(self):
        return self.image

def setup_Benchtest(dir, url, pattern, encoding=None):
    # Create a folder with multiple images
    if os.path.exists(dir):
        for file in os.listdir(dir):  # Remove all files of the dir if there are any
            os.remove(os.path.join(dir, file))
        os.removedirs(dir)
    os.mkdir(dir)
    dl = Captcha_Downloader(url, pattern, encoding)  # Create the downloader once
    for i in range(20):  # Download 20 images
        dl.run()
        im = dl.getImage()
        cv.SaveImage(os.path.join(dir, dir + str(i) + ".png"), im)
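As a quick, hypothetical usage sketch (the url, pattern and output file name below are placeholders, not a real captcha page), downloading a single image would look like this; the real call with the eBay urls is shown in the next section.

# Hypothetical usage sketch: the url and pattern are placeholders
dl = Captcha_Downloader("http://example.com/register", "captcha")
dl.run()                                           # Fetch the page, locate the matching img tag, download the image
cv.SaveImage("single_captcha.png", dl.getImage())  # Save the resulting OpenCV image to disk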
Ebay captcha
I chose to work on the eBay captcha because at first sight the images are quite simple, though as we will see that is not necessarily the case. In fact there was no real reason. To set up the test environment I just use the captcha downloader described above with the registration form url and the pattern in the image url, which is always "LoadBotImage".
Note: I have noticed that ebay.com does not seem to use a captcha when registering a new user, while ebay.fr does. Moreover, the captcha is contained in an iframe, and it is the url of this iframe that you should provide to the downloader.
Note also that this iframe url is generated and is no longer valid after a while.
Once the environment is set up we can process all the images to bring out the digit contours and remove the noise. The processing code is contained in the crack function, which basically applies, in succession:
resizeImage: increases the image size by a factor of 6 to get better results for the following operations
dilateImage: applied 4 times to remove noise
erodeImage: applied 4 times to recover from the dilation
thresholdImage: keeps only the interesting pixels
Note: The crack function can also return a contour-only version of the image, with a rough approximation of the contours.
import urllib
import re
from HTMLParser import HTMLParser
import pytesser
from StringIO import StringIO
from PIL import Image
import cv2.cv as cv
import os
from captcha_downloader import setup_Benchtest
from generic_ocr_operations import *

def crack(tocrack, withContourImage=False):
    # Function that intends to isolate all characters on the image so that the OCR can detect them
    # We just apply 4 filters but with multiple rounds
    resized = resizeImage(tocrack, (tocrack.width * 6, tocrack.height * 6))
    dilateImage(resized, 4)
    erodeImage(resized, 4)
    thresholdImage(resized, 200, cv.CV_THRESH_BINARY)
    if withContourImage:
        # If we want the image made only with contours
        contours = getContours(resized, 5)
        contourimage = cv.CreateImage(cv.GetSize(resized), 8, 3)
        cv.Zero(contourimage)
        cv.DrawContours(contourimage, contours, cv.Scalar(255), cv.Scalar(255), 2, cv.CV_FILLED)
        contourimage = resizeImage(contourimage, cv.GetSize(tocrack))
        resized = resizeImage(resized, cv.GetSize(tocrack))
        return resized, contourimage
    resized = resizeImage(resized, cv.GetSize(tocrack))
    return resized

def process_all(results):
    dir = "Ebay"  # Consider that all images are stored in the dir 'Ebay'
    for file, r in zip(os.listdir(dir), results):
        im = cv.LoadImage(os.path.join(dir, file), cv.CV_LOAD_IMAGE_GRAYSCALE)  # Load the image
        im = crack(im)  # Try to crack it
        res = pytesser.iplimage_to_string(im, psm=pytesser.PSM_SINGLE_WORD)  # Do character recognition
        res = res[:-2]  # Remove the two \n\n always put at the end of the result
        if res == r:  # Compare the result with the value contained in our list
            print file + ": " + res + " | " + r + " OK"
        else:
            print file + ": " + res + " | " + r + " NO"

if __name__ == "__main__":
    # Execute the following once to set up the environment
    '''
    dir = "Ebay"
    url = "https://scgi.ebay.fr/ws/eBayISAPI.dll?FetchCaptchaToken&parentPage=RegisterEnterInfo&tokenString=5WWZNQcAAAA%3D&ej2child=true"
    pattern = "LoadBotImage"
    setup_Benchtest(dir, url, pattern)
    '''
    # The list contains the expected results for my bench
    results = ["139866", "400961", "387740", "431418", "750113", "572574",
               "155885", "440543", "826316", "233388", "840189", "349093",
               "751181", "270699", "356535", "743987", "467643", "342527",
               "992978", "879970", ""]
    process_all(results)  # Process all the images
Results:
The table below shows the results obtained from these tests. The left column contains the captcha image. The following columns are:
Without processing: the result obtained by applying the OCR engine to the original image. This gives 20% accuracy.
After processing: the result obtained by applying the OCR engine after the image processing. The result is 30% accurate.
After processing with contours: the result of applying the OCR engine to the contour version of the image. As we can see the results are less accurate, at 25%.
As we can see the results are not satisfying at all, which is why I have devised a probabilistic method, presented in the next section.
Probabilistic cracking
I devised this method based on all the results obtained during my tests. The fact is that for a specific image I can almost always find the right parameters to get a nicely thresholded image that Tesseract will recognise correctly. The problem is that these parameters change from one image to another. So the idea is to try multiple sets of parameters on the same image and mix the results together to obtain the right recognition.
Note: The algorithm relies on facts that are highly dependent on the captcha engine. Here I based my algorithm on the fact that the characters are always digits and that there are always exactly 6 of them.
The core loop which tries all the different parameters is:
for dilate in [1, 3, 4, 5]:
    for erode in [1, 3, 4, 5]:
        for thresh in [125, 150, 175, 200]:
            for size in [(int(w*0.5), int(h*0.5)), (w, h), (w*2, h*2), (w*3, h*3)]:
                val = self.crack(dilate, erode, thresh, size)  # Call crack successively with all parameters
So it will try 4 different numbers of dilate rounds, 4 different numbers of erode rounds, 4 different threshold values and 4 different image sizes (which matter for detection). That makes 4 x 4 x 4 x 4 = 256 different images and therefore 256 different recognised values.
With this in mind the algorithm works as follows:
The image is processed with each set of parameters.
The string retrieved from Tesseract is split: the first character goes into a dictionary that holds all the first characters, the second into a dictionary that holds all the second characters, and so on. (If a character is not a digit it is not added to the dictionaries.)
At the end the class takes the most frequent character from every dictionary and reconstitutes a string of 6 digits, which is the final value.
For example, at the end of the analysis of the first image "Ebay0.png" the dictionaries contain:
And the result is: 198555 (which is wrong). As we can see the result for the first character is obvious, but for all the others there is a balance between multiple digits.
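Since the actual dictionary contents are not reproduced here, the following toy sketch uses made-up counts (not my real measurements) just to show how the most frequent digit is picked for each position:

# Toy illustration of the per-position vote counting (made-up counts, not real results)
positions = [
    {'1': 200, '7': 12},           # first position: '1' clearly wins
    {'9': 60, '3': 55, '8': 40},   # second position: close call between several digits
]
final = "".join(max(d, key=d.get) for d in positions)
print final   # -> "19"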
import pytesser
import os
import cv2.cv as cv
from generic_ocr_operations import *

class ProbabilisticCracker():

    def __init__(self, im):
        self.image = im
        self.values = [{}, {}, {}, {}, {}, {}]  # Will hold the occurrences of characters
        self.ocrvalue = ""  # Final value

    def getValue(self):
        return self.ocrvalue  # Return the final value

    def crack(self, dilateiter=4, erodeiter=4, threshold=200, size=(155, 55)):
        # Take all parameters
        resized = resizeImage(self.image, (self.image.width * 6, self.image.height * 6))
        dilateImage(resized, dilateiter)
        erodeImage(resized, erodeiter)
        thresholdImage(resized, threshold, cv.CV_THRESH_BINARY)
        resized = resizeImage(resized, size)
        # Call the tesseract engine
        ret = pytesser.iplimage_to_string(resized)
        ret = ret[:-2]
        return ret

    def run(self):
        # Main method
        w = self.image.width
        h = self.image.height
        for dilate in [1, 3, 4, 5]:
            for erode in [1, 3, 4, 5]:
                for thresh in [125, 150, 175, 200]:
                    for size in [(int(w*0.5), int(h*0.5)), (w, h), (w*2, h*2), (w*3, h*3)]:
                        val = self.crack(dilate, erode, thresh, size)  # Call crack successively with all parameters
                        #print "Val:", val
                        self.accumulateChars(val)  # Call accumulate
        self.postAnalysis()

    def accumulateChars(self, val):
        l = len(val)
        for i in range(6):  # Only iterate over the 6 first chars
            if i > l - 1:  # Break if the string is shorter
                break
            c = val[i]
            if c.isdigit():  # Put the char only if this is a digit
                if self.values[i].has_key(c):
                    self.values[i][c] += 1  # Add 1 if the entry already exists
                else:
                    self.values[i][c] = 1

    def postAnalysis(self):
        # Analyse at the end
        for vals in self.values:  # For every dictionary
            c, v = self.max(vals)  # Take the most occurring
            #print "Max:", c, v
            self.ocrvalue += c  # Append it to the final string

    def max(self, d):
        m = 0
        elt = ''
        for k, v in d.items():
            if v > m:
                m = v
                elt = k
        return elt, m

def process_all(results):
    dir = "Ebay"
    for file, r in zip(os.listdir(dir), results):  # For every file in the directory
        im = cv.LoadImage(os.path.join(dir, file), cv.CV_LOAD_IMAGE_GRAYSCALE)  # Open the file
        cracker = ProbabilisticCracker(im)  # Instantiate the ProbabilisticCracker
        cracker.run()  # Run it
        res = cracker.getValue()  # Take the final value
        if res == r:  # Compare it with the right one
            print file + ": " + res + " | " + r + " OK"
        else:
            print file + ": " + res + " | " + r + " NO"
        nb = 6
        count = 0
        for c1, c2 in zip(res, r):  # Make a char/char comparison to compute the accuracy
            if c1 == c2:
                count += 1
        print "Avg: ", (count * 100) / nb, "%"

if __name__ == "__main__":
    results = ["139866", "400961", "387740", "431418", "750113", "572574",
               "155885", "440543", "826316", "233388", "840189", "349093",
               "751181", "270699", "356535", "743987", "467643", "342527",
               "992978", "879970", ""]
    process_all(results)
Results:
The table below shows the results of the probabilistic algorithm and the accuracy. As we can see the algorithm detects more coherent sequences of digits, but the overall number of fully correct results is not really better.
Conclusion
When you attack a single captcha you always end up making it work by adjusting all the parameters: filters, number of rounds and so on. But when you try to build a generic algorithm for a given captcha engine it is far more complex, even for a simple one like the eBay captcha engine. As we have seen, the probabilistic approach gives noticeably more accurate results, but they are still not satisfying due to the limitations of the OCR engine. The Tesseract engine is really capricious (and does not seem to be the best).
With today's captcha engines it is shape recognition, more than character recognition, that should be done. So a good idea would be, for a given captcha engine, to keep multiple instances of every character in a database. Then, to decrypt a captcha, each character would be segmented and its shape compared with the registered elements in the database to find the right character. Obviously this method makes OCR engines unnecessary.
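As a rough sketch of that idea (not part of the project: the "templates" folder, its digit_index.png naming scheme and the match_character helper are my own assumptions), a segmented character could be compared against stored samples with cv.MatchTemplate:

# Hypothetical sketch of shape matching against a database of character samples.
# The "templates" folder and its <digit>_<index>.png naming are assumptions, not part of the project.
import os
import cv2.cv as cv

def match_character(charim, templatedir="templates"):
    # charim is a single-channel 8-bit image containing one segmented character
    best_digit, best_score = None, -1.0
    for name in os.listdir(templatedir):
        templ = cv.LoadImage(os.path.join(templatedir, name), cv.CV_LOAD_IMAGE_GRAYSCALE)
        resized = cv.CreateImage(cv.GetSize(templ), 8, 1)
        cv.Resize(charim, resized)  # Bring the character to the template size
        result = cv.CreateImage((1, 1), cv.IPL_DEPTH_32F, 1)  # Same sizes -> a single match score
        cv.MatchTemplate(resized, templ, result, cv.CV_TM_CCOEFF_NORMED)
        score = cv.Get2D(result, 0, 0)[0]
        if score > best_score:
            best_digit, best_score = name.split("_")[0], score
    return best_digit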