Cracking Basic Captchas With OpenCV

Introduction

While playing with OpenCV, an idea quickly came to mind: can OpenCV help to bypass captcha engines? The answer is mixed. Old captcha engines can be bypassed fairly easily, but it is not an exact science, and if you expect this article to explain how to bypass the Google reCAPTCHA engine, I should tell you that I did not even try! reCAPTCHA is, from my point of view, the best. Anyway, let’s see what we can do with some basic engines. As the OCR engine I used the open source tesseract together with my pytesser module (my own pytesser implementation). The project is hosted on GitHub.

Basic functions

In order to process real captchas I have written some functions that use OpenCV. These functions are basically the same as OpenCV’s smooth, dilate, erode and so on, except that you can specify the number of rounds. It is then easy, for example, to apply 10 or more smoothing passes in a row. An interesting function is getIndividualContoursRectangles, which returns a list with the bounding rectangle (x, y, width, height) of every detected contour. Note: I have also written a small example in the main block that shows how to use these functions.

The code:

#-*- coding:utf-8 -*-
import cv2.cv as cv
import pytesser

def smoothImage(im, nbiter=0, filter=cv.CV_GAUSSIAN):
    for i in range(nbiter):
        cv.Smooth(im, im, filter)

def openCloseImage(im, nbiter=0):
    for i in range(nbiter):
        cv.MorphologyEx(im, im, None, None, cv.CV_MOP_OPEN) #Open then close to make the contours stand out
        cv.MorphologyEx(im, im, None, None, cv.CV_MOP_CLOSE)

def dilateImage(im, nbiter=0):
    for i in range(nbiter):
        cv.Dilate(im, im)

def erodeImage(im, nbiter=0):
    for i in range(nbiter):
        cv.Erode(im, im)

def thresholdImage(im, value, filter=cv.CV_THRESH_BINARY_INV):
    cv.Threshold(im, im, value, 255, filter)

def resizeImage(im, (width, height)):
    #In my experience, resizing an image can make a significant difference in how well the ocr engine detects characters
    res = cv.CreateImage((width,height), im.depth, im.channels)
    cv.Resize(im, res)
    return res

def getContours(im, approx_value=1): #Return contours approximated
    storage = cv.CreateMemStorage(0)
    contours = cv.FindContours(cv.CloneImage(im), storage, cv.CV_RETR_CCOMP, cv.CV_CHAIN_APPROX_SIMPLE)
    contourLow=cv.ApproxPoly(contours, storage, cv.CV_POLY_APPROX_DP,approx_value,approx_value)
    return contourLow

def getIndividualContoursRectangles(contours): #Return the bounding rect of every contour
    contourscopy = contours
    rectangleList = []
    while contourscopy:
        x,y,w,h = cv.BoundingRect(contourscopy)
        rectangleList.append((x,y,w,h))
        contourscopy = contourscopy.h_next()
    return rectangleList


if __name__=="__main__":
    orig = cv.LoadImage("robin2.png")
    #Convert to grayscale
    res = cv.CreateImage(cv.GetSize(orig), 8, 1)
    cv.CvtColor(orig, res, cv.CV_BGR2GRAY)

    #Operations on the image
    openCloseImage(res) #Note: nbiter defaults to 0, so this call is a no-op as written
    dilateImage(res, 2)
    erodeImage(res, 2)
    smoothImage(res, 5)
    thresholdImage(res, 150, cv.CV_THRESH_BINARY_INV)
    
    #Get contours approximated
    contourLow = getContours(res, 3)
    
    #Draw them on an empty image
    final = cv.CreateImage(cv.GetSize(res), 8, 1)
    cv.Zero(final)
    cv.DrawContours(final, contourLow, cv.Scalar(255), cv.Scalar(255), 2, cv.CV_FILLED)    
    
    cv.ShowImage("orig", orig)
    cv.ShowImage("image", res)
    cv.SaveImage("modified.png", res)
    cv.ShowImage("contour", final)
    cv.SaveImage("contour.png", final)
    
    cv.WaitKey(0)

Examples:

Original image:

After processing:

After contour approximation:

Captcha downloader

To run tests at a larger scale I needed to download many captcha images, and doing it by hand is too tedious. That’s why I have written a class that downloads captcha images automatically. The class takes two arguments: the url of the page where the captcha is located, and a pattern that is searched for in the src url of every img tag. The module also provides a function called “setup_Benchtest” that creates the test environment automatically: it creates a folder and downloads twenty images from the given url into the newly created directory. If the directory already exists, all the images in it are deleted and new ones are downloaded. You will see a practical usage of the class for Ebay captchas.

The code:

import urllib
import re
from HTMLParser import HTMLParser
from StringIO import StringIO
from PIL import Image
import cv2.cv as cv
import os

class Captcha_Downloader():

    class MyHTMLParser(HTMLParser):
    #This parser tries to find the given pattern and returns the captcha url
        def __init__(self,pattern):
            HTMLParser.__init__(self)
            self.image_url = None
            self.pattern =pattern

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                for attr in attrs:
                    if attr[0] == "src":
                        if re.search(self.pattern, attr[1]):
                            self.image_url = attr[1]

        def getLink(self):
            return self.image_url

    def __init__(self, url, pattern, encoding=None):
        self.url = url
        self.encoding = encoding
        self.parser = self.MyHTMLParser(pattern)
        self.image_url = None
        self.imagestr = None
        self.image= None

    def run(self):
        f = urllib.urlopen(self.url) #Open registration form
        if self.encoding is None:
            txt = f.read()#get page
        else:
            txt = f.read().decode(self.encoding)

        self.parser.feed(txt) #Parse HTML to get image url

        self.image_url = self.parser.getLink()

        f = urllib.urlopen(self.image_url) #Open image url
        self.imagestr = f.read() #Read it


        self.string_to_iplimage(self.imagestr)#Convert image


    def string_to_iplimage(self, im):
    #Convert the image returned by urllib into an OpenCV image
        pilim = StringIO(im)
        source = Image.open(pilim).convert("RGB")

        self.image = cv.CreateImageHeader(source.size, cv.IPL_DEPTH_8U, 3)
        cv.SetData(self.image, source.tostring())
        cv.CvtColor(self.image, self.image, cv.CV_RGB2BGR)

    def getImage(self):
        return self.image


def setup_Benchtest(dir, url, pattern, encoding=None):#Create a folder with multiples images
    if os.path.exists(dir):
        for file in os.listdir(dir): #Remove all files of the dir if there's any
            os.remove(os.path.join(dir,file))
        os.removedirs(dir)
    os.mkdir(dir)
    dl = Captcha_Downloader(url, pattern, encoding) #Create the downloader once
    for i in range(20): #Download 20 images
        dl.run()
        im = dl.getImage()
        cv.SaveImage(os.path.join(dir,dir+str(i)+".png"), im)
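
The listing above targets Python 2, where the parser lives in the HTMLParser module. As a minimal sketch of just the tag-matching step on Python 3 (where the module is html.parser), the snippet below feeds a hard-coded page instead of a downloaded one. The class name CaptchaImageFinder, the sample HTML and the example.invalid urls are made up for illustration; only the “LoadBotImage” pattern comes from this article.

```python
import re
from html.parser import HTMLParser


class CaptchaImageFinder(HTMLParser):
    """Remember the src of the last <img> tag whose url matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "src" and value and self.pattern.search(value):
                self.image_url = value


# Hypothetical page content standing in for the registration form
SAMPLE_PAGE = """
<html><body>
  <img src="https://example.invalid/logo.png">
  <img src="https://example.invalid/LoadBotImage?token=abc">
</body></html>
"""

finder = CaptchaImageFinder("LoadBotImage")
finder.feed(SAMPLE_PAGE)
print(finder.image_url)  # https://example.invalid/LoadBotImage?token=abc
```

Like the original parser, this keeps the last matching image if several match. Unlike the original, it skips img tags whose src attribute has no value.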

Ebay captcha

I have chosen to work on the Ebay captcha because at first sight it looks quite simple, although that is not necessarily the case, as we will see. Beyond that there is no particular reason. To set up the test environment I just use the captcha downloader described above with the registration form url and the pattern found in the image url, which is always “LoadBotImage”.

Note: I have noticed that ebay.com does not seem to use a captcha to register a new user while ebay.fr does. Moreover, the captcha is contained in an iframe, and it is the url of this iframe that you should provide to the downloader.

Note again that this iframe url is generated and is no longer valid after a while.

Once the environment is set up we can process all the images to bring out the contours of the digits and remove the noise. The processing code is contained in the crack function, which successively applies:

  • resizeImage: to enlarge the image by a factor of 6 and get better results from the following operations
  • dilateImage: applied 4 times to remove noise
  • erodeImage: applied 4 times to recover from the dilation
  • thresholdImage: to keep only the interesting pixels

Note: The crack function can also return a contour-only version of the image, drawn from a rough approximation of the contours.

The code:

import urllib
import re
from HTMLParser import HTMLParser
import pytesser
from StringIO import StringIO
from PIL import Image
import cv2.cv as cv
import os
from captcha_downloader import setup_Benchtest
from generic_ocr_operations import *

def crack(tocrack,withContourImage=False):
    #Function that tries to bring out all the characters in the image so that the ocr can detect them

    #We just apply 4 filters but with multiple rounds
    resized = resizeImage(tocrack, (tocrack.width*6, tocrack.height*6))
    dilateImage(resized, 4)
    erodeImage(resized, 4)
    thresholdImage(resized, 200, cv.CV_THRESH_BINARY)

    if withContourImage: #If we want the image made only with contours
        contours = getContours(resized, 5)
        contourimage = cv.CreateImage(cv.GetSize(resized), 8, 3)
        cv.Zero(contourimage)
        cv.DrawContours(contourimage, contours, cv.Scalar(255), cv.Scalar(255), 2, cv.CV_FILLED)

        contourimage = resizeImage(contourimage, cv.GetSize(tocrack))
        resized = resizeImage(resized, cv.GetSize(tocrack))
        return resized, contourimage

    resized = resizeImage(resized, cv.GetSize(tocrack))
    return resized


def process_all(results):
    dir = "Ebay" #Consider that all images are stored in the dir 'Ebay'
    for file,r in zip(os.listdir(dir),results):
        im = cv.LoadImage(os.path.join(dir,file),cv.CV_LOAD_IMAGE_GRAYSCALE) #Load the image
        im = crack(im) #Try to crack it
        res = pytesser.iplimage_to_string(im,psm=pytesser.PSM_SINGLE_WORD) #Do character recognition
        res = res[:-2] #Remove the two \n\n always put at the end of the result
        if res == r: #Compare the result with the expected value from our list
            print file+": "+res+" | "+r+ " OK"
        else:
            print file+": "+res+" | "+r+" NO"


if __name__=="__main__":

    #Execute the following once to set up the environment
    '''
    dir = "Ebay"
    url = "https://scgi.ebay.fr/ws/eBayISAPI.dll?FetchCaptchaToken&parentPage=RegisterEnterInfo&tokenString=5WWZNQcAAAA%3D&ej2child=true"
    pattern = "LoadBotImage"
    setup_Benchtest(dir, url, pattern)
    '''

    #The list contains the expected results for my bench
    results=["139866","400961","387740","431418","750113","572574","155885","440543","826316","233388","840189","349093","751181","270699","356535","743987","467643","342527","992978","879970",""]
    process_all(results) #Process all the images
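
To see why dilating and then eroding removes thin noise while keeping thick strokes, here is a toy pure-Python sketch on a tiny image. It does not use OpenCV. It assumes dark characters on a light background and a 3x3 square neighbourhood, which as far as I know is what cv.Dilate and cv.Erode use when no structuring element is passed. The picture, the gray values 30 and 230 and the helper names are made up for illustration.

```python
def morph(grid, pick):
    """Apply a 3x3 neighbourhood filter; pick is max (dilate) or min (erode)."""
    h, w = len(grid), len(grid[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the in-bounds neighbours of (x, y), including itself
            neigh = [grid[j][i]
                     for j in range(max(0, y - 1), min(h, y + 2))
                     for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(pick(neigh))
        out.append(row)
    return out


def dilate(grid, rounds=1):
    for _ in range(rounds):
        grid = morph(grid, max)  # bright areas grow, thin dark details vanish
    return grid


def erode(grid, rounds=1):
    for _ in range(rounds):
        grid = morph(grid, min)  # surviving dark strokes regain their width
    return grid


def threshold(grid, value):
    # Same rule as a binary threshold: above the value -> 255, else 0
    return [[255 if p > value else 0 for p in row] for row in grid]


# '#' is a dark pixel: a 3x3 "stroke" on the left, one noise pixel on the right
PICTURE = [
    ".........",
    ".###.....",
    ".###..#..",
    ".###.....",
    ".........",
]
image = [[30 if c == "#" else 230 for c in line] for line in PICTURE]

cleaned = threshold(erode(dilate(image, 1), 1), 200)
result = ["".join("#" if p == 0 else "." for p in row) for row in cleaned]
print("\n".join(result))
```

The isolated pixel disappears and the 3x3 stroke comes back unchanged. On the real captchas the image is first enlarged by 6, which is why more rounds (4 in the crack function) are needed to get a comparable effect.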

Results:

The table below shows the results obtained in the tests. The left column contains the captcha image. The following columns are:

  • Without processing: the result obtained by applying the ocr engine to the original image. This gives 20% accuracy.
  • After processing: the result obtained by applying the ocr engine after the image processing. The result is 30% accurate.
  • After processing with contours: the result of applying the ocr engine to the contour version of the image. As we can see, the results are less accurate, at 25%.

As we can see, the results are not satisfying at all, which is why I have devised a probabilistic method, presented in the next section.

Probabilistic cracking

I devised this method from all the results obtained during my tests. The fact is that, for a specific image, I can almost always find the right parameters to get a clean thresholded image that tesseract recognises without trouble. The problem is that these parameters change from one image to another. So the idea is to try multiple parameter sets on the same image and combine the results to obtain the right recognition.

Note: The algorithm relies on facts that are highly dependent on the captcha engine. Here I based my algorithm on the fact that the characters are always digits and that there are always exactly 6 of them.

The core loop which tries all the different parameters is:

for dilate in [1,3,4,5]:
    for erode in [1,3,4,5]:
        for thresh in [125,150,175,200]:
            for size in [(int(w*0.5),int(h*0.5)),(w,h),(w*2,h*2),(w*3,h*3)]:
                val = self.crack(dilate, erode, thresh, size) #Call crack successively all parameters

So it will try 4 different numbers of dilate rounds, 4 different numbers of erode rounds, 4 different threshold values and 4 different image sizes (which matter for detection). That makes 4 × 4 × 4 × 4 = 256 different images, and therefore 256 different OCR values.
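
As a quick check of that count, the same grid can be enumerated with itertools.product, which yields the combinations in the same order as the nested loops. This is only a sketch: the width and height (155 by 55) are taken from the default size argument of crack in the full listing and are assumed here to be a typical captcha size.

```python
from itertools import product

# Assumed image size (the default size argument of crack in the listing)
w, h = 155, 55

dilate_rounds = [1, 3, 4, 5]
erode_rounds = [1, 3, 4, 5]
thresholds = [125, 150, 175, 200]
sizes = [(int(w * 0.5), int(h * 0.5)), (w, h), (w * 2, h * 2), (w * 3, h * 3)]

# Every (dilate, erode, thresh, size) tuple the core loop will try
combinations = list(product(dilate_rounds, erode_rounds, thresholds, sizes))

print(len(combinations))  # 256
print(combinations[0])    # (1, 1, 125, (77, 27))
```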

With this in mind, the algorithm works as follows:

  • The image is processed with one set of values
  • The string retrieved from tesseract is split into characters. The first character goes into a dictionary that counts all the first characters, the second goes into a dictionary that counts all the second characters, and so on. (A character that is not a digit is not added to the dictionaries.)
  • At the end, the class takes the most frequent character of every dictionary and rebuilds a string of 6 digits, which is the final value.
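
The voting steps above can be sketched independently of OpenCV and tesseract. The version below is a compact Python 3 illustration using collections.Counter, not the class from the full listing. The function name vote and the list of OCR readings are made up for the example.

```python
from collections import Counter


def vote(ocr_strings, length=6):
    """Majority vote per character position over many OCR readings."""
    counters = [Counter() for _ in range(length)]
    for text in ocr_strings:
        # Only the first `length` characters are considered
        for position, char in enumerate(text[:length]):
            if char.isdigit():  # non-digits are ignored
                counters[position][char] += 1
    # Most frequent digit per position; empty string if nothing was seen
    return "".join(c.most_common(1)[0][0] if c else "" for c in counters)


# Made-up readings of the same captcha under different parameters
readings = ["139866", "189866", "l39866", "13986", "739866"]
print(vote(readings))  # 139866
```

No single reading needs to be fully right: each position is decided on its own. When two digits are tied, this sketch keeps the one that was seen first.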

For example, at the end of the analysis of the first image, “Ebay0.png”, the dictionaries contain:

{'1': 19}
{'9': 11, '1': 6, '7': 1}
{'9': 4, '8': 9, '7': 5}
{'8': 7, '3': 1, '5': 8, '6': 5}
{'8': 6, '3': 1, '5': 12, '7': 1, '6': 1}
{'1': 1, '3': 2, '5': 5, '6': 3, '9': 1, '8': 2}

And the result is: 198555 (which is wrong). As we can see, the result for the first character is obvious, but for all the others the votes are split between several digits.
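
To put numbers on that balance, the small Python 3 sketch below takes the six dictionaries exactly as printed above and shows, for each position, the winning digit and its share of the votes. The variable names are mine.

```python
# The six dictionaries printed above for Ebay0.png
tallies = [
    {'1': 19},
    {'9': 11, '1': 6, '7': 1},
    {'9': 4, '8': 9, '7': 5},
    {'8': 7, '3': 1, '5': 8, '6': 5},
    {'8': 6, '3': 1, '5': 12, '7': 1, '6': 1},
    {'1': 1, '3': 2, '5': 5, '6': 3, '9': 1, '8': 2},
]

winners = []
for position, tally in enumerate(tallies):
    # Digit with the highest count at this position
    digit, votes = max(tally.items(), key=lambda item: item[1])
    total = sum(tally.values())
    winners.append(digit)
    print(position, digit, "%d/%d" % (votes, total))

print("".join(winners))  # 198555
```

The winner holds 19/19, 11/18, 9/18, 8/21, 12/21 and 5/14 of the votes respectively. Positions where the winner has half of the votes or less are the ones most likely to be wrong. Note also that at most 21 of the 256 runs contributed a digit at any position.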

The code:

import pytesser
import os
import cv2.cv as cv
from generic_ocr_operations import *

class ProbabilisticCracker():

    def __init__(self, im):
        self.image = im
        self.values = [{},{},{},{},{},{}] #Will hold occurrence of characters
        self.ocrvalue = "" #Final value

    def getValue(self):
        return self.ocrvalue #Return the final value

    def crack(self,dilateiter=4, erodeiter=4, threshold=200, size=(155,55)): #Take all parameters
        resized = resizeImage(self.image, (self.image.width*6, self.image.height*6))

        dilateImage(resized, dilateiter)
        erodeImage(resized, erodeiter)
        thresholdImage(resized,threshold, cv.CV_THRESH_BINARY)

        resized = resizeImage(resized, size)

        #Call the tesseract engine
        ret = pytesser.iplimage_to_string(resized)
        ret = ret[:-2]
        return ret

    def run(self): #Main method
        w = self.image.width
        h = self.image.height

        for dilate in [1,3,4,5]:
            for erode in [1,3,4,5]:
                for thresh in [125,150,175,200]:
                    for size in [(int(w*0.5),int(h*0.5)),(w,h),(w*2,h*2),(w*3,h*3)]:
                        val = self.crack(dilate, erode, thresh, size) #Call crack successively all parameters
                        #print "Val:",val
                        self.accumulateChars(val) #Call accumulate
        self.postAnalysis()

    def accumulateChars(self,val):
        l = len(val)
        for i in range(6): #Only iterate the 6 first chars
            if i > l-1: #Stop if the string is shorter than 6 chars
                break
            c = val[i]
            if c.isdigit(): #Put the char only if this is a digit
                if self.values[i].has_key(c):
                    self.values[i][c] += 1 #Add 1 if the entry character already exists
                else:
                    self.values[i][c] = 1

    def postAnalysis(self): #Analyse at the end
        for vals in self.values:#For every dictionary
            c, v = self.max(vals) #Take the most frequent character
            #print "Max:", c,v
            self.ocrvalue += c #Append it to the final string

    def max(self, d):
        m = 0
        elt = ''
        for k,v in d.items():
            if v > m:
                m = v
                elt = k
        return elt, m


def process_all(results):
    dir = "Ebay"
    for file,r in zip(os.listdir(dir),results): #For every file in the directory
        im = cv.LoadImage(os.path.join(dir,file),cv.CV_LOAD_IMAGE_GRAYSCALE) #Open the file
        cracker = ProbabilisticCracker(im) #Instantiate the ProbabilisticCracker
        cracker.run() #Run it
        res = cracker.getValue() #Take the final value

        if res == r: #Compare it with the right one
            print file+": "+res+" | "+r+ " OK"
        else:
            print file+": "+res+" | "+r+" NO"
        nb=6
        count = 0
        for c1,c2 in zip(res,r): #Make a char/char comparison to compute the accuracy
            if c1 == c2:
                count +=1
        print "Avg: ", (count*100)/nb, "%"


if __name__=="__main__":

    results=["139866","400961","387740","431418","750113","572574","155885","440543","826316","233388","840189","349093","751181","270699","356535","743987","467643","342527","992978","879970",""]
    process_all(results)

Results:

The table shown below gives the results of the probabilistic algorithm and its accuracy. As we can see, the algorithm detects more coherent sequences of digits, but the overall rate of correct results is not really better.

Conclusion

When you attack a single captcha, it always ends up working once you adjust all the parameters (filters, rounds and so on), but when you try to devise a generic algorithm for a given captcha engine it is far more complex, even for a simple captcha such as the Ebay one. As we can see, the probabilistic approach gives more accurate results, but they are still not satisfying because of the limitations of the OCR engine. As you can see, the tesseract engine is really capricious (and does not seem to be the best).

With today’s captcha engines, what is needed is shape recognition more than character recognition. So a good idea would be, for a given captcha engine, to keep in a database multiple instances of every character. Then, to decode a captcha, the characters would be split apart and the shape of each one compared with the elements registered in the database to find the right character. Obviously, using this method makes OCR engines unnecessary.
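
This idea is not implemented in this article, but a rough sketch helps to make it concrete. The Python 3 snippet below assumes the characters have already been split and scaled to the same tiny size, and it compares shapes by counting differing pixels (Hamming distance), which is my own choice of comparison for illustration. The 3x5 bitmaps and all the names are hypothetical.

```python
# Hypothetical "database": a few 3x5 bitmaps per character, rows separated
# by spaces, '1' meaning an inked pixel
TEMPLATES = {
    "1": ["010 110 010 010 111", "010 010 010 010 010"],
    "7": ["111 001 010 010 010", "111 001 001 001 001"],
}


def to_bits(glyph):
    """Flatten a bitmap string into a list of booleans."""
    return [c == "1" for c in glyph if c in "01"]


def distance(a, b):
    """Number of pixels that differ between two same-sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def classify(glyph):
    """Return (character, distance) of the closest stored instance."""
    bits = to_bits(glyph)
    best_char, best_dist = None, None
    for char, samples in TEMPLATES.items():
        for sample in samples:
            d = distance(bits, to_bits(sample))
            if best_dist is None or d < best_dist:
                best_char, best_dist = char, d
    return best_char, best_dist


# A "7" with one wrong pixel in the last row
noisy_seven = "111 001 010 010 110"
print(classify(noisy_seven))  # ('7', 1)
```

Storing several instances per character is what lets the match tolerate the distortions a captcha engine applies. A real version would need a sturdier shape comparison than a raw pixel count.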
