Kasiski-Babbage Cryptanalysis in Python

This blog post shows a basic python implementation of the Kasiski-Babbage Cryptanalysis algorithm for solving a basic Vigenere crypto challenge.

 Principle

The Kasiski-Babbagge method is based on the fact that once we have found the key length found the clear text is reduced to find the clear text of a ceaser ciphered text (which is really easy).

Another principle of the method is that 3 or more characters at two different position in the ciphered text are most likely to have been ciphered with the same 3 or more characters of the key. Based on this fact the distance between the two suite of characters is very likely to be a multiple of the key length. Example:

As we can see on the picture above there is 3 patterns with a respective distance of 70, 40 and 10. So we can easily state that the key length is a multiple of the three distances which is 10. So, we can suppose that key is 10 characters long.

The key idea is then to split the text in 10 sub-texts with every ten characters in order to create 10 samples where the frequency of chars can be analyzed with a classical ceaser scheme.

Code !

The input file used for this example can be downloaded here

Now the key part of the cryptanalysis is to develop the algorithm to find all the tuples throughout the text calculate theirs distance and the divisors of it. So the first function to write is a function that calculate all the divisors of a value. The function below do the job and for the given integer return a list of all the divisors.

Get all the divisors of a given number
1
2
3
4
5
6
def getDivisors(n):
    l = []
    for i in range(2,n):
        if n % i == 0:
            l.append(i)
    return l

Now let’s move to the core function that find all the tuples. This function return two values that are the number of tuples found and a dirty list of the divisors of all tuples. (You will see later how it will be used).

Find identical patterns in list (size > 3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#l argument should be a list containing all bytes of the file (read with toList)
def getTuples(l):
    res = {}
    freq =[]
    count = 0
    i = 0
    while i < len(l): # Loop through all the list
        elt= l[i:i+3] # Take at least 3-character length for tuples
        long = len(elt)
        if long == 3: #should be 3 if not means we are at the end of the list
            for j in range(i+1,len(l)): #Find further in the list for the same pattern
                if l[i:i+long] == l[j:j+long]: #If match the 3-char check for more
                    while l[i:i+long] == l[j:j+long]:
                        long = long + 1
                    long = long -1
                    elt = l[i:i+long] # Now we have a tuple 
                    diff = j - i # Compute the distance
                    freq.extend(getDivisors(diff)) #Add the divisors to the list 
                    print ("%s\ti:%s\tj:%s\tdiff:%s\t\tDivisors:%s" % (elt,i,j, diff,getDivisors(diff))) #Print information about the tuple (can be deleted)
                    count = count +1
                    j = j + long + 1
            i = i + long -3 +1
        else:
            i = i + 1
    return count, freq

The next step is to make the count each divisors and sort them in descending order. The following function return a list sorted containing a tupe of the divisor number and the number of hits.

1
2
3
4
5
6
7
8
def countOcc(l): # return list with (decimal_char, occ) 
    d={}
    for elt in l:
        if d.has_key(elt):
            d[elt] += 1
        else:
            d[elt] = 1
    return sorted(d.items(),key=lambda x: x[1], reverse=True)

The list returned by this function is: [(2, 143), (5, 142), (10, 138), (4, 72), (20, 69), (6, 42), (3, 42), (15, 41), (8, 39), (30, 39), (40, 36), (7, 29), (14, 29), (35, 29), (25, 28), (50, 24), (70, 24), (12, 20), (60, 20)]

In this example we can see that 2 is a divisor for 138 tuple etc. But there is also 5 and 10 that are close and 4 is far behind. Moreover 2 and 5 are also divisors of 10 so this is obvious that the key is 10 characters long.

Now that we “know” the key length we just have to split the ciphered list into 10 sub-lists modulo 10. The function below return take in argument the key length and the original list and return a dictionnary that use the position as key (ex: 1 to 10) and all the sub-lists as items.

Split file in sublist of 10 chars
1
2
3
4
5
6
7
8
9
10
11
def explode(key,li):
    dic = {}
    for e in range(1,key+1):
        dic[e] = []
    i = 0
    for index in range(len(li)):
        if i == key:
            i = 0
        dic[i+1].append(li[index])
        i = i + 1
    return dic

The next step is the second delicate step. Indeed we need to find the shift for each sub-list. To find it we can do a frequency analysis considering the space as the most frequent character like it used to be in english. So what we will do here is do a frequency analysis for each sub-list. For each sub-list we consider the most frequent character as a space and then apply the associated shift to all character. And then we recreate the original list with all decipher sub-list and then recreate the text. Once it’s done we can easily identify if for one of the sub-list the space was not the most frequent and then we can try with the second most frequent and so on. The first function we need to acomplish this task is a function that decipher a list with a given shift:

1
2
3
4
5
6
7
8
9
def decipher(l,diff):
    newl = list()
    for e in l:
        val = e - diff
        if val < 0:
            newl.append(256 + (val % -256))
        else:
            newl.append(val)
    return newl

Now we need a function that do the reverse of the explode function. So this function takes back the dictionnary returned by explode and recreate the list.

1
2
3
4
5
6
7
8
9
10
11
def recreate(dic):
    i = 0
    output = []
    try:
        while 1:
            for l in dic.values():
                output.append(l[i])
            i = i + 1
    except:
        pass
    return output

To finish the following piece of code explode the original list, makes a frequency analysis decipher all the sub-list, then recreate the list and end up by reconverting every decimal character into character in order to print the result.

final code to trigger the all
1
2
3
4
5
6
7
8
9
res = explode(10, l) #We consider in this exemple a key length of 10 and l the original list
for i in range(1,10+1): #For each sub-list
    occ = countOcc(res[i]) #Make a frequency analysis
    shift = (occ[0][0] - 32) % 256 # Consider the most frequent element of being a space(32 in decimal)
    print ("Frequency analysis for the index: %s\tshift:%s\n%s\n" % (i,shift,occ)) #Print informations (can be deleted)
    res[i] = decipher(res[i],shift) #Try do decipher using a classical ceaser function and the determined shift

final = recreate(res) #Once we have processed all sub-list recreate a list with all the sub-lists.
print ''.join([chr(x) for x in final]) #Print the result