This blog post shows a basic python implementation of the Kasiski-Babbage Cryptanalysis algorithm for solving a basic Vigenere crypto challenge.
Principle
The Kasiski-Babbagge method is based on the fact that once we have found the key length found the clear text is reduced to find the clear text of a ceaser ciphered text (which is really easy).
Another principle of the method is that 3 or more characters at two different position in the ciphered text are most likely to have been ciphered with the same 3 or more characters of the key. Based on this fact the distance between the two suite of characters is very likely to be a multiple of the key length. Example:
As we can see on the picture above there is 3 patterns with a respective distance of 70, 40 and 10. So we can easily state that the key length is a multiple of the three distances which is 10. So, we can suppose that key is 10 characters long.
The key idea is then to split the text in 10 sub-texts with every ten characters in order to create 10 samples where the frequency of chars can be analyzed with a classical ceaser scheme.
Code !
The input file used for this example can be downloaded here
Now the key part of the cryptanalysis is to develop the algorithm to find all the tuples throughout the text calculate theirs distance and the divisors of it. So the first function to write is a function that calculate all the divisors of a value. The function below do the job and for the given integer return a list of all the divisors.
1 2 3 4 5 6 |
|
Now let’s move to the core function that find all the tuples. This function return two values that are the number of tuples found and a dirty list of the divisors of all tuples. (You will see later how it will be used).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
The next step is to make the count each divisors and sort them in descending order. The following function return a list sorted containing a tupe of the divisor number and the number of hits.
1 2 3 4 5 6 7 8 |
|
The list returned by this function is: [(2, 143), (5, 142), (10, 138), (4, 72), (20, 69), (6, 42), (3, 42), (15, 41), (8, 39), (30, 39), (40, 36), (7, 29), (14, 29), (35, 29), (25, 28), (50, 24), (70, 24), (12, 20), (60, 20)]
In this example we can see that 2 is a divisor for 138 tuple etc. But there is also 5 and 10 that are close and 4 is far behind. Moreover 2 and 5 are also divisors of 10 so this is obvious that the key is 10 characters long.
Now that we “know” the key length we just have to split the ciphered list into 10 sub-lists modulo 10. The function below return take in argument the key length and the original list and return a dictionnary that use the position as key (ex: 1 to 10) and all the sub-lists as items.
1 2 3 4 5 6 7 8 9 10 11 |
|
The next step is the second delicate step. Indeed we need to find the shift for each sub-list. To find it we can do a frequency analysis considering the space as the most frequent character like it used to be in english. So what we will do here is do a frequency analysis for each sub-list. For each sub-list we consider the most frequent character as a space and then apply the associated shift to all character. And then we recreate the original list with all decipher sub-list and then recreate the text. Once it’s done we can easily identify if for one of the sub-list the space was not the most frequent and then we can try with the second most frequent and so on. The first function we need to acomplish this task is a function that decipher a list with a given shift:
1 2 3 4 5 6 7 8 9 |
|
Now we need a function that do the reverse of the explode function. So this function takes back the dictionnary returned by explode and recreate the list.
1 2 3 4 5 6 7 8 9 10 11 |
|
To finish the following piece of code explode the original list, makes a frequency analysis decipher all the sub-list, then recreate the list and end up by reconverting every decimal character into character in order to print the result.
1 2 3 4 5 6 7 8 9 |
|