Crypto-IT

Frequency Analysis

Frequency analysis is a known-ciphertext (ciphertext-only) attack. It is based on studying how often letters or groups of letters occur in a ciphertext.

In all languages, different letters are used with different frequencies. The proportions in which characters appear differ slightly from language to language, so texts written in a given language share common properties that make it possible to distinguish them from texts written in other languages.

For example, English makes heavy use of the vowels e, o and a and of the consonant t. On the other hand, some letters, such as z or x, are very rare. Tables of letter frequencies have been compiled for many languages. The frequencies can be determined only approximately, because they vary slightly between kinds of text (scientific, historical, fiction).
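To illustrate how such a frequency table can be produced, here is a minimal Python sketch that counts the letters a-z in a text and reports each as a percentage. The function name letter_frequencies and the sample sentence are made up for this illustration; a text this short gives only a rough picture.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: percentage} for the letters a-z found in text,
    ordered from most to least frequent."""
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: 100.0 * n / total for letter, n in counts.most_common()}

# Hypothetical sample; longer texts give more stable percentages.
sample = "Frequency analysis is based on the frequency of letters in a text."
for letter, pct in list(letter_frequencies(sample).items())[:5]:
    print(f"{letter}\t{pct:.2f}%")
```

Digits, spaces and punctuation are ignored, so the percentages sum to 100 over the letters only.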

Each language also has typical, frequently occurring sequences of letters. In English, common bigrams include tr, er, on and an, as well as the doubled letters ss, tt and ee. Based on that, one can distinguish an English text from texts written in other languages. The same knowledge helps to recover the correct order of letters in scrambled words.
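Bigrams can be counted in much the same way as single letters. The following Python sketch is one possible approach: it counts pairs of adjacent letters inside words, so a pair never spans a space or punctuation mark. The function name bigram_counts and the sample sentence are assumptions made for this example.

```python
from collections import Counter
import re

def bigram_counts(text):
    """Count adjacent letter pairs inside words (pairs never span a space)."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts

# Hypothetical sample sentence.
sample = "The letter better settles on an even street"
print(bigram_counts(sample).most_common(5))
```

Whether pairs should be allowed to cross word boundaries is a design choice; many classical ciphertexts have no spaces at all, in which case the whole text is treated as one word.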

Frequency Analysis of Substitution Ciphers

Frequency analysis is used for breaking substitution ciphers. The general idea is to find the most frequent letters in the ciphertext and try replacing them with the most common letters of the language in use.

The attacker usually tries several possibilities, substituting some letters in the ciphertext, looking for words that start to appear, and making further substitutions based on them. Using computers, it is possible to try a large number of combinations in a relatively short time.
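As an illustration of letting a computer try many candidates, the Python sketch below attacks the simplest substitution cipher, a Caesar shift, which has only 26 possible keys. This is a deliberately narrow assumption: a general substitution cipher has far too many keys to enumerate this way. Each candidate plaintext is scored with a chi-squared statistic against the English letter frequencies listed in the table on this page. The names shift, chi_squared and best_shift, and the sample sentence, are made up for the example.

```python
# English letter frequencies in percent, a..z (values from the table on this page).
ENGLISH = [8.17, 1.49, 2.78, 4.25, 12.70, 2.23, 2.02, 6.09, 6.97, 0.15, 0.77,
           4.03, 2.41, 6.75, 7.51, 1.93, 0.10, 5.99, 6.33, 9.06, 2.76, 0.98,
           2.36, 0.15, 1.97, 0.07]

def shift(text, k):
    """Shift every letter a-z back by k positions; leave other characters alone."""
    out = []
    for ch in text.lower():
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") - k) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)

def chi_squared(text):
    """Lower score means the letter distribution is closer to typical English."""
    letters = [ch for ch in text if "a" <= ch <= "z"]
    n = len(letters)
    if n == 0:
        return float("inf")
    score = 0.0
    for i, pct in enumerate(ENGLISH):
        expected = n * pct / 100.0
        observed = letters.count(chr(ord("a") + i))
        score += (observed - expected) ** 2 / expected
    return score

def best_shift(ciphertext):
    """Try all 26 shifts; return (k, candidate plaintext) with the lowest score."""
    return min(((k, shift(ciphertext, k)) for k in range(26)),
               key=lambda pair: chi_squared(pair[1]))

# Hypothetical demonstration: encrypt with key 3, then recover the key.
plain = ("frequency analysis is used for breaking substitution ciphers "
         "and it works well on long english texts")
cipher = shift(plain, -3)  # a negative k shifts forward, i.e. encrypts
k, recovered = best_shift(cipher)
print(k, recovered)
```

The scoring is only as reliable as the amount of text available: on very short ciphertexts a wrong key can happen to score better than the right one.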

For example, if the most frequent letter in the analyzed ciphertext is x, one may guess that x stands for e or o (both among the most frequent letters in English) in the plaintext.
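This kind of first guess can be automated. The Python sketch below ranks the ciphertext letters by frequency and pairs them, in order, with the English letters ranked by the table on this page. It should be read as a starting point only: on real ciphertexts many of these pairings will be wrong and have to be corrected by hand or by further analysis. The names ENGLISH_ORDER, initial_guess and apply_guess are assumptions made for this example.

```python
from collections import Counter

# English letters from most to least frequent, ordered by the table on this page
# (the tie between j and x is broken arbitrarily).
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def initial_guess(ciphertext):
    """Map the i-th most frequent ciphertext letter to the i-th most
    frequent English letter. Letters absent from the ciphertext get no entry."""
    counts = Counter(ch for ch in ciphertext.lower() if "a" <= ch <= "z")
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))

def apply_guess(ciphertext, mapping):
    """Replace mapped letters; characters without a mapping stay unchanged."""
    return "".join(mapping.get(ch, ch) for ch in ciphertext.lower())

# Hypothetical toy ciphertext: x is the most frequent letter, so it is guessed as e.
toy = "xxxyyz"
guess = initial_guess(toy)
print(guess)
print(apply_guess(toy, guess))
```

From here the attacker would inspect the partially decoded text and swap individual pairings until recognizable words emerge.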

It is also useful to look for frequent pairs of letters, or even to guess longer frequent sequences of letters or whole words. The attacker always tries to find sequences of letters that are common in the language in question.

Date: 2020-03-09

Letter	Frequency in English text
a	8.17%
b	1.49%
c	2.78%
d	4.25%
e	12.70%
f	2.23%
g	2.02%
h	6.09%
i	6.97%
j	0.15%
k	0.77%
l	4.03%
m	2.41%
n	6.75%
o	7.51%
p	1.93%
q	0.10%
r	5.99%
s	6.33%
t	9.06%
u	2.76%
v	0.98%
w	2.36%
x	0.15%
y	1.97%
z	0.07%