Posted November 15, 2009 by Spyros in Cryptography

Create And Reverse Simple Substitution Ciphers For Fun (and Profit?)


As kids, most of us were fascinated by the constant war between good and evil in our favorite cartoons. Although we are grown-ups now (I suppose), there are still some battles that concern us. If you are interested in cryptography, there is a very good chance that the battle between cryptographers and cryptanalysts still fascinates you.

What’s even better is learning how to break certain cryptograms by hand, without computing power. Cryptography, the art of making a message unreadable to people who were not intended to read it, has been around for thousands of years. Even the outcome of many wars depended on how “unbreakable” a certain cipher was.

The purpose of this post is not to describe the history of cryptography and cryptanalysis, though I would like to write about it sometime. I thought that it would be a good idea to teach some of you how to create and reverse simple substitution ciphers. When I learned about this process, I began to like cryptography quite a lot. It’s a nice, fun process that will teach you some things and also let you have some fun (that is, if you are fond of these things; otherwise listening to some music could be more fun).

What is a Simple Substitution Cipher?

Kids use them sometimes. What they do is substitute each letter of the alphabet with another letter. Therefore, if I is substituted with A and S is substituted with B, the word “IS” becomes “AB”. When creating your own substitution alphabet, you may think that it is quite difficult to remember which letter substitutes for which. That is true, but there is a technique that solves this problem: to remember an alphabet, agree on a keyword that only you and the people you communicate with know.

Let’s suppose that you and I decide to use the keyword “CRYPTOGRAPHY” and I want to send you a message saying “THERE IS NOTHING MORE IMPORTANT THAN KNOWLEDGE”. The first thing that I have to do to encrypt my message is create the alphabet. Using the keyword we both know, I follow a very simple process:

1. Remove repeated letters from the keyword. For “CRYPTOGRAPHY”, when I remove the second R, P and Y, the word becomes “CRYPTOGAH”.

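2. Then, form the alphabet using the keyword without repeated letters, as you can see below:

Plaintext:  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Ciphertext: C R Y P T O G A H B D E F I J K L M N Q S U V W X Z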


As you see, I start with the letters of the keyword. You may choose to put the keyword somewhere in the middle for some extra security, but I use the beginning here for the sake of this example. After the letters of the keyword, I continue with the remaining letters of the alphabet, from A to Z. So, for the plaintext letter J I would use A. However, A was already used as a letter of the keyword, so we move on to the next letter, B. B was not used in our keyword, so we can use it now. For the next plaintext letter, K, we cannot use C because it is one of the letters of our keyword, so we use D. The procedure goes on until the alphabet is complete. To encrypt our message, we substitute each letter of our sentence using the alphabet above:

Plaintext:  THERE IS NOTHING MORE IMPORTANT THAN KNOWLEDGE
Ciphertext: QATMT HN IJQAHIG FJMT HFKJMQCIQ QACI DIJVETPGT



As you see, the encrypted text is not easy to read if you do not have the alphabet to reverse the process. However, if somebody knows that this ciphertext is the product of a simple substitution cipher, they can often reverse it and recover the original message.
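The keyword procedure described above can be sketched in a few lines of Python. This is only an illustration: the function names (build_cipher_alphabet, encrypt, decrypt) are my own, and the sketch assumes the keyword is placed at the beginning of the alphabet, as in the example.

```python
import string

def build_cipher_alphabet(keyword):
    """Keyword letters first (repeats dropped), then the unused letters A..Z."""
    seen = []
    for ch in keyword.upper() + string.ascii_uppercase:
        if ch in string.ascii_uppercase and ch not in seen:
            seen.append(ch)
    return "".join(seen)

def encrypt(plaintext, keyword):
    # Map A..Z onto the cipher alphabet; spaces and punctuation pass through.
    table = str.maketrans(string.ascii_uppercase, build_cipher_alphabet(keyword))
    return plaintext.upper().translate(table)

def decrypt(ciphertext, keyword):
    # The reverse mapping: cipher alphabet back onto A..Z.
    table = str.maketrans(build_cipher_alphabet(keyword), string.ascii_uppercase)
    return ciphertext.upper().translate(table)

message = "THERE IS NOTHING MORE IMPORTANT THAN KNOWLEDGE"
secret = encrypt(message, "CRYPTOGRAPHY")
print(build_cipher_alphabet("CRYPTOGRAPHY"))  # CRYPTOGAHBDEFIJKLMNQSUVWXZ
print(secret)                                 # QATMT HN IJQAHIG FJMT HFKJMQCIQ QACI DIJVETPGT
print(decrypt(secret, "CRYPTOGRAPHY"))
```

Anyone who knows the keyword can rebuild the same alphabet and call decrypt, which is exactly why only the keyword has to be remembered.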

How to Cryptanalyze a Simple Substitution Cipher

In order to reverse a simple substitution cipher, we have to think smart. An important idea is that some letters occur more frequently than others. Tables of English letter frequencies are easy to find. The simple word to remember is “etao”: the most used letter is e, then comes t, and then a and o. Therefore, if we count the letters of the ciphertext, we can extract some information:

Ciphertext :     QATMT    HN    IJQAHIG         FJMT      HFKJMQCIQ      QACI         DIJVETPGT

Letter Frequency For The Ciphertext

T = 5, Q = 5, I = 5, J = 4, A = 3, H = 3, M = 3
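Counting by hand is error-prone, so here is a minimal sketch of how the count could be done with Python's collections.Counter. The variable names are my own, and the cutoff of 3 is just a choice to keep the list short.

```python
from collections import Counter

ciphertext = "QATMT HN IJQAHIG FJMT HFKJMQCIQ QACI DIJVETPGT"

# Count letters only, ignoring the spaces between words.
counts = Counter(ch for ch in ciphertext if ch.isalpha())

# Show the letters that occur at least three times, most frequent first.
for letter, n in counts.most_common():
    if n >= 3:
        print(letter, "=", n)
```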

So, what do we see here? T and Q are among the most frequently used letters in the ciphertext, with J close behind. This is how I would go about deciphering this message:

(supposing that I don’t know anything about the substitution alphabet or the keyword; I only have the ciphertext and I know it is a simple substitution cipher)

1. The frequency of the letter T is quite high, which means that it most probably stands for one of the letters e, t, a or o, with e and t being the most likely.

Let’s suppose that the ciphertext letter T corresponds to the cleartext letter E. If we substitute it into our ciphertext, we get:

(the substituted letters are in lower case)

QAeMe    HN    IJQAHIG         FJMe      HFKJMQCIQ      QACI         DIJVEePGe

This looks like it could be right, but we cannot be sure yet. We have to keep trying with other letters as well. Now, let’s examine the ciphertext letter Q. Since we have used E already, it would be a good idea to choose between the cleartext letters t, a and o. Maybe we will have to try them all. As you see, it is a trial and error procedure. Suppose that Q corresponds to the cleartext letter T. The ciphertext now becomes:

tAeMe    HN    IJtAHIG         FJMe      HFKJMtCIt      tACI         DIJVEePGe

or, if you hide the unknown letters:

t_e_e      _ _      _ _t_ _ _ _     _ _ _e     _ _ _ _ _t_ _t     t_ _ _     _ _ _ _ _e_ _e

Still possible. Next comes J, which has a frequency of 4. It is quite possible that this letter is either a or o. Let’s try a:

tAeMe    HN    IatAHIG         FaMe      HFKaMtCIt      tACI         DIaVEePGe

or without the unknown letters:

t_e_e      _ _      _ at_ _ _ _     _ a _e     _ _ _ a _t_ _t     t_ _ _     _ _ a _ _e_ _e
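Rewriting the ciphertext by hand after every guess gets tedious. A small helper can do it; this is a sketch, and apply_guesses and its hide_unknown flag are names I made up for illustration. Guessed letters are shown in lower case, as in the text.

```python
def apply_guesses(ciphertext, guesses, hide_unknown=False):
    """Replace guessed cipher letters with lower-case plaintext letters.

    Unguessed letters stay in upper case, or become "_" if hide_unknown is set.
    """
    out = []
    for ch in ciphertext:
        if ch in guesses:
            out.append(guesses[ch].lower())
        elif ch.isalpha() and hide_unknown:
            out.append("_")
        else:
            out.append(ch)  # spaces and still-unknown cipher letters
    return "".join(out)

ciphertext = "QATMT HN IJQAHIG FJMT HFKJMQCIQ QACI DIJVETPGT"
guesses = {"T": "e", "Q": "t", "J": "a"}

print(apply_guesses(ciphertext, guesses))
# tAeMe HN IatAHIG FaMe HFKaMtCIt tACI DIaVEePGe
print(apply_guesses(ciphertext, guesses, hide_unknown=True))
# t_e_e __ _at____ _a_e ___a_t__t t___ __a__e__e
```

Going a step back is then just a matter of changing or deleting one entry in guesses and printing again.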

OK, as you see, at this point we can’t tell whether we are on the right track. Some things look fine, some do not, and we really can’t be sure whether we have done things correctly. At this point, we can follow different paths:

1. Keep going with the letter frequencies and, when we reach a letter that does not fit, go one or more steps back.

2. Be more versatile and take a wild guess.

Let’s try the second option this time. Do you notice the second word? It consists of two letters. As you know, there are not many two-letter words in the English language. Some of the most used ones are “to”, “or”, “at”, “be”, “of” and “is”. We have already used the letters e, t and a, which rules out “to”, “at” and “be”, so the possibilities that we would want to try are “or”, “of” and “is”. Now, let’s think a bit more. What are the chances that the second word is “or”? It could fit. “of” could be a valid choice as well. What would you try first? I’m quite sure that you would try “is”. It fits well here since it is a verb, and verbs often appear as the second word of a sentence. Let’s substitute with the idea that the ciphertext “HN” corresponds to “IS”:

t_e_e      is      _ at _i_ _     _ a _e     i _ _ a _t_ _t     t_ _ _     _ _ a _ _e_ _e
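The filtering of two-letter candidates can also be written down as a small check: a candidate word is only worth trying if it does not contradict the guesses made so far. The sketch below assumes the guesses T = e, Q = t and J = a from the earlier steps; the function name consistent and the short word list are my own choices.

```python
def consistent(cipher_word, candidate, guesses):
    """True if reading cipher_word as candidate fits the guesses made so far."""
    if len(cipher_word) != len(candidate):
        return False
    trial = dict(guesses)
    for c, p in zip(cipher_word, candidate):
        if c in trial and trial[c] != p:
            return False  # cipher letter already stands for another letter
        if c not in trial and p in trial.values():
            return False  # plaintext letter already taken by another cipher letter
        trial[c] = p
    return True

guesses = {"T": "e", "Q": "t", "J": "a"}
common_words = ["to", "or", "at", "be", "of", "is"]

candidates = [w for w in common_words if consistent("HN", w, guesses)]
print(candidates)  # ['or', 'of', 'is']
```

The check only removes impossible words. Choosing between the remaining ones still takes the kind of language intuition described above.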

What about the first word ? What could it possibly be ?

t_e_e is ……

Could it be “theme”? Nah. Maybe “there”? Yes, it could indeed be “there”. “There is” sounds like a nice possibility. Let’s substitute:

there      is      _ athi_ _     _ are     i _ _ art_ _t     th _ _     _ _ a _ _e_ _e

Do you remember that we selected the cleartext letter a for the ciphertext letter J? As you see in the letter frequencies, o is another possibility. What does the text look like if we consider J to be o instead of a?

there      is      _ othi_ _     _ ore     i _ _ ort_ _t     th _ _     _ _ o _ _e_ _e

Both versions seem plausible, but the first one seems to have a problem. What about the third word? It does not really make sense. What could it be? “pathing”? “bathing”? Not really likely, especially since the first and sixth letters of that word must be the same (both are the ciphertext letter I). It could be correct, but signs show that it most probably isn’t. On the other hand, the second version looks a little better, don’t you think? OK, let’s go on and examine the 6th word this time. It is “th_ _”. Maybe “then”? Or “this”? “than”? Well, we have used the letters s, i and e already, so “than” seems to be a pretty good choice. Let’s substitute in the J = A version:

there      is      nathin_     _ are     i _ _ artant     than     _ na _ _e_ _e

“Nathin_”? No chance, this is no good; there is something wrong here. Besides, a would now stand for two different ciphertext letters, J and C. What about the J = O version?

there      is      nothin_     _ ore     i _ _ ortant     than     _ no _ _e_ _e

Ohh, look at that! The whole phrase is almost revealed now. The third word is apparently “nothing”. “There is nothing …”. Nothing what? Nothing more, of course! After the final substitutions it is very easy to see that the phrase is “THERE IS NOTHING MORE IMPORTANT THAN KNOWLEDGE”.

So, now you have seen how the whole process unfolds. You try a possibility and see what happens, then go back and correct it or accept the result. With enough ciphertext and patience, you will usually get the cleartext in the end. After that, you can also rebuild the alphabet and extract the keyword, so that the next time you want to decrypt a message you can easily create the alphabet and reverse the process. I hope I was clear enough and helped you understand how this works. Hopefully, in the next cryptography-related post, I will explain how to reverse polyalphabetic ciphers like Vigenère.
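As a rough sketch of that last step, rebuilding the alphabet from a solved message could look like this in Python. The variable names are my own, and letters that never appear in the message are shown as “?”.

```python
import string

plaintext = "THERE IS NOTHING MORE IMPORTANT THAN KNOWLEDGE"
ciphertext = "QATMT HN IJQAHIG FJMT HFKJMQCIQ QACI DIJVETPGT"

# Pair up each plaintext letter with the cipher letter in the same position.
known = {}
for p, c in zip(plaintext, ciphertext):
    if p.isalpha():
        known[p] = c

# Lay the recovered letters out under A..Z; "?" marks letters we never saw.
row = "".join(known.get(p, "?") for p in string.ascii_uppercase)
print(string.ascii_uppercase)  # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(row)                     # C??PT?GAH?DEFIJK?MNQ??V???
```

From the letter under K onward, the recovered cipher letters run in alphabetical order (D E F I J K, then M N Q, then V), which suggests that the keyword sits at the start of the alphabet as “C??PT?GAH”. Filling the remaining gaps is then a small word puzzle.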