19 November 2013

One of the projects I've been working on is a set of tools to help with language-based puzzles (in the Mystery Hunt sense). Given how many different ideas you need to check over the course of solving a puzzle, the overhead of hunting online for a tool that applies a particular cipher, finds anagrams, or does something similar quickly adds up.

Furthermore, these tools are often terrible at what they're meant to do. For example, say you look for a tool to help you decode a message that has been Caesar shifted. You'll find that most of them ask you to input a shift value, even though there are only 26 possible options. It would be more useful if they displayed all the possible outputs alongside their shifts, and nicer still if the tool could recognize which shift actually produces the decoded text.

So while generating all 26 shifts of a string is easy, how might you pick out the one that is actually English text? Some of you might immediately say, "Frequency analysis!" That's a reasonable idea, but it relies on the text being long enough that its letter frequencies match those of English overall. Unfortunately, we often want to unshift very short strings. I'll draw examples from the puzzle Caesar's Palace. Consider the string Fego-wiex rek. The only real win that frequency analysis gives you is that 'e' should correspond to a common letter. This might get you somewhere, but I don't think it will easily narrow things down to a single candidate without consulting a human.
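
Generating the shifts is indeed the easy part. Here is a minimal Python sketch (the helper names caesar_shift and all_shifts are just illustrative, not from any particular tool):

def caesar_shift(text, shift):
    """Shift each letter of text by `shift` places, leaving other characters alone."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def all_shifts(text):
    """All 26 candidate decodings as (shift, shifted text) pairs."""
    return [(s, caesar_shift(text, s)) for s in range(26)]

# all_shifts("Fego-wiex rek") contains (22, 'Back-seat nag') among its entries.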

So what is better than single letters? Bigrams! It's pretty clear that adjacent letters in English text are not independent, so by using bigrams you should be able to tell whether the letters flow into each other the way they would in an English word. To gather data, I downloaded the Open American National Corpus, split all the text into words, and then threw away all the non-alphabetic characters. To give you a sense of what the data looks like, the top few words are

820287 the
441259 of
411021 and
340520 to
308285 a
282433 in
198855 that
153145 i
146037 is
135339 for
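
For concreteness, here is a rough sketch of that preprocessing step in Python, assuming the corpus has been unpacked into plain-text files (the directory layout, the Counter, and the regex-based cleanup are my choices for illustration, not necessarily how I did it at the time):

import collections
import pathlib
import re

def count_words(corpus_dir):
    """Count lowercased, purely alphabetic words across all .txt files under corpus_dir."""
    counts = collections.Counter()
    for path in pathlib.Path(corpus_dir).rglob('*.txt'):
        text = path.read_text(errors='ignore').lower()
        # Keep only alphabetic runs, which both splits the text into words
        # and discards non-alphabetic characters.
        counts.update(re.findall(r'[a-z]+', text))
    return counts

# count_words('oanc/').most_common(10) would produce a list like the one above.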

From here, I could get counts for letters, bigrams, initial letters, and so on. Grading a string simply by bigram frequencies is probably going to be terrible, because overlapping bigrams aren't independent: the second letter of one is necessarily the first letter of the next. So instead I had the idea to consider the conditional probability of each letter, given the previous letter.

Essentially, the idea is to build a Markov chain with one state per letter, plus states for the start and end of a word. The grade of a string can then be based on the probability that the Markov chain would produce that string. Since that probability will obviously be ridiculously low for a long string, I actually used the log probability.
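
A sketch of what that might look like in Python, building on the word counts above (the start/end markers, the flat penalty for unseen transitions, and the function names are all my own illustrative choices):

import collections
import math

START, END = '^', '$'

def train_transitions(word_counts):
    """Estimate log P(next state | current state) from frequency-weighted words."""
    pair_counts = collections.defaultdict(collections.Counter)
    for word, count in word_counts.items():
        states = [START] + list(word) + [END]
        for a, b in zip(states, states[1:]):
            pair_counts[a][b] += count
    log_probs = {}
    for a, nexts in pair_counts.items():
        total = sum(nexts.values())
        for b, c in nexts.items():
            log_probs[(a, b)] = math.log(c / total)
    return log_probs

def score(text, log_probs, floor=-20.0):
    """Log-probability that the chain generates the words in `text`;
    unseen transitions get a crude flat penalty."""
    total = 0.0
    words = ''.join(ch if ch.isalpha() else ' ' for ch in text.lower()).split()
    for word in words:
        states = [START] + list(word) + [END]
        for a, b in zip(states, states[1:]):
            total += log_probs.get((a, b), floor)
    return total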

Simply choosing the Caesar shift with the highest score in this way seems to give the correct answer most of the time. For the example above, the shifts with their accompanying scores are

22  BACK-SEAT NAG   -35.553955
0   FEGO-WIEX REK   -40.14637
9   ONPX-FRNG ANT   -43.753338
4   JIKS-AMIB VIO   -45.285454
14  TSUC-KWSL FSY   -46.240368
16  VUWE-MYUN HUA   -53.537483
20  ZYAI-QCYR LYE   -54.406334
8   NMOW-EQMF ZMS   -54.7398
24  DCEM-UGCV PCI   -55.135036
10  POQY-GSOH BOU   -55.912315
7   MLNV-DPLE YLR   -58.000626
23  CBDL-TFBU OBH   -60.348557
6   LKMU-COKD XKQ   -60.728783
13  SRTB-JVRK ERX   -62.953842
21  AZBJ-RDZS MZF   -63.872337
15  UTVD-LXTM GTZ   -65.40635
3   IHJR-ZLHA UHN   -65.4341
11  QPRZ-HTPI CPV   -66.07852
1   GFHP-XJFY SFL   -70.07124
17  WVXF-NZVO IVB   -71.3998
25  EDFN-VHDW QDJ   -72.71949
18  XWYG-OAWP JWC   -73.333275
2   HGIQ-YKGZ TGM   -73.34305
12  RQSA-IUQJ DQW   -75.883766
5   KJLT-BNJC WJP   -76.96932
19  YXZH-PBXQ KXD   -81.58014

I used the natural logarithm, so the difference of about 4.6 between the top two results means that the string BACK-SEAT NAG is judged about 100 times more likely than the runner-up, which is just the unshifted input FEGO-WIEX REK.
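
Putting the hypothetical helpers from the earlier sketches together, ranking the shifts is then just a sort by score:

def rank_shifts(ciphertext, log_probs):
    """Return (shift, candidate plaintext, score) triples, best-scoring first."""
    candidates = [(s, p, score(p, log_probs)) for s, p in all_shifts(ciphertext)]
    return sorted(candidates, key=lambda c: c[2], reverse=True)

# rank_shifts("Fego-wiex rek", log_probs)[0] should come out as
# (22, 'Back-seat nag', ...), matching the table above.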

Of course, this method has some weaknesses, and very short strings are still liable to give bad results. One example from the puzzle is the ciphertext BOX ixwn, where the correct decoding SFO zone comes in third behind IVE pedu and HUD odct. Another is the ciphertext idxa vxuihu, where the almost-English toil giftsf barely beats out the correct faux surfer. For the most part, though, it works very well, and of course it's more reliable for longer strings. Next, I'm interested in seeing whether this method can help with solving arbitrary cryptograms.