One of the projects that I've been working on is a set of tools to help with language-based puzzles (in the Mystery Hunt sense). Given how many different ideas you need to check over the course of solving a puzzle, the overhead of looking online for a tool that does a particular cipher, finds anagrams, or the like adds up quickly.
Furthermore, these tools are often terrible at what they're meant to do. For example, say you look for a tool to help you decode a Caesar-shifted message. Most of them ask you to input a shift value, despite the fact that there are only 26 possible options. It would be more useful to display all 26 outputs with their shifts, and nicer still if the tool could recognize which shift produces the decoded text.
So while generating all 26 shifts of a string is easy, how might you pick out
the one that is actually English text? Some of you might immediately say,
"Frequency analysis!" That's a reasonable idea, but it relies on the text
being long enough that frequencies match up with English overall.
Unfortunately, we often want to unshift very short strings. I'll use the puzzle
to get examples. Consider the string
Fego-wiex rek. The only real win that
frequency analysis gives you is that 'e' should correspond to a common letter.
This might get you somewhere, but I don't think it will be easy to reduce to
just a single candidate without consulting a human.
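Generating the shifts is the easy half. As a minimal sketch (my own illustration in Python, not the author's code), preserving case and punctuation:

```python
def caesar_shifts(text):
    """Return all 26 Caesar shifts of `text`, preserving case and punctuation."""
    shifts = []
    for s in range(26):
        out = []
        for ch in text:
            if ch.isalpha():
                base = ord('A') if ch.isupper() else ord('a')
                out.append(chr((ord(ch) - base + s) % 26 + base))
            else:
                out.append(ch)  # leave hyphens, spaces, etc. untouched
        shifts.append(''.join(out))
    return shifts

# caesar_shifts("Fego-wiex rek")[22] -> "Back-seat nag"
```

Shift 0 is the input itself, so the correct decoding is always somewhere in the returned list; the interesting problem is ranking the 26 candidates.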
So what is better than single letters? Bigrams! It's pretty clear that adjacent letters in English text are not independent, so by using bigrams you should be able to tell whether the letters flow into each other as they would in an English word. To gather data, I downloaded the Open American National Corpus, split all the text into words, and then threw away all the non-alphabetic characters. To give you a sense of what the data looks like, the top few words are:
From here, I could get counts for letters, bigrams, initial letters, and so on. Grading a string simply by bigram frequencies is probably going to be terrible, because bigrams aren't independent -- the second letter of one is necessarily the first letter of the next. So I had the idea to consider the conditional probability of one letter, given what the previous letter was.
Essentially, the idea is to make a Markov chain with a state for each letter, the start of a word, and the end of a word. Then the grade of a string can be based on the probability that the Markov chain would produce that string. Since the probability for a long string will obviously be ridiculously low, I actually used the log probability.
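Here is one way that scoring could look in Python. This is a sketch under my own assumptions: the `^`/`$` markers for word start and end, the probability floor for unseen transitions, and the training word list are my choices, and the real corpus counts are not reproduced here.

```python
import math
from collections import defaultdict

def train(words):
    """Count letter transitions, with '^' marking word start and '$' word end."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        chain = ['^'] + list(w.lower()) + ['$']
        for a, b in zip(chain, chain[1:]):
            counts[a][b] += 1
    return counts

def log_prob(counts, text, floor=1e-6):
    """Log probability that the Markov chain generates each word of `text`."""
    total = 0.0
    letters_only = ''.join(c if c.isalpha() else ' ' for c in text.lower())
    for w in letters_only.split():
        chain = ['^'] + list(w) + ['$']
        for a, b in zip(chain, chain[1:]):
            n = sum(counts[a].values())
            p = counts[a][b] / n if n else 0.0
            total += math.log(max(p, floor))  # floor avoids log(0) for unseen pairs
    return total
```

Decoding then amounts to scoring all 26 shifts of the ciphertext and taking the one with the highest log probability.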
Simply choosing the Caesar shift with the highest score in this way seems to give the correct answer most of the time. For the example before, the shifts with their accompanying scores are:
22 BACK-SEAT NAG -35.553955
0 FEGO-WIEX REK -40.14637
9 ONPX-FRNG ANT -43.753338
4 JIKS-AMIB VIO -45.285454
14 TSUC-KWSL FSY -46.240368
16 VUWE-MYUN HUA -53.537483
20 ZYAI-QCYR LYE -54.406334
8 NMOW-EQMF ZMS -54.7398
24 DCEM-UGCV PCI -55.135036
10 POQY-GSOH BOU -55.912315
7 MLNV-DPLE YLR -58.000626
23 CBDL-TFBU OBH -60.348557
6 LKMU-COKD XKQ -60.728783
13 SRTB-JVRK ERX -62.953842
21 AZBJ-RDZS MZF -63.872337
15 UTVD-LXTM GTZ -65.40635
3 IHJR-ZLHA UHN -65.4341
11 QPRZ-HTPI CPV -66.07852
1 GFHP-XJFY SFL -70.07124
17 WVXF-NZVO IVB -71.3998
25 EDFN-VHDW QDJ -72.71949
18 XWYG-OAWP JWC -73.333275
2 HGIQ-YKGZ TGM -73.34305
12 RQSA-IUQJ DQW -75.883766
5 KJLT-BNJC WJP -76.96932
19 YXZH-PBXQ KXD -81.58014
I used the natural logarithm, so the difference of about 4.6 between the top two results means that the string BACK-SEAT NAG is about 100 times more likely than the second-place result, which was the unshifted input.
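Checking that arithmetic with a quick sketch:

```python
import math

# Difference between the top two log scores in the table above:
# BACK-SEAT NAG at -35.553955 versus the unshifted input at -40.14637.
diff = -35.553955 - (-40.14637)
ratio = math.exp(diff)  # roughly 100
```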
Of course, there are some weaknesses with this method, and very short strings are still liable to give bad results. One example from the puzzle is BOX ixwn, where the correct shift, SFO zone, comes in third behind IVE pedu and HUD odct. Another example is a ciphertext where the almost-English toil giftsf barely beats out the correct faux surfer. Still, for the most part the method works very well, and it's naturally more reliable for longer strings. Next, I'm interested in seeing whether this method can help with solving arbitrary cryptograms.