One of the projects I've been working on is a set of tools to help with language-based puzzles (in the Mystery Hunt sense). Given how many different ideas you need to check over the course of solving a puzzle, the overhead of searching online for a tool that does a particular cipher, finds anagrams, or the like adds up quickly.
Furthermore, these tools are often terrible at what they're meant to do. Say, for example, that you want to decode a message that has been Caesar shifted. Most tools you'll find ask you to input a shift value, even though there are only 26 possible options. It would be more useful to display all 26 outputs with their respective shifts, and nicer still if the tool could recognize which shift produces the decoded text.
So while generating all 26 shifts of a string is easy, how might you pick out the one that is actually English text? Some of you might immediately say, "Frequency analysis!" That's a reasonable idea, but it relies on the text being long enough for its letter frequencies to match English overall. Unfortunately, we often want to unshift very short strings. I'll draw examples from the puzzle. Consider the string Fego-wiex rek. The only real win that frequency analysis gives you is that 'e' should correspond to a common letter. This might get you somewhere, but I don't think it will be easy to narrow down to a single candidate without consulting a human.
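Generating the candidates is indeed the easy part. Here's a minimal sketch (my own code, not necessarily what any particular tool uses) that produces all 26 shifts while preserving case and punctuation:

```python
def caesar_shifts(text):
    """Return all 26 Caesar shifts of text, preserving case and punctuation."""
    results = []
    for shift in range(26):
        shifted = []
        for ch in text:
            if ch.isalpha():
                # Shift within the alphabet, keeping the original case.
                base = ord('A') if ch.isupper() else ord('a')
                shifted.append(chr((ord(ch) - base + shift) % 26 + base))
            else:
                # Leave hyphens, spaces, etc. untouched.
                shifted.append(ch)
        results.append(''.join(shifted))
    return results

# Shift 22 of the example ciphertext recovers the plaintext:
print(caesar_shifts("Fego-wiex rek")[22])  # Back-seat nag
```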
So what is better than single letters? Bigrams! It's pretty clear that adjacent letters in English text are not independent, so by using bigrams you should be able to tell whether the letters flow into each other the way they would in an English word. To gather data, I downloaded the Open American National Corpus, split all the text into words, and then threw away all the non-alphabetic characters. To give you a sense of what the data looks like, the top few words are
820287 the
441259 of
411021 and
340520 to
308285 a
282433 in
198855 that
153145 i
146037 is
135339 for
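From word counts like these, the transition counts can be accumulated with a short script. This is my own sketch, not the author's code; '^' and '$' are hypothetical markers I'm using for the start and end of a word:

```python
from collections import defaultdict

def transition_counts(word_counts):
    """Accumulate letter-to-letter transition counts, weighted by word
    frequency. '^' marks the start of a word and '$' the end."""
    counts = defaultdict(lambda: defaultdict(int))
    for word, n in word_counts:
        states = ['^'] + list(word) + ['$']
        for a, b in zip(states, states[1:]):
            counts[a][b] += n
    return counts

# A toy slice of the corpus counts shown above:
counts = transition_counts([("the", 820287), ("of", 441259), ("to", 340520)])
print(counts['^']['t'])  # 1160807 -- "the" and "to" both start with t
print(counts['t']['h'])  # 820287
```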
From here, I could get counts for letters, bigrams, initial letters, and so on. Grading a string simply by its raw bigram frequencies is probably going to be terrible, because adjacent bigrams aren't independent: the second letter of one is necessarily the first letter of the next. So instead I had the idea of considering the conditional probability of each letter given the previous one.
Essentially, the idea is to build a Markov chain with a state for each letter, plus one state for the start of a word and one for the end. The grade of a string can then be based on the probability that the Markov chain would produce it. Since that probability is ridiculously small for any long string, I actually used the log probability.
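Concretely, the grading and shift-picking steps might look like the following sketch. This is my own illustration, not the author's implementation; the transition counts are assumed to come from the corpus, and the floor penalty for unseen transitions is an assumption of mine (the post doesn't say how zero counts are handled):

```python
import math

def shift_text(text, k):
    """Caesar-shift the lowercase letters of text by k; helper for this sketch."""
    return ''.join(
        chr((ord(c) - ord('a') + k) % 26 + ord('a')) if c.islower() else c
        for c in text
    )

def build_log_probs(counts):
    """Turn raw transition counts (prev -> {next: count}) into log
    conditional probabilities, i.e. log P(next | prev)."""
    log_probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        log_probs[prev] = {c: math.log(n / total) for c, n in nexts.items()}
    return log_probs

def grade(text, log_probs, floor=-15.0):
    """Log probability that the letter Markov chain emits each word of text,
    with '^'/'$' as start/end states and a heavy penalty (floor) for
    transitions never seen in the corpus."""
    score = 0.0
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        states = ['^'] + letters + ['$']
        for a, b in zip(states, states[1:]):
            score += log_probs.get(a, {}).get(b, floor)
    return score

def best_shift(ciphertext, log_probs):
    """Return the shift whose decoding grades highest."""
    return max(range(26),
               key=lambda k: grade(shift_text(ciphertext.lower(), k), log_probs))
```

With a model trained on real counts, `best_shift` implements the "pick the highest-scoring shift" rule described below.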
Simply choosing the Caesar shift with the highest score in this way seems to give the correct answer most of the time. For the earlier example, the shifts and their scores are
22 BACK-SEAT NAG -35.553955
 0 FEGO-WIEX REK -40.14637
 9 ONPX-FRNG ANT -43.753338
 4 JIKS-AMIB VIO -45.285454
14 TSUC-KWSL FSY -46.240368
16 VUWE-MYUN HUA -53.537483
20 ZYAI-QCYR LYE -54.406334
 8 NMOW-EQMF ZMS -54.7398
24 DCEM-UGCV PCI -55.135036
10 POQY-GSOH BOU -55.912315
 7 MLNV-DPLE YLR -58.000626
23 CBDL-TFBU OBH -60.348557
 6 LKMU-COKD XKQ -60.728783
13 SRTB-JVRK ERX -62.953842
21 AZBJ-RDZS MZF -63.872337
15 UTVD-LXTM GTZ -65.40635
 3 IHJR-ZLHA UHN -65.4341
11 QPRZ-HTPI CPV -66.07852
 1 GFHP-XJFY SFL -70.07124
17 WVXF-NZVO IVB -71.3998
25 EDFN-VHDW QDJ -72.71949
18 XWYG-OAWP JWC -73.333275
 2 HGIQ-YKGZ TGM -73.34305
12 RQSA-IUQJ DQW -75.883766
 5 KJLT-BNJC WJP -76.96932
19 YXZH-PBXQ KXD -81.58014
I used the natural logarithm, so the difference of about 4.6 between the top two results means that the string BACK-SEAT NAG is about 100 times more likely than the runner-up, which was the input itself.
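As a quick sanity check on that arithmetic (my own verification, using the two top scores from the table):

```python
import math

# Gap between the top two scores: BACK-SEAT NAG vs. the unshifted input.
gap = -35.553955 - (-40.14637)
print(round(gap, 6))   # 4.592415
print(math.exp(gap))   # ~98.7, i.e. roughly a 100x likelihood ratio
```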
Of course, there are some weaknesses with this method, and very short strings are still liable to give bad results. One example from the puzzle is the ciphertext Box ixwn, where the correct shift SFO ZONE comes in third behind IVE PEDU and HUD ODCT. Another example is the ciphertext Toil giftsf, where the almost-English input itself barely beats out the correct FAUX SURFER. For the most part, though, it works very well, and it is of course more reliable for longer strings. Soon I'll be interested to see whether this method can help with solving arbitrary cryptograms.