11 May 2016

Interviews in the software industry have produced some extraordinary negative reactions. There's one complaint in particular that comes up again and again: that algorithmic interview questions favor people who have the corresponding algorithm memorized.

I can't say for sure, but it sounds like the people who make this complaint imagine that the people who are succeeding at these interviews have some sort of mental algorithms checklist and go through that checklist when prompted with a problem. I know that this is an extremely poor description of my own problem solving methods, and I would expect that it's an extremely poor description for most successful problem solvers as well.

That said, there is certainly some amount of knowledge that I have memorized. That shouldn't be surprising at all. Do I have 2 times 3 memorized? Yes, of course. You probably do too. Do I have 13 times 23 memorized? No, not at all. Can I compute 13 times 23? Yes. If someone asks what 13 times 23 is, are they expecting the answerer to be recalling it from memory? Probably not.

Similarly, I expect that Google's interviewers are not expecting candidates to have answers to their questions memorized. If anything, it should be the opposite: if it is reasonable for someone to have seen the problem before and memorized the solution, the question loses some of its potential to provide signal to the interviewer.

Problem solving is built on top of a foundation, and typically that foundation gets memorized whether intentionally or not. So yes, I've memorized several special-purpose algorithms. But much more of what I have memorized are techniques that are applicable to many problems.

For example, there is a technique that I refer to as "bucketing" that is generally useful for speeding up range queries (asking about some computation on a range of input data rather than the entire input data). As one example, suppose you have an array of numbers and you want to support updating a single number or asking for the sum of a range of numbers.

Start: 4 1 3 8 5 2 3 4 7
Sum a+a+a+a = 1+3+8+5 = 17
Update a to 6
Sum a+a+a+a+a = 3+6+5+2+3 = 19
...

The bucketing technique relies on the fact that you don't need to sum up each number individually. If you've already summed an interval and that interval is entirely contained in the query, you can shortcut by using its precomputed sum. One common blocking method is that if you have $$N$$ items, split them into $$\sqrt{N}$$ blocks of $$\sqrt{N}$$ items.

4 1 3 | 8 5 2 | 3 4 7
8   |   15  |   14

With this example $$N = 9$$, so $$\sqrt{N} = 3$$. So now if we are queried about a+a+a+a+a, we already have the sum a+a+a precomputed. This $$\sqrt{N}$$ bucketing technique essentially decomposes the query interval into at most $$\sqrt{N}$$ subintervals, and at most $$2\sqrt{N}$$ lone elements on the ends. So it provides a general way of speeding up queries from $$O(N)$$ to $$O(\sqrt{N})$$, while retaining constant time updates.

Yes, I carry around this bucketing idea with me, as well as several other ideas that are related. For example, you can make a tree structure where each interval is split in half as you go down the tree. This tree structure is known in competitive programming as a "range tree", although it might have a different name in other literature. The range tree is a similarly general technique that speeds your queries up to $$O(\log N)$$, but your updates then also become $$O(\log N)$$.

Then yet another related idea is that of a binary index tree, which reduces the amount of storage required, but relies on (in this case) the fact that if $$A$$ is a subset of $$B$$, then you can compute the sum of $$B - A$$ from the sum of $$B$$ and the sum of $$A$$.

Certainly knowing this technique is valuable in an interview setting. But it's not because some interviewer is going to ask me "What's a range tree?" It's useful because it solves problems, and those problems aren't limited to interviews. At Dropbox I was given the task of speeding up a dashboard that computed certain statistics about availability data over a user-specified time range.

The availability data was stored on a minute-by-minute basis, and the existing dashboard would scan over the entire time range and essentially computed sums by individually adding up numbers for every minute. So if someone used the dashboard to compute availability over a month, it was scanning tens of thousands of database rows. So what I did was set up a bucketing scheme where the minutes were grouped on a day-by-day basis, and once the aggregate statistic was computed for the day, future queries containing that day could use the already computed value.

Alongside the complaints that interviews are about memorizing algorithms, I also see a lot of comments to the effect that algorithms are purely academic solutions that don't arise in practice. I believe that this is entirely false, and instead what is happening is that such scenarios arise frequently but the applicability of algorithm knowledge goes unnoticed. And, of course, when I say "algorithm knowledge," I don't mean memorizing individual algorithms.