15 February 2017

It's been almost a year since AlphaGo vs Lee Sedol, and since then there's been a lot of discussion about the games from that series, as well as the series of 60 games that it played online under the name Master. A lot of the discussion has actually been somewhat frustrating to read, so I'm going to vent for a bit on how I believe AlphaGo should currently be viewed.

In order to properly discuss my main rant, I'm going to first give a brief introduction to endgame theory.

The most important idea for analyzing endgames is that of adding two games together. The sum of two games is another game where on your turn you pick one of the two games to play in. So you could imagine a game of "chess plus checkers" where each turn is either a turn on the chess board or a turn on the checkers board. Say your opponent makes a move on the chess board. Now you have a choice: do you want to respond to that move also on the chess board, or is it better to take a turn on the checkers board and accept the potential loss of allowing two consecutive chess moves?

If you were to actually add a game of chess and a game of checkers, you'd have to also determine a way to say who wins. I'm going to conveniently avoid talking about that for general games, because for Go positions the answer is simple: add up the points from each game. So you could imagine a game of "Go plus Go" where you're playing simultaneously on two boards, and on your turn you pick one of the boards to play on. At the end of the game, instead of counting territory from just one board, you count it from both.

As it turns out, when a Go game reaches the final stages, the board is typically partitioned into small areas that don't interact with each other. In these cases, even though these sections exist on the same board, you can think of them as entirely separate games being added together. Once we have that, there's still the question: how do you determine which section to play in?

## Temperature

Here is one of the simplest positions that is still unsettled. If black plays first, then he can make a point (marked with a triangle in the image below). On the other hand, if white plays first, then she can prevent black from making that point.

You're probably tempted to describe this type of move as a "one-point" move, because the difference between moving first or moving second is one point. However, a better way to think about it is that the unsettled position has a value of 0.5 black points, and either player may play there to gain 0.5 points.

This concept generalizes to the idea of the temperature of a game. We define an "environmental" version of a position by making passing worth some number of points. For the position above, imagine there was a coupon worth one point that you could take instead of playing, but only one such coupon. How would the game go?

If black plays first, say they play on the Go board. Then white takes the coupon and both players end up with one point. On the other hand, if black takes the coupon, then after white plays on the Go board, black has a point from the coupon while white has nothing. So black will prefer to take the coupon.

If white plays first and plays on the Go board, then black will get the coupon and again black will end up with one point while white ends up with nothing. On the other hand, if white takes the coupon, then after black plays on the Go board, both players end up with one point. So white also prefers to take the coupon.

The fact that both players would rather take a one point coupon than play in this position indicates that the coupon is worth more than a move on the board. What if the coupon were only worth, say, 0.2 points? In this case, both players will prefer to play on the Go board. Let's see why:

If black goes first and takes the coupon, then they'll get 0.2 points while white has nothing, a win by 0.2. On the other hand, if black plays on the Go board, then they get a full point, while white only gets 0.2 points from the coupon, which is a win by 0.8 points. So black prefers to play on the Go board. Similarly with white going first, if white takes the coupon black will win by 0.8, but if white plays on the board then black will only win by 0.2. So white also prefers to play on the Go board.

Now what if the coupon is worth exactly 0.5? In this case, the players will both be indifferent between taking the coupon and playing on the board. If black plays first and takes the coupon, black will have 0.5 points while white has 0, a win by 0.5. If black plays first on the board, then black will have 1 point while white has 0.5, again a win by 0.5. Similarly, regardless of white's decision, black will win by 0.5.
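The three coupon sizes worked through above can be checked mechanically. The following is a minimal Python sketch, not anything taken from a Go library: the names `outcome` and `preference` are made up for illustration, and scores are counted as black's points minus white's.

```python
from fractions import Fraction

BOARD_POINT = 1  # black makes one point by playing first; white playing first leaves none


def outcome(first, choice, coupon):
    """Black's points minus white's when `first` ("B" or "W") opens with
    `choice` ("board" or "coupon") and the opponent takes what is left."""
    second = "W" if first == "B" else "B"
    board_player = first if choice == "board" else second
    coupon_player = second if choice == "board" else first
    score = BOARD_POINT if board_player == "B" else 0
    score += coupon if coupon_player == "B" else -coupon
    return score


def preference(first, coupon):
    """Which opening choice `first` prefers: black maximizes, white minimizes."""
    board = outcome(first, "board", coupon)
    take = outcome(first, "coupon", coupon)
    if board == take:
        return "indifferent"
    black_likes_board = board > take
    return "board" if black_likes_board == (first == "B") else "coupon"


# Exact fractions avoid any rounding noise in the comparison.
for size in ("1", "0.2", "0.5"):
    c = Fraction(size)
    print(size, preference("B", c), preference("W", c))
```

Running it reports that both players take the coupon at 1 point, both play on the board at 0.2, and both are indifferent at 0.5.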

The size of the coupon where players are indifferent between playing and passing is called the temperature of the game. In this case, the temperature is 0.5. Usually, when presented with the sum of two games, you'll want to play in the one with the higher temperature. In other words, a lot of the calculation that goes into finding the proper endgame play is determining the temperatures of the various positions that are on the board.
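To make the "play in the hotter game" rule concrete, here is a minimal Python sketch. It assumes, purely for illustration, that every region is a simple one-move position like the one above (whoever plays there settles it), so its temperature is half the difference between the two outcomes. The function name and the region values are hypothetical.

```python
def play_hottest_first(regions, black_to_move=True):
    """Total board score for black when both players always move in the
    hottest remaining region.

    Each region is (score_if_black_plays_first, score_if_white_plays_first),
    counted in black's points, and is settled by a single move.
    """
    # For a one-move region, temperature is half the swing between outcomes.
    ordered = sorted(regions, key=lambda r: (r[0] - r[1]) / 2, reverse=True)
    total = 0
    for black_first, white_first in ordered:
        total += black_first if black_to_move else white_first
        black_to_move = not black_to_move  # players alternate turns
    return total


# Three hypothetical regions with temperatures 0.5, 2 and 1.5.
regions = [(1, 0), (4, 0), (2, -1)]
print(play_hottest_first(regions, black_to_move=True))
print(play_hottest_first(regions, black_to_move=False))
```

This only covers the easy case. Positions with forcing follow-ups need more care than sorting by a single number.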

Be careful, though! Finding temperature is more than just looking at two results and splitting the difference.

Here's an endgame position and the results if black moves first and if white moves first. Comparing the two, we can see that if black moves first, he ends up with one more point and white ends up with two fewer points than if white moves first, a total difference of three points. However, the position does not have a temperature of 1.5! Let's imagine a game with a coupon of size 1.5 and see what happens. To make the values convenient to talk about, let's assign the position where white moves first a value of 0, so black moving first gains 3 points.

If black moves first and takes the coupon, then he'll end up with 1.5 points from the coupon, but 0 points from the board, a total gain of 1.5. On the other hand, if black moves first on the board, then after white responds with moves 2 and 4, black then gets to take the coupon anyway! So black gains 1.5 points from the coupon along with the 3 points from the board, a total steal. If white tried to take the coupon instead of playing 4, the result would be even worse:

If white moves first, the single 1.5 point coupon leaves them indifferent: playing on the board hands black the coupon, while taking the coupon hands black the 3 point gain on the board, and either way black comes out 1.5 points ahead. Since black strictly prefers the board, though, the temperature of this position is certainly larger than 1.5. In fact, from white's perspective, the temperature of this position is exactly 3, although a single coupon is too crude a tool to show it.

However, from black's perspective, things are a bit more complicated. In the presence of a 3 point coupon, black should play on the board because he'll get both the 3 point gain from the board as well as the 3 point gain from the coupon! This is what Go players are referring to when they say that a play is "sente": the followup is so big that the other player is essentially forced to respond. So black won't actually take a coupon unless it's so much bigger that white is still happy to take the coupon instead of responding to black's move. I'm not sure exactly what value this would be, but for example, 20 points would certainly suffice.

But that's still not the full story. When considering the temperature of a game, it's best not to think of it in the presence of a single coupon, but with a big stack of coupons of diminishing value. So imagine that there's a coupon worth 10 points, then a coupon worth 9.5, then one worth 9, and so on down to 0.5. In this case, white still won't play on the board until the size of the coupon reaches 3 points. Let's see why.

If the largest coupon available is 3 points, then say white plays on the board. After that, the players will both just take coupons, so black gets the 3 point coupon, the 2 point coupon, and the 1 point coupon, while white gets the 2.5 point coupon, the 1.5 point coupon, and the 0.5 point coupon. In this case, black gets a total of 6 points, while white gets a total of 4.5 points: a 1.5 point advantage for black. On the other hand, if white takes the coupon, black will play out the board sequence shown above, and then the players will exchange coupons. The result is that white has the 3, 2, and 1 point coupons, while black gets the 2.5, 1.5, and 0.5 point coupons, but black also gets the 3 points on the board, again a net of 1.5 points for black.
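The coupon-stack comparison above can also be checked by brute force. The sketch below is a small minimax search in Python under two assumptions that are not in the text: black's forcing sequence is collapsed into a single exchange (black plays, white answers), and the unstated follow-up if white ignores it is given a hypothetical value `FOLLOWUP` of 20 points. All names are illustrative.

```python
from functools import lru_cache

# Hypothetical: extra points black gains if white ignores the forcing move.
FOLLOWUP = 20.0

# A board is either a settled score (float, in black's points) or a pair
# (position after black plays, position after white plays).
AFTER_BLACK = (3.0 + FOLLOWUP, 3.0)  # black follows up, or white answers
START = (AFTER_BLACK, 0.0)           # black plays first, or white plays first


def coupon_stack(top, step=0.5):
    """Coupons top, top - step, ..., step, largest first."""
    return tuple(top - i * step for i in range(int(top / step)))


def moves(board, coupons, black_to_move):
    """Final score (black minus white) after each choice open to the mover."""
    sign = 1 if black_to_move else -1
    result = {}
    if coupons:  # take the largest remaining coupon
        rest = best(board, coupons[1:], not black_to_move)
        result["coupon"] = sign * coupons[0] + rest
    if isinstance(board, tuple):  # the board is still unsettled
        nxt = board[0] if black_to_move else board[1]
        result["board"] = best(nxt, coupons, not black_to_move)
    return result


@lru_cache(maxsize=None)
def best(board, coupons, black_to_move):
    """Minimax value of the rest of the game; black maximizes, white minimizes."""
    options = list(moves(board, coupons, black_to_move).values())
    if not options:
        return board  # settled board and no coupons left
    return max(options) if black_to_move else min(options)


# White to move with coupons 3, 2.5, ..., 0.5 on the table.
print(moves(START, coupon_stack(3.0), black_to_move=False))
```

Under these assumptions both of white's choices come out to a 1.5 point net for black, matching the hand count. Starting the stack at 3.5 makes the coupon the better choice for white, and starting it at 2 makes the board move better.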

So white will always pick to take coupons larger than 3 points, but will respond to black playing on the board as long as the coupon isn't enormous (larger than black's followup). So black has some flexibility in when to play this move, but he definitely wants to play it before white will be incentivized to take it away. Because of this flexibility, in an actual game, we'd expect black to always have some opportunity to play it at a time where white wants to respond. In this case, we can loosely think of the game as having a range of temperatures, and black should initiate play in it somewhere in that range.

## The Secret

Most people look at this endgame theory and say, "Okay, that's nice, but most games are decided in the middlegame, where there are a lot of interactions between sections of the board." But the truth is that the endgame principle of playing in the hottest area carries over very well into the opening and middlegame. While there are some considerations of how the different parts of the board interact, if you just try to determine whether a local result is good and strive for good local results across the board, you'll tend to end up in a good position for the middlegame.

The problem, of course, is how to determine what the hottest area of the board is. When we're looking at nicely isolated endgame positions like before, we can compute the temperature from the game tree. In theory, you could do that for any position, except that the game tree is way too large. So people have learned how to estimate the temperature, or urgency, of various local positions. Currently pros estimate that the temperature of a blank board is somewhere between 13 and 15 points.

## Humans really do understand Go

So that brings me to the thing that irks me most. Humans have a really good idea of how to play Go. I don't mean this in the sense that top pros should be unbeatable. AlphaGo proved that clearly false. However, the qualitative aspects of the flow of the game are generally correct. What is likely not correct is the quantitative aspect. In other words, while we can say things like "urgent is more important than big", when it comes to looking at two parts of the board that would only be described as "big", all humans, including top professionals, will sometimes be wrong about which one is bigger.

Here's an example from chess. A doubled pawn is a set of two same-colored pawns on the same file, so that one is blocking the other. It's generally known even to weak amateur players that two pawns on the same file are worse than two pawns on different files. But exactly how much worse? Well, you'd probably be happy to be up a pawn even if it meant that one of your pawns was doubled (5 pawns on 4 files vs 4 pawns on 4 files). So the downside of having a doubled pawn is certainly smaller than the upside of having an extra pawn.

But now what if you had two sets of doubled pawns? Would you rather have 5 pawns on 3 files or 4 pawns on 4 files? As someone who has never seriously studied chess, I have no idea. I'd guess it usually depends on the tactical positioning of said pawns. But maybe it makes sense to say that a doubled pawn is worth half of an undoubled pawn, or something like that. That's the sort of quantitative question that's very difficult for humans to answer.

In Go, there are lots of similarly difficult quantitative questions right from the start of the game. It's well-known that it is generally bigger to approach a 3-4 stone than a 4-4 stone. But exactly how much bigger? And how big is each of the followups?

While humans are bad at assigning precise values to these kinds of things, computers are very good at it. When AlphaGo looks at a position it might try move A and move B and find that move A wins in 51% of its simulations, while move B wins in 50.5%. It's very easy for AlphaGo to look at those numbers and say "Ah, move A is better." When a human looks at a position like that, they won't be able to create such precise numbers. Most likely, they'll come up with an answer like "Both moves are OK."

Here's the catch: there's no guarantee that AlphaGo's numbers actually reflect which move is better. However, based on the fact that AlphaGo has consistently won against top professionals, it appears that AlphaGo's value judgments are better than current human value judgments. But it pays to have some perspective here. In top professional games, a lead of more than 2.5 points is considered overwhelming, and so even seemingly minor improvements to early game judgments could lead to these sorts of results.

This is also a perspective that seems to be lost when people claim that AlphaGo could probably give top professionals a 2 stone handicap. If you look at an amateur rating system, such as on KGS, you'll see that a two rank difference (which is where you'd typically give a two stone handicap) corresponds to about 85% of games going to the stronger player. AlphaGo has shown that it can achieve these kinds of winrates, but two stones at the professional level mean significantly more than two stones at the amateur level. Fighting against a two stone handicap is similar to fighting against an opponent who gets 20 points more than you, and at the professional level even a quarter of that is huge.

I also feel that a lot of the comparison between machine learning and human learning is misrepresented. Once we had strong chess engines, humans became significantly better at the game, in part because they were able to see answers to those hard quantitative questions and learn to mimic them. Human players today are far ahead of chess engines from the 1990s and early 2000s. In other words, computer chess engines were an instrumental part of how humans learned chess.

We should expect Go to be the same way. Two of the most striking differences between AlphaGo's play as Master and standard top professional play are AlphaGo's tendency to play the large knight enclosure off the 3-4 point and its tendency to ignore moves in order to approach a 3-4 stone. To my amateur eye, these are indications of places where humans have consistently misevaluated a position. The large knight enclosure is probably better than it gets credit for, and approaching a 3-4 stone is probably even bigger than we already thought it was.

It will take plenty of time for humans to really incorporate these improvements into their play, and it's very difficult to do with only game records rather than interactive access to the engine. This is one of the reasons why I am much more interested in the progress of Zen than AlphaGo, even though AlphaGo is clearly stronger. Zen is being produced for commercial sale, while AlphaGo is only run through DeepMind. When a publicly available program surpasses top professionals, that is when we'll begin to truly see the game change.