Evaluation options - Overview of methods
Author: Jeff Rollason

Copyright Notice: This article is Copyright AI Factory Ltd. Ideas and code belonging to AI Factory may only be used with the direct written permission of AI Factory Ltd.

This looks at the whole idea of creating artificial players to play human parlour games. The article is a broad sweep, but I will narrow it slightly by looking at methods that rank choices of play, not systems that analyse to create a single move solution, as might be generated by a system based on Prolog or similar. That is not a great narrowing, but it is a start.

My generalisation is that this is "evaluation"-based play, where a move is selected in preference to alternative moves on the basis of value. Not all such systems deliver a ranking of moves beyond the best move (alpha-beta, for example, does not), but at least the best move has some superior value attached to its evaluation.

The reason for this article? The process of creating such systems very easily poses difficult engineering problems for the developer. This article looks at some of the issues.


The evaluation function: The probable core

Looking at the history of games-playing programs, the natural place to start is to represent the game position in data arrays and then provide an evaluation function to measure how good that position is. This can optionally be plugged into a minimax search with alpha-beta pruning and the best move simply drops out. What the search has found is the move with the proven best minimum evaluation that it can force. This is a bit non human-like, as it assumes that the opponent sees everything and is playing as perfectly as it can. This has more the characteristic of a theorem prover than a game-playing strategy. In practice it is not necessarily a perfect strategy for beating all opponents.
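To make that framework concrete, the following is a minimal sketch of negamax search with alpha-beta pruning. The Position and Move types and the generateMoves(), makeMove(), unmakeMove() and evaluate() functions are hypothetical placeholders standing in for a real engine's own interface, not code from any particular program.

    // Minimal negamax search with alpha-beta pruning (sketch only).
    // Position, Move, generateMoves(), makeMove(), unmakeMove() and
    // evaluate() are hypothetical placeholders for a real engine's interface.
    #include <algorithm>
    #include <vector>

    struct Position;                          // game state (placeholder)
    struct Move { int from, to; };            // placeholder move record

    std::vector<Move> generateMoves(const Position&);
    void makeMove(Position&, const Move&);
    void unmakeMove(Position&, const Move&);
    int  evaluate(const Position&);           // +ve = good for the side to move

    int alphaBeta(Position& pos, int depth, int alpha, int beta)
    {
        if (depth == 0)
            return evaluate(pos);             // leaf: static evaluation

        std::vector<Move> moves = generateMoves(pos);
        if (moves.empty())
            return evaluate(pos);             // real code would detect mate/stalemate here

        for (const Move& m : moves)
        {
            makeMove(pos, m);
            int score = -alphaBeta(pos, depth - 1, -beta, -alpha);  // negamax flip
            unmakeMove(pos, m);

            if (score >= beta)
                return beta;                  // opponent can avoid this line: cut off
            alpha = std::max(alpha, score);
        }
        return alpha;                         // best score this side can force
    }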

Brushing aside the need for quiescence search to ensure that the position evaluated is stable, and the various means of optimising alpha-beta to allow the search to be deeper, you are left with the process of creating a core evaluation that allows positions to be ranked. This, in its simplest form, is just a linear evaluation function.

The structure of a linear evaluation function

Considering early-analysed examples such as Chess, this would identify features of positions that would be given ranked values and summed into one net evaluation term. Fairly obviously material value would be a major term, so that having a rook might be worth 5000 points, while having an isolated pawn might be worth -100 points. These, with a mirrored set for the opponent, would leave the net value of a position with +ve indicating an advantage and -ve a disadvantage. For Chess it is imaginable that you must have many terms to create a plausible player, but Jim Gillogly showed that his program "Tech" managed to perform very creditably while searching on material value alone.
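As an illustration of that structure, here is a minimal sketch of such a linear evaluation using the example weights above (rook = 5000, isolated pawn = -100). Position and the feature-counting functions are hypothetical placeholders.

    // A sketch of the simplest linear evaluation, using the example weights
    // from the text. Position and the feature counters are placeholders.
    struct Position;
    int countRooks(const Position&, int side);
    int countIsolatedPawns(const Position&, int side);

    enum { US = 0, THEM = 1 };

    int evaluate(const Position& pos)
    {
        const int ROOK_VALUE          = 5000;
        const int ISOLATED_PAWN_VALUE = -100;

        int score = 0;
        // Sum weighted feature counts for our side...
        score += ROOK_VALUE          * countRooks(pos, US);
        score += ISOLATED_PAWN_VALUE * countIsolatedPawns(pos, US);
        // ...and subtract the mirrored set for the opponent.
        score -= ROOK_VALUE          * countRooks(pos, THEM);
        score -= ISOLATED_PAWN_VALUE * countIsolatedPawns(pos, THEM);

        return score;    // +ve indicates an advantage, -ve a disadvantage
    }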

Brushing the Tech example aside, you need to create this multi-term evaluation.

Building a linear evaluation function

Chess offers an attractive basis for creating a plausible linear evaluation. This point is made emphatically by the existence of the well-known book "Point Count Chess" by Horowitz and Mott-Smith, which showed that a path to good play could be achieved by human players exercising such a linear point-based system.

Added to this, Chess is very well known, so skill in Chess is easy to come by and most Chess developers will be competent Chess players. (I have to quote an infamous exception: Ken Thompson, who created the world champion "Belle". In answer to "how good is your chess?" he replied "I don't play chess!")

The plan at this point is therefore to determine what aspects of a Chess position are important and to code these up. These can be hand-tested against single positions so that the developer can determine whether any changes are good or not.

However this alone will not be enough.

Tuning an evaluation function

Beyond validating that small evaluation changes actually result in the program making a correct move where previously it did not, the process of development needs a methodology to allow the evaluation to be gradually improved. This might seem obvious but actually has many pitfalls. Attempting to classify these:

(1) Hand-testing
You spot an issue and make a change and test it. If it works, you simply move on. Finding defects to correct can be from hand playing or observing games that it has played. This is particularly good for fixing major defects.
(2) Self-play testing
A natural idea is to make the change but include a switch so that you can autoplay games to see if it wins more often. First I should point out the minefield of assessing whether results are significant, which is already discussed in some detail in a previous article, Digging further into the Statistical Abyss. However there are other, bigger issues.
(3) Auto-play against multiple versions
This may be harder to arrange, but the richness of testing against multiple versions increases the validity of your testing.
(4) Auto-play against rival game engines
Here you test against independent alternative game engines.
(5) Hand tuning against "expert" sets
This is not always possible, but running the evaluation against large numbers of moves made by professional or just strong players, testing for an increase or decrease in the number of correct first choices, is a well-known and proven way of making progress.
(6) Auto-tuning against "expert" sets
This is the natural progression from (5), having a system to dynamically experiment with evaluation weights, crediting progress and penalising negative progress. This core idea is part of the general family of methods called "reinforcement learning" and "temporal difference learning" (a minimal sketch follows below).
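As a crude illustration of option (6), the sketch below perturbs one weight at a time and keeps a change only if it increases the number of matched expert first choices. The ExpertPosition type and bestMoveWithWeights() are hypothetical placeholders, and real reinforcement or temporal difference learning is considerably more sophisticated than this simple hill-climb.

    // Crude auto-tuning against an expert set: trial one weight change at a
    // time, crediting any increase in matched expert first choices.
    #include <cstddef>
    #include <vector>

    struct Move { int id; bool operator==(const Move& o) const { return id == o.id; } };
    struct ExpertPosition { /* position data */ Move expertMove; };

    // Hypothetical hook: best move found using the given evaluation weights.
    Move bestMoveWithWeights(const ExpertPosition&, const std::vector<int>& weights);

    int countMatches(const std::vector<ExpertPosition>& set,
                     const std::vector<int>& weights)
    {
        int matches = 0;
        for (const ExpertPosition& p : set)
            if (bestMoveWithWeights(p, weights) == p.expertMove)
                ++matches;                         // credit a correct first choice
        return matches;
    }

    void tuneWeights(std::vector<int>& weights,
                     const std::vector<ExpertPosition>& set,
                     int step, int passes)
    {
        int best = countMatches(set, weights);
        for (int pass = 0; pass < passes; ++pass)
            for (std::size_t i = 0; i < weights.size(); ++i)
                for (int delta : { +step, -step })
                {
                    weights[i] += delta;           // trial change to one weight
                    int score = countMatches(set, weights);
                    if (score > best)
                        best = score;              // progress: keep the change
                    else
                        weights[i] -= delta;       // no progress: revert it
                }
    }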

The above frameworks seem intuitive and obvious, so, at first glance, it is imaginable that this space can be freely entered and progress will naturally be made.

This is where maybe it is not as rosy as expected, as many game developers will testify. In a less optimistic frame of mind I had another pessimistic candidate title for this article: "Evaluation functions: A million different ways to bang your head against a brick wall".

Well, maybe it is not quite so bad as that, but this process is full of bear traps that you can fall into without even being aware that you are in one.

Problem 1: Evaluation holes

This is maybe not so obvious, but any evaluation term you might tune using all the methods above may be doing odd things you do not expect. A very common situation is that many terms become heavily skewed to compensate for terms that are simply missing altogether. You may tune and re-tune and get the best out of the evaluation you have, but in doing so several terms may be given bad tuning values simply to compensate for some other missing or defective term.

Problem 2: Evaluation of a progression of steps

A significant feature to help most games to progress from the start to a final won position is the concept of "hill climbing". Since it is the usual case that a win cannot be detected at the start of the game, it is necessary that the evaluation provides a continuous series of evaluation increases as you progress from start to win. Failure to get this right will probably mean that your game engine will head for some false peak and thereafter never progress.

This may sound like an obvious and avoidable problem, but actually it can be very hard to get right. In some simple game, where you just chase some other player character, you may only need to link evaluation to distance, which would be easy. However it is not uncommon that an evaluation may need to evaluate some feature "A" that will naturally progress to create some other new feature "B". An example for A might be the accumulation of pressure on a defensive point, with B the evaluation after breakthrough. The evaluation needs to graduate the value of A such that the start of B is evaluated just higher than the end of A. Again, in a simple game this might not be hard as you could massively weight B so that B is always > A. However in a complex game (such as Shogi) there may be many such progressions, and indeed the evaluation may need to choose between a path from A->B but also A->C. In that instance you will need to accurately balance alternative goals so that you can pick whether B or C is better. Also the quality of A may be variable: perhaps A can be escalated to a very high degree, resulting in a value of A that exceeds the delivered minimum value of B.
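One simple way to keep such a progression climbing is to cap the build-up term A below the minimum value awarded for B. The sketch below shows the idea; pressureOnDefence() and breakthroughAchieved() are hypothetical feature detectors and the weights are purely illustrative.

    // Graduating a build-up feature A so that reaching the follow-on
    // feature B always evaluates higher than any amount of A.
    #include <algorithm>

    struct Position;
    int  pressureOnDefence(const Position&);       // units of feature A accumulated
    bool breakthroughAchieved(const Position&);    // has feature B been reached?

    int progressionTerm(const Position& pos)
    {
        const int A_UNIT    = 40;     // value per unit of accumulated pressure
        const int A_CAP     = 900;    // ceiling: A can never outscore B
        const int B_MINIMUM = 1000;   // minimum value once breakthrough occurs

        if (breakthroughAchieved(pos))
            return B_MINIMUM;         // real code would grade B further from here

        // Cap A just below B's minimum so the search keeps climbing towards B.
        return std::min(pressureOnDefence(pos) * A_UNIT, A_CAP);
    }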

Problem 3: Incestuous self-play testing

This is the easiest first alternative to hand-testing and appears to deliver a method that offers some reasonable means of quantitatively measuring progress. The procedure is to allow evaluation features to be switched on and off and to then test the old program against the new one. This can easily provide nice "proof" as the new program beats the old one.

Sadly this seductively easy method is one of the biggest bear traps. I have been seduced, and have seen others seduced, into believing progress is being made when it is not. The fundamental problem is that what you have proved is that your new version can efficiently exploit the weaknesses of your old program, not that it is fundamentally stronger. This can be compounded by a succession of such versions, each of which beats the previous version. Each new version may be inheriting new extra weaknesses that it accepts in order to simply exploit the weaknesses of the older program. I have seen a progression of A<B, B<C, C<D, D<E, but then when comparing A and E you find A>E, as E has inherited too many new weaknesses that leave it too vulnerable.

Diluting incestuous tuning by testing against multiple versions

This is a better alternative to the above as you can test your new version against many older versions in a round-robin tournament, and is something that AI Factory already has built into the testbed so that it can be easily run. This naturally reduces the risk of progressive weakening as each new version gets tested against a variety of opponents. The validity of this can be improved by slotting in extra timid and aggressive players to make sure that your new version does not become vulnerable to particular play styles.
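A minimal sketch of such a round-robin harness follows. The playGame() function is a hypothetical hook that drives the actual engine versions and returns the index of the winning version, or -1 for a draw.

    // Round-robin testing: every version plays every other version for a
    // fixed number of games and wins are tallied per version.
    #include <cstdio>
    #include <vector>

    int playGame(int firstMover, int secondMover);   // hypothetical engine driver

    void roundRobin(int numVersions, int gamesPerPairing)
    {
        std::vector<int> wins(numVersions, 0);

        for (int a = 0; a < numVersions; ++a)
            for (int b = a + 1; b < numVersions; ++b)
                for (int g = 0; g < gamesPerPairing; ++g)
                {
                    // Alternate who moves first to remove first-move bias.
                    int winner = (g % 2 == 0) ? playGame(a, b) : playGame(b, a);
                    if (winner >= 0)
                        ++wins[winner];
                }

        for (int v = 0; v < numVersions; ++v)
            std::printf("version %d: %d wins\n", v, wins[v]);
    }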

Avoiding incestuous tuning by testing against other programs

This may be difficult to arrange, but many other such programs may use some common protocol for playing other opponents. This is increasingly likely if these programs play in computer tournaments, where such a protocol is needed. In times past I used a bitmap reader that literally just watched the screen and tried to spot moves being played by comparing bitmaps. It needed to spot when moves had finished being displayed, but it provided access to playing many rival computer programs and helped propel our Shogi-playing program "Shotest" to the top of the rankings.

Problem 4: Changing too much at once

From the above, you need to test the changes you make, but an issue is that any change needs significant testing to get past the point where statistical noise renders the result non-meaningful. The easy temptation is to change more than one thing at a time to "speed things up". Firstly you may introduce a simple bug, but even if not, an entirely plausible, apparently beneficial change can easily make things worse. You may then record a +ve result even though one of your changes actually made the program a bit worse while the other made it stronger.

The crushing conclusion, born of painful experience, is that testing has to be one-by-one, otherwise your results will have limited meaning, unless you can be supremely confident that the changes have to be good and that no bugs have been introduced. That should be rare.

Added to this, and to reduce the pain, you need to make changes switchable live, so that you can make 2 or 3 changes and then test them all at the same time, each turned on one by one. This is a natural way to proceed, given that so many machines are multi-core. Testing here frequently runs 12 or so versions in parallel. The upside is an efficient way to get testing done, but a downside is that you really need to be well organised. Running large numbers of tests in parallel invites complexity and means you are effectively doing many different things at once. If during this you start on some new feature while testing proceeds, you may well get tripped up by losing track of what you are actually doing.
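A simple way to make changes switchable live is to guard each candidate term with a flag, as in the hypothetical sketch below (the term names and evaluation hooks are placeholders, not any particular engine's code).

    // Live-switchable evaluation changes: each candidate term is guarded by
    // a flag so one build can test each change one by one, or in parallel.
    struct Position;
    int baseEvaluation(const Position&);
    int newKingSafetyTerm(const Position&);     // candidate change 1
    int newMobilityTerm(const Position&);       // candidate change 2

    struct EvalSwitches
    {
        bool useNewKingSafety = false;
        bool useNewMobility   = false;
    };

    int evaluate(const Position& pos, const EvalSwitches& sw)
    {
        int score = baseEvaluation(pos);
        if (sw.useNewKingSafety)
            score += newKingSafetyTerm(pos);    // only when switched on
        if (sw.useNewMobility)
            score += newMobilityTerm(pos);
        return score;
    }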

Problem 5: Allowing auto-testing to remove you from the assessment

The snag with automated testing is that it invites you to look at just the final score rather than the games being played. You may have had 100 games played in a test, but 8 of these exhibited key defects that needed attention. It is too easy to just take the end score and fail to follow up with hand observation of the games. This leads into wider methods that will be covered in a later article.

Problem 6: Automated tuning against expert sets can bury defects

This is addressed in the partner article Tuning Spades and can be a real issue. Your default evaluation might be generally ok, but suffer some key missing terms or just have bugs. Auto-tuning may mess up your existing handcrafted terms simply to avoid the issue caused by the missing or defective term. The example given in the partner article is "To give a chess evaluation metaphor, if the evaluation cannot handle protecting kingside pawns, it may tune to stop castling altogether. This is avoiding the problem that needs fixing. So the tuned version now hides the situation that needs to be fixed, so the defect has been "concreted-over"."

Problem 7: Assessing the statistics

Maybe this is the most frustrating source of potential self-delusion for the developer as it is not easy to intuitively assess the results of limited testing. The reader is best re-directed to look at Statistical Minefields with Version Testing and Digging further into the Statistical Abyss for earlier assessments of these issues.
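As a quick illustration of why limited testing is so hard to read intuitively, the sketch below computes an approximate 95% confidence interval for a win rate using the normal approximation to the binomial. This is only a rough rule of thumb; the articles cited above cover the issues properly.

    // Approximate 95% confidence interval for a win rate after a number of
    // games (draws ignored), using the normal approximation to the binomial.
    #include <cmath>
    #include <cstdio>

    void winRateInterval(int wins, int games)
    {
        double p  = static_cast<double>(wins) / games;
        double se = std::sqrt(p * (1.0 - p) / games);   // standard error
        std::printf("win rate %.3f, 95%% interval roughly %.3f .. %.3f\n",
                    p, p - 1.96 * se, p + 1.96 * se);
    }

    // e.g. 55 wins from 100 games gives an interval of roughly 0.45 to 0.65,
    // which on its own cannot separate a real improvement from noise.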

Problem 8: Evaluation creation is just too hard

Even with comprehensible games, creating an evaluation is a strange process where you can easily find you have only a rather myopic view of what you are doing. Good diagnostic tools (which AI Factory has) can help make this easier, but still the process of evolving an evaluation can feel a bit like an ant painting the Mona Lisa: you lack the comprehensive big picture that allows you to understand the impact of any detail you may be just about to add.

Further to this you may be faced with a game where it is simply too hard to extract what is needed from the position to create an evaluation function. This might invite experimentation, just looking for terms that appear to result in "better" play. You may find ill-understood tweaks that deliver results but ultimately lead nowhere. To search for a metaphor for this: you might design a robot to solve a maze, and in doing so find that if the robot limps a bit with its left leg, it ends up turning left more often, which is acknowledged as a credible means of solving mazes (or always turning right, etc.). Here your slowed-up limping robot outperforms the non-limping one, but this is not a good solution.

Problem 9: Resorting to incremental evaluation

A common method to make evaluation faster is to replace the node-by-node evaluation with an incremental update between adjacent nodes. This uses a map of values for every piece on every square and updates the evaluation by subtracting the value of the square moved from and adding the value of the square moved to. This needs a complex evaluation once, at the start of the move to be played, to create these piece/square maps, which are then used throughout the search.

This is wonderful in that it is extremely fast, so allows the program to examine many more search nodes. The defect is that the map created at the root may become horribly out-of-date by the time the terminal node is reached. In practice a blend of a small static evaluation subset combined with some incremental evaluation can be the best option.
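A minimal sketch of the incremental update is shown below. The table dimensions and names are illustrative placeholders, and real code would also have to handle captures, promotions and similar complications.

    // Incremental piece/square evaluation: a table built once at the root
    // gives each (piece, square) a value, and during the search the running
    // evaluation is updated by subtracting the from-square value and adding
    // the to-square value.
    const int NUM_PIECE_TYPES = 12;
    const int NUM_SQUARES     = 64;

    int pieceSquareValue[NUM_PIECE_TYPES][NUM_SQUARES];  // filled at the root

    struct Move { int piece, from, to; };

    int applyMoveIncremental(int currentEval, const Move& m)
    {
        currentEval -= pieceSquareValue[m.piece][m.from];  // leave the old square
        currentEval += pieceSquareValue[m.piece][m.to];    // arrive at the new one
        return currentEval;
    }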

Progression to non-human readable evaluation

In all the examples above the underlying assumption is that the developer is working with a human-comprehensible evaluation function, but this may not be the optimum solution. For example the most successful method used for Shogi (Japanese chess) is to create a table of all possible combinations of 2 pieces on the board (e.g. a rook on 1A and a bishop on 4B) and give each combination a weighting. This results in some 2 million different values to maintain. Very obviously this cannot be hand tuned, so it needs to be trained against human expert games. This is repeated thousands of times until the trained set of data yields the desired game performance.
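The sketch below shows the general shape of such a two-piece table; the dimensions and indexing here are illustrative placeholders only, and real Shogi programs use far more elaborate feature encodings and training schemes.

    // Each pair of (piece, square) codes indexes one trained weight.
    #include <cstddef>
    #include <vector>

    const int NUM_PIECE_SQUARE = 40 * 81;    // (piece type, square) codes, illustrative

    // Roughly NUM_PIECE_SQUARE^2 weights, trained offline from expert games.
    std::vector<short> twoPieceWeight(NUM_PIECE_SQUARE * NUM_PIECE_SQUARE);

    // Encode one piece on one square as a single code for the table.
    inline int psCode(int piece, int square) { return piece * 81 + square; }

    // Sum the trained weight of every pair of piece/square codes on the board.
    int twoPieceTerm(const std::vector<int>& codes)
    {
        int score = 0;
        for (std::size_t i = 0; i < codes.size(); ++i)
            for (std::size_t j = i + 1; j < codes.size(); ++j)
                score += twoPieceWeight[codes[i] * NUM_PIECE_SQUARE + codes[j]];
        return score;
    }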

This then removes the developer from any chance to either directly work on or understand the resulting evaluation. It is rather like looking at one of those magic pictures that appear to be constructed out of noise but conceal a hidden 3D image; unfortunately, in this instance there is no 3D image to be seen.

Maintaining Sanity

Creating programs to provide AI for games is fun but, as the notes above show, it is not always an easy or even pleasant trip. If offering advice I'd suggest making yourself well aware of what can go wrong, but I'd also recommend investing in creating good development tools. AI Factory has such tools and we fully understand just how important these are.

Jeff Rollason - September 2012