With our articles Backgammon Programs Cheat: Urban Myth?? and Statistical Minefields with Version Testing in mind, we have returned to the difficult area of assessing the significance of statistical results. The topic has been drawn to our attention again by the 1,000,000+ people who have played our Backgammon program: feedback from these users has illustrated the problems people can have when assessing the significance of test results. Many of these users are certain that our program rolls non-random dice, but this is a false perception...
This can also (paradoxically) be further undermined by the ill-directed application of common statistical theory, which can easily lead to misleading conclusions.
My general thesis is that, although humans are astonishingly capable of complex intellectual and intuitive reasoning (which is why we not only dominate this world but now have footprints on other worlds too), the human intuitive grasp of statistical probability is actually quite poor.
This article takes another look at the application of statistical probability to games development.
First, I am not going to claim that I have some advanced, profound intuitive grasp of this topic. In reality I have also allowed myself to be mesmerised at times by results that appeared to convey meaning, but actually said very little.
In defence against that frailty I created a program, BINOM.EXE, that reverse-engineers the binomial distribution, so that a set of results can be assessed to determine not only the significance of the results but also the probability of the different possible p and q values that would fit the observed distribution. This was my first line of defence.
This very useful little program is already detailed in Statistical Minefields with Version Testing. It allows the user to provide win and loss totals and from these deduce the probability that X > Y, or that the probability of X exceeds a given value, and so on.
For example if player A beats player B 7 times and loses 3 times, then this looks like it might be a good indicator that A is stronger than B. However a dip into BINOM.EXE gives the following analysis (showing only part of its output):
BINOMIAL v6 Aug2004 - Jeff Rollason
Population n=10, a : b  7 : 3
'p' calculated from a/n is 0.7000
The probability that A > B (p>q) is 0.886719 and that A < B (p<q) is 0.113281
p=0.887 expressed as a fraction/ratio is ( 7.8 : 1 )

Results 7 + 3 give the following range of values for 'p' (as %) within 0.01%
'p' within 50% probability within range 58.8% <-- 70.0% --> 78.8%
'p' within 67% probability within range 54.2% <-- 70.0% --> 82.3%
'p' within 95% probability within range 38.0% <-- 70.0% --> 91.7%
'p' within 99% probability within range 29.0% <-- 70.0% --> 95.4%
This tells us that it is indeed more probable that A is stronger than B, but that the chance that A > B is actually just 88.7%, so B might still be stronger (an 11.3% chance). This is simply too small a sample.
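The headline probability in the output above can be reproduced with a short sketch. The assumption here is mine, not stated in the article: BINOM.EXE behaves as if it places a uniform prior on p, giving a Beta(a+1, b+1) posterior, whose tail probability reduces to a simple binomial sum via the standard Beta/binomial identity.

```python
from math import comb

def prob_a_stronger(a, b):
    # With a uniform prior on p, the posterior after a wins and b losses
    # is Beta(a+1, b+1).  By the Beta/binomial identity,
    # P(p > 0.5) = P(Binomial(a+b+1, 0.5) <= a).
    n = a + b + 1
    return sum(comb(n, k) for k in range(a + 1)) / 2 ** n

print(prob_a_stronger(7, 3))    # ~0.8867, as in the BINOM.EXE output
print(prob_a_stronger(49, 33))  # ~0.9608, as in the second output below
```

For 7 wins and 3 losses this gives 0.886719, matching the program's figure exactly, which suggests this is indeed the quantity BINOM.EXE reports.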
If we extended the play trial further and reached 49 wins for A against 33 for B, the win percentage is much less impressive, but now we can feel much more confident that A is stronger:
BINOMIAL v6 Aug2004 - Jeff Rollason
Population n=82, a : b  49 : 33
'p' calculated from a/n is 0.5976
The probability that A > B (p>q) is 0.960790 and that A < B (p<q) is 0.039210
p=0.961 expressed as a fraction/ratio is ( 24.5 : 1 )

Results 49 + 33 give the following range of values for 'p' (as %) within 0.01%
'p' within 50% probability within range 56.0% <-- 59.8% --> 63.3%
'p' within 67% probability within range 54.5% <-- 59.8% --> 64.8%
'p' within 95% probability within range 48.9% <-- 59.8% --> 69.9%
'p' within 99% probability within range 45.5% <-- 59.8% --> 72.9%
Now our confidence is 96% that A is stronger.
The above process invites potentially dubious practices. If you have a set of results that does not meet the confidence limits you desire, the temptation is simply to keep churning out results until the threshold is satisfied. This has some validity, but it invites the user to stop at the point where randomness has briefly pushed the confidence bounds over the desired threshold. If the experiment had been continued, there is a good probability that the confidence level would drop back, even if only briefly. It is also possible that the threshold is crossed briefly, never to be reached again with more substantial sampling.
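How badly "stop when the threshold is crossed" misbehaves can be demonstrated with a sketch. This is my illustration, not the article's method: it uses a normal approximation to the confidence figure (surplus of wins over 1.645 standard deviations corresponds to roughly 95% confidence) to keep it fast, and compares peeking after every game against testing only at the end.

```python
import random

def crossed_95(a, b):
    # Normal approximation: confidence that A > B exceeds ~95% when the
    # win surplus exceeds 1.645 standard deviations of a fair walk.
    return (a - b) > 1.645 * (a + b) ** 0.5

def fair_trial(rng, max_games=200, peek=True):
    # A genuinely fair match-up.  With peek=True we stop the moment the
    # 95% threshold is crossed; otherwise we only test at the end.
    a = b = 0
    for _ in range(max_games):
        if rng.random() < 0.5:
            a += 1
        else:
            b += 1
        if peek and crossed_95(a, b):
            return True
    return crossed_95(a, b)

rng = random.Random(42)
peeking = sum(fair_trial(rng, peek=True) for _ in range(2000)) / 2000
fixed = sum(fair_trial(rng, peek=False) for _ in range(2000)) / 2000
print(peeking, fixed)  # peeking "succeeds" far more often than the nominal 5%
```

Even though both players are identical, stopping at the first crossing declares a significant result several times more often than the fixed-length test does.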
This highlights the key assumption on which these statistical distributions depend: that the results are random samples from the distribution, not samples the observer has somehow selected. Abuse of this assumption easily leads to a number of paradoxes:
Armed with your trusty BINOM.EXE, you can march on, accepting and rejecting sets of results depending on whether they conform to the level of confidence you require in order to assert that the results are good.
A simple example where such a strategy is flawed is as follows:
Imagine that you have a games-playing program with a tuneable, parameterised evaluation function. Given this, a plausible means of finding the correct values for the evaluation is simply to run multiple trials with randomised values. From these you pick the trial with the best result, and this then becomes your "best" set of parameters. To add confidence you test this result with BINOM.EXE to confirm that, within statistical bounds, it is indeed a strong indication that these are good parameters.
This all sounds very reasonable, but let's construct an example that shows just how flaky this strategy can be.
In the pseudo-example below we emulate the testing of 10 possible parameter sets by instead running 10 identical random trials of "heads-or-tails". This clearly should show no meaningful differentiation between trials, as they all test exactly the same thing. The following is a completely random trial run on this basis:
1. Wins 52 Losses 48
2. Wins 48 Losses 52
3. Wins 61 Losses 39
4. Wins 50 Losses 50
5. Wins 53 Losses 47
6. Wins 45 Losses 55
7. Wins 54 Losses 46
8. Wins 65 Losses 35
9. Wins 47 Losses 53
10. Wins 44 Losses 56
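A table like this is easy to generate for yourself. A minimal sketch (the seed is arbitrary, and the exact numbers will of course differ from the table above):

```python
import random

rng = random.Random(2011)
trials = []
for i in range(10):
    # 100 tosses of a fair coin stand in for one "parameter set" trial
    wins = sum(rng.random() < 0.5 for _ in range(100))
    trials.append((wins, 100 - wins))
    print(f"{i + 1}. Wins {wins} Losses {100 - wins}")
```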
However here we can see that our 3rd and 8th trials are skewed away from the population expectation of 50/50. If we run the more skewed of these, trial 8, through BINOM.EXE:
Population n=100, a : b  65 : 35
'p' calculated from a/n is 0.6500
The probability that A > B (p>q) is 0.998673 and that A < B (p<q) is 0.001327
p=0.999 expressed as a fraction/ratio is ( 752.6 : 1 )

Results 65 + 35 give the following range of values for 'p' (as %) within 0.01%
'p' within 50% probability within range 61.7% <-- 65.0% --> 68.1%
'p' within 67% probability within range 60.3% <-- 65.0% --> 69.4%
'p' within 95% probability within range 55.3% <-- 65.0% --> 73.9%
'p' within 99% probability within range 52.2% <-- 65.0% --> 76.4%
This analysis tells us that the probability that A > B is now 99.8673%!! This is clearly nonsense. However, had you run such a test with real parameters, it would genuinely appear that trial 8 was a good result. Here this simplistic approach to statistics has not been very helpful. Of course, in the limiting case we could run a huge number of trials, in which case the assessment would be relatively sound; with infinite trials the results would be exactly correct. However running huge numbers of trials is not practical, as it would simply take too long.
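The reason trial 8 is not really remarkable is that we implicitly selected the best of 10 trials. A sketch of the correction (my illustration, using a 60-win threshold for a single 100-game fair trial): the chance that at least one of 10 fair trials looks this good is far larger than the single-trial tail probability suggests.

```python
from math import comb

def tail(n, k):
    # P(Binomial(n, 0.5) >= k): chance a single fair trial of n games
    # produces k or more wins
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

p_single = tail(100, 60)                 # one fair trial hits 60+ wins
p_best_of_10 = 1 - (1 - p_single) ** 10  # at least one of 10 trials does
print(p_single, p_best_of_10)            # roughly 2.8% versus roughly 25%
```

So a result that looks like a 1-in-35 fluke for a single trial is something you should expect to see about once in every four batches of 10 fair trials.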
We could just run 10 times the number of trials:
1. Wins 501 Losses 499
2. Wins 507 Losses 493
3. Wins 495 Losses 505
4. Wins 495 Losses 505
5. Wins 518 Losses 482
6. Wins 492 Losses 508
7. Wins 492 Losses 508
8. Wins 525 Losses 475
9. Wins 483 Losses 517
10. Wins 504 Losses 496
Now the figures look flatter. The most skewed of these is the 5th, which gives a BINOM.EXE probability for A > B of 87.2%, weak enough for us to ignore.
On repeating this same trial again I got:
1. Wins 497 Losses 503
2. Wins 516 Losses 484
3. Wins 518 Losses 482
4. Wins 521 Losses 479
5. Wins 485 Losses 515
6. Wins 497 Losses 503
7. Wins 518 Losses 482
8. Wins 489 Losses 511
9. Wins 543 Losses 457
10. Wins 490 Losses 510
Now we have one substantially skewed result, the 9th, and this time BINOM.EXE tells us that the chance of A > B is 99.67%, which again is nonsense.
So this method is looking highly suspect, and simply piling in substantially longer runs is not going to remove the problem.
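There is a one-line argument for why longer runs do not help: the confidence threshold is a fixed quantile, so a single fair trial crosses, say, a 99% threshold about 1% of the time whatever its length, and with 10 candidate sets the chance that the best of them crosses it stays roughly constant:

```python
p_single = 0.01                    # crossing rate for one fair trial, any length
p_any = 1 - (1 - p_single) ** 10   # chance that at least one of 10 crosses
print(round(p_any, 4))             # about 0.0956, independent of trial length
```

So scaling up the runs narrows each trial's confidence interval, but the selection effect from picking the best of 10 never goes away.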
At this point I do not have a magic statistical method in my toolbox to offer for testing the significance of such trials. However, as it stands, this approach clearly does not work.
It is this paradox that also leads many people to observe "clearly non-random dice rolls" in our backgammon program. While the dice rolls look normal, the user ignores them. Once they eventually see a long sequence of doubles, they present some maths showing that the chance of this small-sample result is very low, but this ignores the number of times that the results were normal.
Looking at the backgammon paradox another way: millions of people are playing our program, and statistically most of them will see normal dice rolls, but a large number will also see skewed ones. The chance that some people will encounter such skewed results is very high indeed.
To add injury to this paradox, just as many people will see the same defect occurring to give them "good" dice rolls as those who get "bad" rolls, but they will never complain as they could not imagine that the program might be cheating in their favour!
The above paradox demonstrates an issue that has, at many points in my game development history, managed to lead me astray. I have seen it also undermine colleagues pursuing similar development goals. The constraints of such development require you to limit the amount of testing so that you can progress efficiently to the next stage and test more parameters. This need for limited testing naturally steers you into making this kind of mistake.
Superficially this looks somewhat terminal, if radically increasing sample sizes does not solve the problem. If nothing else, you need to be aware of the paradox.
However there are ways to reduce the danger of following bad test results. In many ways the following method mirrors some of the methods I use in game evaluation. The basis of it is not to prove the validity of results at each stage, but to reduce the probability that bad results are selected for a given amount of effort.
If we consider the original trial, but add in the emulation of a "significant" set (set 1 is skewed so it wins 55% of the time), we can now repeat this trial 3 times:
1. Wins 58 Lose 42   Wins 64 Lose 36   Wins 62 Lose 38   WWW
2. Wins 49 Lose 51   Wins 51 Lose 49   Wins 52 Lose 48   LWW
3. Wins 47 Lose 53   Wins 44 Lose 56   Wins 56 Lose 44   LLW
4. Wins 45 Lose 55   Wins 48 Lose 52   Wins 61 Lose 39   LLW
5. Wins 49 Lose 51   Wins 52 Lose 48   Wins 36 Lose 64   LWL
6. Wins 45 Lose 55   Wins 46 Lose 54   Wins 56 Lose 44   LLW
7. Wins 50 Lose 50   Wins 47 Lose 53   Wins 43 Lose 57   =LL
8. Wins 41 Lose 59   Wins 56 Lose 44   Wins 49 Lose 51   LWL
9. Wins 62 Lose 38   Wins 56 Lose 44   Wins 53 Lose 47   WWW
10. Wins 52 Lose 48   Wins 46 Lose 54   Wins 51 Lose 49   WLW
In the output above the net "wins" and "losses" are marked at the end of each row, so that the winning and losing sets can be seen easily. From this our "significant" test set 1 did achieve a positive result, but so did sets 9 and 10. In fact 9 was a "better" result.
A strategy for handling this test set is to run the first round of trials and then weed out "unpromising" results. We could simply take the top 50% of the sets, namely 1, 2, 5, 9 and 10, even though 3 of these were losing results. We then re-test these sets in a second trial and repeat the process to narrow down to a further subset of 3 for a 3rd trial.
Taking this simplistic approach we would then take sets 1, 5 and 9 forward from the 2nd trial and test just these in the 3rd trial. From that the skewed set 1 is correctly chosen in the 3rd trial.
Of course this is just one trivial example trial, but repeating this model over a long sample gave a 55% chance of the skewed sample set correctly being selected as the top set.
How does this compare to a flat trial of 180 games per set? ("180" matches the total number of games in the 3-stage trial above, i.e. 10x180 = 10x100 + 5x100 + 3x100 = 1800.)
This flat 10x180 trial gives a 45% chance of the top set being selected, so the staged trial with eliminations offers a better probability that the best set is chosen than a flat trial with the same total number of games.
This is a simple test example, but the general idea holds: it offers a better chance of finding the best parameter set by simply eliminating likely weaker sets early on. If a parameter set does badly in the first trial, there is a good probability that it is not the best set, so why not drop it from the trial?
Arguably this is dangerous, as we are throwing away sets on the basis of just a few tests, and one of them might actually be the best set; but note that a flat trial with a larger sample still carries the danger that the best set does not come top.
The success of this makes sense: it squeezes more relevant information from limited testing. It tests the sets that are more likely to be the correct sets to test, at the expense of not having a complete ordered assessment of all sets. This has some parallel with alpha-beta, which also strives to select the best result at the expense of losing the ordering of "losing" moves. It also parallels a knock-out tournament compared to a round-robin: the former has fewer matches but still does a good job of finding the "best" competitor, at the expense of losing the ordering of the losing competitors.
Note that the model above is very simple. In practice it can be made much more effective by setting cut-offs based on distance from the best set, so that an ordered set of percentage wins 55,54,50,20,20,15,12,11,10 might select just 55,54,50, whereas 55,54,50,50,49,15,12,11,10 might select 55,54,50,50,49. This squeezes more information from the limited trials and allows more efficient testing.
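That cut-off rule can be sketched as follows. The margin of 6 percentage points is an illustrative choice of mine, not a figure from the article; it happens to reproduce both selections above.

```python
def survivors(win_pcts, margin=6.0):
    # Keep every set whose win percentage is within `margin` points of
    # the best set, rather than keeping a fixed top-N
    best = max(win_pcts)
    return [w for w in win_pcts if w >= best - margin]

print(survivors([55, 54, 50, 20, 20, 15, 12, 11, 10]))  # [55, 54, 50]
print(survivors([55, 54, 50, 50, 49, 15, 12, 11, 10]))  # [55, 54, 50, 50, 49]
```

The cut adapts to the shape of the results: a clear leader eliminates aggressively, while a tight cluster keeps every plausible contender alive for the next round.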
Jeff Rollason - May 2011