When developing my very successful Japanese Shogi program Shotest I had a version-creation system that depended on #define flags to explicitly set every coding option. This was supported by bespoke add-in programs that read the source files to determine the #define settings, so that test results could be exactly matched with version settings. These were automatically archived so that the integrity of each version could be reliably verified. This is described in the article "Lessons from the road - Championship level AI development". This solid, reliable method meant that version testing had to be between different binaries. However it was a little inflexible.
The following article looks at the development evolution beyond this, and where it has been both very good and very bad.
If a program version is hardwired then testing it against another version needs some external communication. The Linux programmer would naturally adopt pipes or sockets, as these are purpose-built for such work. However, with testing spread across multiple PCs on varying networks, I wanted something completely universal, so created a system where moves were exchanged via the file system. This was very robust and allowed testing to run all on one server or scattered across single PCs, each of which might run two versions (note this was before the days of dual or quad core PCs!). All this served me well and was at the root of the early great success with Shotest.
Of course there are no free lunches, and the downside of this robust system was a lack of flexibility. Anything you wanted to test needed a new binary, which conspired to restrict what testing might actually be done. Through the existing add-on tools it was still possible to assess cross-referenced results from testing multiple versions. However the underlying structure made it harder to try out new things quickly, and so inhibited experimentation.
With the creation of our generic framework at AI Factory we added a system of generic switches that made it quick and easy to experiment. This whole system was designed to allow the user to run many different versions in tournaments within a single binary. This inevitably seduced us away from thinking about interplay between binaries. It was so convenient, fast and seemingly easy to verify.
However this flexibility came at a price.
Falling off a real tightrope is obviously bad, but it has one advantage: at least it is obvious when you have fallen off! The insidious property of falling off a development tightrope is that it might go undiscovered, yet the development damage might be as bad as the real thing. This is discussed and warned against in "Lessons from the road" above. You may be testing a new version, but a change to some common code might be broken, in which case both the existing reference and the new version may be equally crippled, and this is not exposed by looking at the play results between them.
In the "Lessons from the road" article it is stressed that you need reference games or puzzle sets to test against. However these are optional procedures at any stage, and any such skippable procedure will inevitably be skipped from time to time. Depending heavily on optional steps is high risk. A little complacency is inevitable after a few positive trials, and if a failure is not trapped early it can be hard to unravel. Test sets also get out-of-date, so if versions become too distant the tests may no longer be valid.
Separate cross-binary testing of AI allows a universal test: does A beat B? This test migrates between any generation of version and needs no bespoke defined test (although the latter can still be useful for detecting very small errors). The reason we had not done this in the new Fireball framework was that we had no obvious easy path to make it work across all games, where there may be more than one player per game. Happily the act of documenting this deficiency inspired the solution. A little more thought and it became clear that our framework could allow this, so that it is now automatically supported for all our engines, new or old, with no need for any retrospective massage of old code to achieve compatibility.
Although what we do now superficially mirrors what we implemented back in 1998 for Shotest, in that it shares moves via file transfer, the underlying structure is very different. In those days the actual move was sent as a character string. Our new engines, however, only directly support importing and exporting entire game records, with the initial state and the move history. This is encapsulated in a generic Fb_GameState record, shared by all games, so the calculation of moves is state-driven. Within the game control, taking back a move results in the entire game record being rewound and replayed forward to one position earlier. So moving via file sharing writes out an entire gamestate after each move. This may sound expensive, but these gamestates are written to a ramdrive and, in practice, the time spent calculating moves is much greater than the time spent saving gamestates.
Of course an old binary might have a larger gamestate, but this can easily be padded. The end product is an extra generic command in the command console that runs such a game. This has not required any retrofit of existing engine sources: it works immediately, out of the box.
This whole process reveals a generic malaise in development procedures as a whole. A vital ingredient in achieving success seems to be avoiding the need for the developer to be onerously diligent. Integrity tests need to flow naturally from what you are doing, rather than being a branched activity that can be overlooked or skipped. Development can often become a bit frantic, particularly with deadlines, so things that can be skipped will get skipped or "knowingly" forgotten. The development infrastructure needs to make it natural and easy to get the level of validation needed.
At the same time the infrastructure needs to make it easy to try out and test ideas, but this is potentially at odds with the demand for integrity. A good development framework needs to allow a high degree of fragmentation around multiple tests and versions, but this invites errors to creep in.
We have not yet created the perfect magic bullet that solves everything, but we have plugged one hole that at times might threaten a project. Each time we find a solution it is vital that it can be seamlessly shared by all our engines with little or no retrofit. Even while writing this article, ideas spring to mind. We already have an automated version archiving system: we could easily embed into this a check that notes that "N" archives have been created with no integrity check, and use it to ping the developer that it was time to run one. This would be automatically embedded in any testbed binary we created.
We have not reached total Nirvana yet, but we are gradually edging towards it, with an infrastructure that significantly compensates for all too easily manifested human frailty.
Jeff Rollason - October 2015