How do we know if a game design is any good? How do we know whether a particular feature makes a game better, worse, or has no effect at all?
Until fairly recently, A/B testing was a practice most of us associated with market research teams adjusting features or advertisements to increase a company’s return on investment. However, in the last few years, particularly with the rise of social games that live and die by their ARPU, the idea of empirically testing how particular game features influence metrics of “success” has had growing relevance for designers as well. Of course, traditionalists might prefer to rely on intuition when designing their games – indeed, an ongoing discussion I’ve witnessed both at GDC and at GDC Online last October has been not just how but whether data should be used to inform designs. However, for designers trying to maximize return on investment, particularly those operating on a very limited budget, user data (such as number of levels completed, amount of time spent playing, retention rates, and money spent) can offer invaluable insights into what “works” and what does not.
Unfortunately, as Erik Andersen (a PhD student at the University of Washington's Center for Game Science) pointed out in his session last week, much of the A/B testing done by studios focuses on relatively small variables (such as the color of a particular button). Even worse – at least for us academically minded folk – the results of industry A/B testing are not usually made publicly available. That secrecy is rational from a competitive standpoint, but it means every party has to independently reinvent the wheel, which hinders the development of shared theories of engagement.
To this end, Andersen and colleagues have brought A/B testing into the lab. With large samples (101k players), higher-order manipulations (the relative presence of features common across many games), and comparisons across types of games (simple and complex), they are attempting to produce A/B generalizations that can contribute to a shared, academic body of knowledge about what improves a game.
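To give a concrete sense of what such a comparison involves, here is a minimal sketch, in Python, of testing whether a feature variant shifts a behavioral metric such as next-day retention. The condition labels and player counts are hypothetical illustrations, not figures from Andersen’s studies.

```python
# Minimal A/B comparison on a behavioral metric (e.g., next-day retention).
# All condition names and counts below are hypothetical, not from Andersen's work.
from statsmodels.stats.proportion import proportions_ztest

retained = [4120, 4380]    # players who returned: condition A (feature off), condition B (feature on)
assigned = [10000, 10000]  # players randomly assigned to each condition

z_stat, p_value = proportions_ztest(retained, assigned)
print(f"retention A: {retained[0] / assigned[0]:.1%}, "
      f"retention B: {retained[1] / assigned[1]:.1%}, p = {p_value:.4f}")
```

The same template extends to the higher-order manipulations described above by swapping in whatever behavioral measure (levels completed, time played, money spent) a given study tracks.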
Andersen presented findings from a series of studies that compared the relative payoff of particular features in the games Refraction, Hello Worlds (both relatively simple games), and Foldit (relatively more complex), all created as projects within the Center for Game Science.
Andersen was quick to note that the games examined are fairly limited in scope compared to the wide variety of games available to players, but reiterated that the effects observed held across these games for thousands of players. In turn, he suggested that future work would need to look at more games to examine the generalizability of the findings.
I found these studies to be interesting, in light of some of the counter-intuitive findings they yielded (e.g., no effect for audio) as well as some of the moderating variables identified (e.g., game complexity). Granted, this form of “academic” A/B testing is young and needs to deal with certain issues of rigor (for example, no subjective reports were collected to corroborate the levels of engagement assumed through behavioral measures), but this work nonetheless represents a solid step towards making A/B testing a more scientifically minded enterprise. By bringing industry questions and metrics into a lab setting, analyzing data without the bias of financial pressures, and then sharing results publicly, game design can better draw upon the strengths of iterative theory development and hypothesis testing that make up any science.
GDC: Game Optimization through Experimentation by Jim Cummings, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.