# A technical note for game analysts

Twice now I have reviewed and subsequently rejected papers that attempt to analyze the price of games in terms of real currency and the price of virtual items in terms of virtual currency. I've had to do so on the basis of a fairly serious oversight in the authors' statistical analysis, which reflects a lack of understanding of the very specific assumptions made by workhorse statistical methods like ANOVA and Ordinary Least Squares (OLS) regression. These are standard "first approximation" methods in academic research, but they are nigh state of the art in game analytics.

To help drive this discussion, I'm going to pick on this post on Gamasutra, which was intended to provide readers with a basic introduction to linear regression. The author of that article suggested a model of player churn in the month of July given by

(user cancelled) = x + (y1 * logins) + (y3 * friends_cancelled) +  (y5 * achievements) + e

In this case, the variables logins, friends_cancelled, and achievements can all be thought of as continuous variables, whereas the outcome - user cancelled - is dichotomous, i.e. it is equal to 1 if the user cancelled and 0 if not. e is an error term that is included in every regression, and is assumed to have a mean of zero. (This model is known in statistics and econometrics as a linear probability model.)

This is a pretty simple model to estimate with any statistical package, free or otherwise. The author of the Gamasutra article estimates the y's, which in this case are the marginal effects. That is, the value of each of the y's is the expected change in the probability of a user cancelling, given a one-unit increase in the corresponding independent variable and holding the other variables fixed. So when we look at the estimation results

(likelihood of cancelling) = 1.31132 – (0.0470642 * logins) + (0.0567763 * friends_cancelled) – (0.0795353 * achievements)

the interpretation of the coefficient on logins is "For every additional time a user logs in during the month of July, we expect his/her probability of cancelling to fall by .047." Similar interpretations work for the other variables.
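To make that interpretation concrete, here is a minimal sketch that plugs the reported coefficients into the fitted equation. The player values (10 logins, 2 friends cancelled, 5 achievements) are made up for illustration, and the function name is my own.

```python
# Coefficients copied from the estimation results quoted above.
INTERCEPT = 1.31132
B_LOGINS = -0.0470642
B_FRIENDS_CANCELLED = 0.0567763
B_ACHIEVEMENTS = -0.0795353


def predicted_cancel(logins, friends_cancelled, achievements):
    """Fitted value of the linear probability model.

    Nothing here forces the result into [0, 1]; it is just the
    linear combination from the estimated equation.
    """
    return (
        INTERCEPT
        + B_LOGINS * logins
        + B_FRIENDS_CANCELLED * friends_cancelled
        + B_ACHIEVEMENTS * achievements
    )


# Hypothetical player, chosen only for illustration.
base = predicted_cancel(logins=10, friends_cancelled=2, achievements=5)
one_more_login = predicted_cancel(logins=11, friends_cancelled=2, achievements=5)

# The change from one extra login is exactly the coefficient on logins.
marginal_effect = one_more_login - base
print(f"baseline: {base:.4f}, after one more login: {one_more_login:.4f}")
print(f"change: {marginal_effect:.4f}")
```

Because the model is linear, the change is the same no matter which baseline player you start from.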

This is very useful information. The author tells us that with an R-squared of about 0.9 and a high F-statistic, we might at first be very satisfied with this model. But there is a very serious flaw here that could completely invalidate the estimates. Specifically, this model suffers from an endogeneity problem: if your friends cancel, you're more likely to cancel; but if you cancel, your friends are more likely to cancel. In other words, the value of friends_cancelled is affected by the value of user_cancelled, and vice versa. This simultaneity violates a basic assumption of simple linear regression (specifically, that the explanatory variables are uncorrelated with the error term). When this assumption is violated, the parameter estimates - those useful marginal effects - are biased, potentially by a lot. Think about it: if user_cancelled causes friends_cancelled and friends_cancelled causes user_cancelled, then the coefficient on friends_cancelled will be inflated because it "detects" and reports the causation in both directions. The true effect could be as little as half of the reported 0.057.
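A small simulation shows how this can happen. It is a stylized sketch, not the article's data: both variables are continuous, the two feedback strengths (`b_true` and `g_true`) are values I picked, and the errors are standard normal. The player's outcome depends on friends, friends depend on the player, and OLS is then run as if causation only went one way.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

b_true = 0.3  # assumed true effect of friends' churn on the player
g_true = 0.3  # assumed feedback from the player's churn to friends

u = rng.normal(size=n)  # unobserved shocks to the player
v = rng.normal(size=n)  # unobserved shocks to the friends

# Structural system:  player  = b_true * friends + u
#                     friends = g_true * player  + v
# Solving the two equations jointly gives friends in terms of the shocks.
denom = 1.0 - b_true * g_true
friends = (g_true * u + v) / denom
player = b_true * friends + u

# Naive OLS of player on friends (with an intercept).
X = np.column_stack([np.ones(n), friends])
intercept_hat, slope_hat = np.linalg.lstsq(X, player, rcond=None)[0]

print(f"true effect: {b_true:.3f}, OLS estimate: {slope_hat:.3f}")
```

With these assumed values the OLS slope lands well above 0.3, because `friends` contains a piece of `u`, the very error term the regression assumes it is unrelated to.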

Now, if all we wanted to do was predict whether someone will quit or not, we could move on to the next problem, because the overall fit of the model is not affected by the endogeneity. But what if we want to do more? The model might predict quits quite well, but that doesn't necessarily help you keep the player from quitting. In other words, if the goal is to just say "Paying user 37 is likely to quit next week", then the above model may suffice. But what if we want to know how to keep user 37 in the game? We might like to ask a question like "How much should we spend to keep user 37's non-paying friends in the game so that user 37 will also stay in the game and pay us more?" In that case, we have to deal with the endogeneity problem.

So, how do you do that? Normally, you have to find an instrument for the endogenous variable (friends_cancelled, in this case). A variable that instruments for friends_cancelled is one that contributes to changes in the endogenous variable but does not affect the value of the dependent variable except through that endogenous variable.
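For readers who have not seen an instrument in action, here is a rough sketch of two-stage least squares on simulated data. Everything in it is an assumption for illustration: the instrument `z` is purely hypothetical (I am not claiming any particular game variable qualifies), the effect sizes are made up, and the variables are continuous for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

b_true = 0.3  # assumed true effect of friends' churn on the player
g_true = 0.3  # assumed feedback from the player to friends
c = 1.0       # assumed strength of the instrument's effect on friends

u = rng.normal(size=n)  # unobserved shocks to the player
v = rng.normal(size=n)  # unobserved shocks to the friends
z = rng.normal(size=n)  # hypothetical instrument: moves friends, not the player

# Structural system:  player  = b_true * friends + u
#                     friends = g_true * player + c * z + v
denom = 1.0 - b_true * g_true
friends = (g_true * u + c * z + v) / denom
player = b_true * friends + u


def ols(x, y):
    """Return (intercept, slope) from a least-squares fit of y on x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Naive OLS ignores the feedback loop.
naive_slope = ols(friends, player)[1]

# Stage 1: keep only the part of friends that is explained by z.
a0, a1 = ols(z, friends)
friends_hat = a0 + a1 * z

# Stage 2: regress the outcome on that "clean" part.
iv_slope = ols(friends_hat, player)[1]

print(f"true: {b_true:.3f}, naive OLS: {naive_slope:.3f}, 2SLS: {iv_slope:.3f}")
```

Doing the two stages by hand like this gives the right point estimate but not the right standard errors, so for real work you would want a package with a dedicated instrumental-variables routine.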

Luckily, in this example, we can fix the problem by just changing the friends_cancelled variable. Rather than trying to predict whether a player quit in July based on how many of her friends also quit in July, what if we ask whether a player quit in July given how many of her friends quit in June? In that case we can argue that her friends weren't responding to her decision to quit, because their decisions came first, and so we can rely on the estimates. So here's my proposed model:

(user cancelled this month) = x + (y1 * logins) + (y3 * friends_cancelled_last_month) +  (y5 * achievements) + e

Note that you could just as easily change the relevant periods to "this week" and "last week". You could also include other variables like friends_cancelled_one_week_ago and friends_cancelled_two_weeks_ago. This kind of model will do a much better job of telling you the true effect of friend churn on player churn.
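Building the lagged variable is mostly bookkeeping. The sketch below uses a toy friend list and cancellation record that I invented; the names `friends`, `cancel_month`, and `friends_cancelled_in` are hypothetical. I also assume that players who already quit before July should be left out of the July sample, since they can no longer cancel.

```python
# Hypothetical toy data: each user's friends, and the month (1-12) in
# which each user cancelled (None means still subscribed).
friends = {
    "ann": ["bob", "cat", "dan"],
    "bob": ["ann"],
    "cat": ["ann", "dan"],
    "dan": ["ann", "cat"],
}
cancel_month = {"ann": 7, "bob": 6, "cat": 6, "dan": None}

JUNE, JULY = 6, 7


def friends_cancelled_in(user, month):
    """Count how many of the user's friends cancelled in the given month."""
    return sum(1 for friend in friends[user] if cancel_month[friend] == month)


rows = []
for user in friends:
    quit_month = cancel_month[user]
    # Assumption: skip anyone who had already cancelled before July.
    if quit_month is not None and quit_month < JULY:
        continue
    rows.append(
        {
            "user": user,
            "cancelled_this_month": int(quit_month == JULY),
            "friends_cancelled_last_month": friends_cancelled_in(user, JUNE),
        }
    )

for row in rows:
    print(row)
```

Swapping the month constants for week numbers gives the weekly version, and calling `friends_cancelled_in` with earlier periods gives the extra lag variables.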

In general, correcting for the endogeneity problem is no easy task. In this example, and as with most games, we are lucky to have data with enough granularity that we can simply reformulate the analysis and avoid the problem altogether. But especially when dealing with markets, analysts and researchers have to tread carefully, because the endogeneity problem always comes up when we want to ask interesting questions. Recognizing this problem and dealing with it is key to producing valuable empirical analyses of game economies.