Thursday, November 13, 2014

Game Outcomes Project Methodology


This page explains the technical details of the Game Outcomes Project; it serves as the technical appendix to a 5-part series.


The Game Outcomes Project team includes Paul Tozour, David Wegbreit, Lucien Parsons, Zhenghua “Z” Yang, NDark Teng, Eric Byron, Julianna Pillemer, Ben Weber, and Karen Buro.

In October and early November of 2014, the Game Outcomes Project team ran a survey targeting game developers, asking roughly 120 questions about each respondent's most recent team-based game development effort.  Questions centered on development culture, teamwork, project management, and the project's outcome.

We received 771 responses, of which 302 were completed, and 273 referred to projects that were neither cancelled nor abandoned.

We chose to exclude responses related to cancelled and abandoned projects for reasons we explain at the end of this post.  This blog post explains our survey design methodology and our analytical methodology for the remaining 273 responses, for those who wish to delve into the statistical and mathematical details.

In other words, this is the page for all of the gory technical details we didn't want to put in the other articles.

You can download the survey questions here: (PDF).

We will also make our full data set available shortly after publication of the final article.

Survey Design Approach


Our survey was designed as follows:
  • Page 1 was merely a qualifier page.  In order to ensure that we received responses only from our target demographic, we asked respondents to complete the survey only for projects with known outcomes within the last 3 years which had a team size of 3 or more and on which they had served in a development role.  Although we posed these questions as four check boxes, this was merely to dissuade casual survey-takers or inappropriate respondents, and the answers to these questions were ignored in our analysis.
  • Page 2 contained background questions such as team size, production methodology, financial incentives offered to the team, and project lifetime.
  • Pages 3 and 4 contained a little over 100 questions around teamwork, culture, and various other factors relating to game development using a 7-point Likert scale.
  • Page 5 contained only four questions, asking respondents to rate the outcome of the project along a 6- or 7-point Likert scale for four different dimensions: project delays or cancellation, return on investment, aggregate review scores (MetaCritic or GameRankings), and the team's internal satisfaction with whether the project had achieved its goals.
In designing the survey, we took several steps to help reduce the risk of bias.  Although some survey respondents have complained about these aspects of the survey, they were entirely intentional:

  • We occasionally asked important questions twice, once with a positive frame (“I rarely had to work overtime”) and once with a negative frame (“I had to work a lot of overtime”). We did this because we were concerned that if we asked all questions with the same frame, respondents who felt positively or negatively about a project would be tempted to simply scroll through the list and click somewhere along the “agree” or “disagree” part of the spectrum.  We felt that intentionally disrupting this behavior by occasionally shifting the frame of the questions would force the respondent to pay attention and consider each question individually.
  • We sometimes asked slightly different versions of the same question in an attempt to tease out the cause of some phenomenon. For example, we had five questions related to “crunch” and overtime, and one of those questions asked about voluntary overtime while another asked about mandatory, imposed overtime. We felt we could use these subtle distinctions to tease out deeper causal factors (spoiler: we could, and we did).
  • We deliberately removed the section names and enabled randomization of the question ordering via SurveyMonkey. Although this led to a large wall of questions that was off-putting to some respondents, we felt that announcing the section names openly might tip our hand as to what we were looking for in each section, and that leaving the questions in each section clustered together would likely have a similar effect, allowing respondents to simply use the same answer for every question in the group. By randomizing the ordering of all the questions on a given page, we would greatly reduce the likelihood of these sorts of phenomena.
  • We added a qualification page at the beginning of the survey asking respondents to continue only if they had worked on a team project in a development role within the last 3 years that had some sort of known outcome. We deliberately wanted to avoid continuously-developed projects as the outcomes of these types of efforts are much more difficult to quantify.

Participants


We recruited participants from several different sources.  Of the 771 responses we received, our best guess as to the distribution (based on the timings of the various surges we observed in survey completion) is as follows:

  • ~100 responses: posts on TheChaosEngine.com (internal, private game industry forums)
  • ~120 responses: announcement on Gamasutra.com
  • ~400 responses: direct IGDA mailer exclusively to IGDA members
  • ~150 responses: Twitter announcements, various other forum posts, and other/unknown sources
Given the diversity of sources both here and in our final responses, we feel comfortable asserting that our results represent many different teams of many different sizes (though with a moderate bent toward AAA development over indie development, based on the "final team size" results).

However, we have no way to track completion rates by source, so it's impossible for us to determine which of the final 273 responses (all responses that were fully completed and referred to non-cancelled projects) came from which source.

The Aggregate Game Outcome Score


Lacking any way to objectively define “success” or "failure," we decided that the best way to quantify the outcome was through the lenses of four different kinds of outcomes – critical reception, return on investment (ROI), delays or cancellation, and the team’s own perception of its success or failure – and later combine them into a single “outcome” score during the post-survey analysis phase.  This led to four questions, each of which allowed answers along a 6- or 7-point scale.
  • Delays: “For the game's primary target platform, was the project ever delayed from its original release date, or was it cancelled?”
  • ROI: “To the best of your knowledge, what was the game's financial return on investment (ROI)? In other words, what kind of profit or loss did the company developing the game take as a result of publication?”
  • MetaCritic: “To the best of your knowledge, was the game a critical success?”
  • Internal: “Finally, did the game meet its internal goals? In other words, was the team happy with the game they created, and was it at least as good as the game you were trying to make?”

We based the decision to combine the four outcome values into a single score on two factors.  First, since all four questions address different aspects of the same project's outcome, it seems intuitively obvious that they are related; all four aspects came after the end of development and had to have been caused by the development cycle itself (or by other factors, such as consumer tastes and marketing spend) rather than by each other.

Secondly, all four outcomes are strongly positively correlated to one another, as shown in the scatterplots below.

Figure 1. Animated GIF of cross-correlations between all four game project outcome factors (on a 4-second delay).

Note that this image is an animated GIF with a 4-second delay; if you don't see it changing, wait a bit longer for it to finish loading.  Also note that all data has been randomly "jittered" slightly for these charts to make coincident data points more visible.  Note also that all four of these output dimensions have been "normalized" on a 0-1 scale with a "1" being the best possible outcome (the game shipped on time, critics loved it, it made huge piles of cash, or the team was thrilled with the fruits of their own efforts), and lower values being quantized equally along the 0-1 scale depending on the number of gradations in the question.

Each of these correlations has a p-value (statistical significance) under 0.05 (the p-value gives the probability of observing data such as our sample if the variables were truly independent; therefore, a small p-value can be interpreted as evidence against the assumption that the variables are independent).  This makes it very clear that the four aspects of game project outcomes are interrelated.

We eventually settled on a simple non-weighted sum for the aggregate outcome score.  Although we were tempted to give each outcome value a coefficient, there is no objective basis for determining the coefficients.

We assigned the best possible outcome for each factor (terrific reviews, makes lots of money, no delays, team couldn't be happier with it) a value of 1.0, and we gave a worse outcome a correspondingly lower score (closer to 0) along a linear scale depending on the number of gradations in the questions asked (some of the outcome questions were asked on a 6-point scale, others on a 7-point scale).  We then added them together.

        Score = 25 * ((Delays) + (ROI) + (MetaCritic) + (Internal))

Note that the multiplication by 25 effectively converts the score to a 0-100 range, since each of the 4 outcome values is between 0 and 1.
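As a minimal illustration of the normalization and aggregation, here is a sketch in Python (our actual calculations were done in Excel); the raw answer encoding (1 = worst gradation, N = best) and the per-question gradation counts in the example are assumptions made for illustration:

    def normalize(answer, num_gradations):
        """Map an answer on a 1..N scale linearly onto the 0-1 range (1 = worst, N = best)."""
        return (answer - 1) / (num_gradations - 1)

    def aggregate_outcome_score(delays, roi, metacritic, internal, gradations=(6, 7, 7, 7)):
        """Score = 25 * (Delays + ROI + MetaCritic + Internal), with each term in [0, 1].

        The gradation counts are illustrative; some outcome questions used a
        6-point scale and others a 7-point scale.
        """
        answers = (delays, roi, metacritic, internal)
        normalized = [normalize(a, n) for a, n in zip(answers, gradations)]
        return 25 * sum(normalized)

    # The best possible answer on every outcome question yields a score of 100.
    print(aggregate_outcome_score(6, 7, 7, 7))  # -> 100.0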

We also experimented with exponents for each factor, which we tuned in Solver to try to maximize the cross-correlations between the outcome factors, and with multiplying the four factors together (as one would combine probabilities) instead of simply adding them.  However, we found that simply adding the four outcome factors, in addition to being simplest, achieved the highest correlation, and we could not justify the additional complexity of any other approach.

Missing Data Handling


Roughly 5% of the data in our survey was missing, as we allowed respondents to leave a small number of questions blank on pages 3-5 of the survey.

For most questions, we simply averaged the non-blank responses to each question using the AVERAGEIF() function in Excel, and then used that average to fill the missing data for that question.

For the four outcome questions, given the critical nature of these values, we felt a more exhaustive approach was required.  Here, we used the mean of two values: the average value of all non-empty responses to that question, and the average of all other non-empty outcome values for that response.
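For illustration, here is a sketch of this imputation scheme in Python using pandas (our actual workflow used Excel's AVERAGEIF()); the column names and function name are placeholders:

    import pandas as pd

    def impute_survey(df: pd.DataFrame, outcome_cols: list) -> pd.DataFrame:
        """Fill blanks as described above; `outcome_cols` names the four outcome columns."""
        df = df.copy()

        # Ordinary survey questions: fill blanks with the column mean
        # (the average of all non-blank responses to that question).
        question_cols = [c for c in df.columns if c not in outcome_cols]
        df[question_cols] = df[question_cols].fillna(df[question_cols].mean())

        # Outcome questions: fill blanks with the mean of (a) the column mean of
        # that question and (b) the respondent's own average over their other
        # non-blank outcome answers (NaNs are skipped automatically).
        col_means = df[outcome_cols].mean()
        row_means = df[outcome_cols].mean(axis=1)
        for col in outcome_cols:
            blanks = df[col].isna()
            df.loc[blanks, col] = (col_means[col] + row_means[blanks]) / 2
        return df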

Correlations and p-Values


As nearly all of our data used a Likert scale, the Spearman correlation was a more appropriate measure than the more commonly-used Pearson correlation coefficient.  This required us to use the SCORREL() function from the Real Statistics Resource Pack available from real-statistics.com rather than the built-in CORREL() function in Excel.

In practice, we found there was little difference between the two -- typically, a difference of less than 0.02 for nearly all of our correlations, though occasionally (in less than 2% of cases) the difference was as large as 0.07.  However, despite these nearly-identical results, we felt it was essential to go the extra mile and use the more-accurate Spearman correlation coefficient.
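The equivalent comparison in Python with SciPy would look roughly like the sketch below (we actually used SCORREL() from the Real Statistics Resource Pack and Excel's built-in CORREL()); the input arrays and function name are placeholders:

    from scipy.stats import pearsonr, spearmanr

    def compare_correlations(question_scores, outcome_scores):
        """Report Spearman vs. Pearson correlations for one survey question."""
        rho, spearman_p = spearmanr(question_scores, outcome_scores)
        r, _ = pearsonr(question_scores, outcome_scores)
        print("Spearman rho = %.3f (p = %.4f)" % (rho, spearman_p))
        print("Pearson  r   = %.3f" % r)
        print("difference   = %.3f" % abs(rho - r))  # typically < 0.02 in our data
        return rho, spearman_p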

We used a p-value threshold of 0.05 for each factor in our survey; however, only 4 questions had p-values between 0.01 and 0.05, so a stricter threshold of 0.01 would have invalidated only 4 of our ~120 questions and would not have materially changed our results.

In cases where we compared a binary variable to the combined outcome score, we used the Wilcoxon Rank Sum Test to determine p-values (via the WTEST() Excel function provided by the Real Statistics Resource Pack).  This includes the various types of financial incentives discussed in article 1.

In cases where we compared a variable with several discrete, independent values to the combined outcome score (such as which production methodology or game engine was used, as discussed in the first article), we used the Kruskal-Wallis test to determine p-values (via the KTEST() function provided by the Real Statistics Resource Pack).
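The same two tests are available in SciPy; the sketch below shows how the p-values could be computed (in practice we used the WTEST() and KTEST() functions from the Real Statistics Resource Pack in Excel). The grouping variables and function names are illustrative:

    from scipy.stats import kruskal, ranksums

    def binary_factor_pvalue(outcome_scores, factor_present):
        """Wilcoxon rank-sum test: a binary factor (e.g. a particular financial
        incentive) vs. the combined outcome score."""
        with_factor = [s for s, flag in zip(outcome_scores, factor_present) if flag]
        without_factor = [s for s, flag in zip(outcome_scores, factor_present) if not flag]
        _, p_value = ranksums(with_factor, without_factor)
        return p_value

    def multi_group_pvalue(outcome_scores, group_labels):
        """Kruskal-Wallis test: a factor with several discrete values (e.g. the
        production methodology used) vs. the combined outcome score."""
        groups = {}
        for score, label in zip(outcome_scores, group_labels):
            groups.setdefault(label, []).append(score)
        _, p_value = kruskal(*groups.values())
        return p_value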

Cancelled and Abandoned Projects


We decided to ignore responses that turned out to be for cancelled or abandoned projects. This was a tough decision, but the fundamental problem is that we have no good way to assign an outcome value to a game project that was cancelled or abandoned before completion – the “outcome” has to include its critical reception and ROI for a real direct comparison, and since it was cancelled before completion, these will never be known.

Initially, we felt a 0 was a proper score for a cancelled project. This makes intuitive sense, as surely a cancelled project is the worst possible outcome and has no value, right?

But this isn’t necessarily the case. There’s a world of difference between a team abandoning what could have been a great game 3 months into development because they decided that working on some other, even greater game project would be a better use of their time, and a team slogging through a multi-year death march, impairing their health with extended overtime, and ending up with divorces, only to see their game cancelled at the last moment. Those two games should score very differently in terms of their “outcome,” and for cancelled or abandoned projects, that data does not exist.

There’s also the simple fact that many times, cancellations and abandonment are caused by factors outside the team’s control. Perhaps a key employee ran into health issues, or perhaps despite a team being terrific and working on a very promising game, the parent company ran out of money and had to close up shop. These kinds of stories happen all the time, and of course there would be no way for our survey to detect these things.

That’s not to say that cancellation and abandonment are entirely random. However, we found that the correlations with cancellation were generally far lower, and only a handful of variables correlated reasonably well with this outcome. We hope to discuss the cancellation issue further in a future article, but for the main part of our series, we focus solely on non-cancelled game projects.


Predictive Modeling


We looked at a number of different ways of building predictive models that would use all the inputs to predict the aggregate outcome score.  We imported the data into Weka and tried the following models (the numbers are correlations between predicted and actual outcome scores, over the full data set and under 10-fold cross-validation):
  • Linear Regression Full: 0.82
  • Linear Regression 10-fold: 0.51
  • M5 Prime Full: 0.89
  • M5 Prime 10-fold: 0.59
  • Additive Regression (20 learners) Full: 0.81
  • Additive Regression (20 learners) 10-fold: 0.62
We also built two linear regression models in Excel, limiting ourselves only to inputs which exhibited statistically significant correlations (p-value < 0.05) with the aggregate outcome score (this excluded only roughly 30 of the ~120 survey questions).  The full linear regression achieved a correlation of 0.82, identical to the Weka linear regression above.

However, to avoid overfitting, we later constrained the linear regression so that the regression coefficients had to have the same signs as the correlations of the underlying inputs with the outcome score.  This gave us a correlation of 0.73 -- still an excellent correlation.

We also ran cross-validation with separate parts of the data set (excluding 20 data points at a time, roughly 10% of the data set) against this linear regression, with identical results.
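A sketch of one way to implement the sign constraint and the leave-20-out cross-validation in Python follows (our actual models were built in Excel and Weka, so this is only illustrative; the data arrays and helper names are placeholders):

    import numpy as np
    from scipy.optimize import lsq_linear

    def fit_sign_constrained(X, y, corr_signs):
        """Least-squares fit of y ~ X with each coefficient forced to share the
        sign of the corresponding input's correlation with the outcome score.
        (An intercept column with unconstrained bounds could also be appended.)"""
        lower = np.where(corr_signs > 0, 0.0, -np.inf)
        upper = np.where(corr_signs > 0, np.inf, 0.0)
        return lsq_linear(X, y, bounds=(lower, upper)).x

    def leave_out_cv(X, y, corr_signs, chunk=20):
        """Hold out `chunk` rows at a time (roughly 10% of the data set), fit on
        the rest, and return the correlation between held-out predictions and
        the actual outcome scores."""
        preds, actuals = [], []
        for start in range(0, len(y), chunk):
            test = np.arange(start, min(start + chunk, len(y)))
            train = np.setdiff1d(np.arange(len(y)), test)
            coefs = fit_sign_constrained(X[train], y[train], corr_signs)
            preds.extend(X[test] @ coefs)
            actuals.extend(y[test])
        return np.corrcoef(preds, actuals)[0, 1]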

We ultimately used these linear regression coefficients to help us identify the most useful and relevant independently predictive variables in the Self-Reflection Tool and to construct the linear regression model provided in that tool.

Data Verification


We asked respondents to subjectively grade nearly everything in our survey.  Therefore, we cannot independently verify the accuracy of the responses: we did not work on the game development teams the respondents reported on, and in most cases we don't even know which specific projects the responses relate to and have no way to find out.

However, we did ask an optional question at the end regarding the name of the project in question.  Roughly 10% of our respondents answered this question.  This allowed us to do two things:

  • For those that did answer the question, we looked at MetaCritic scores of those game projects, and were able to verify that the question regarding MetaCritic scores had indeed been answered accurately.
  • We had hoped that there would be several cases where different people on the same team reported on their project.  However, there is only one case in our data where two respondents reported on the same project AND supplied the name of the project in this optional answer field.  We compared these two responses and found the answers quite similar, with the answers to most questions differing by 1-2 gradations at most.
Therefore, although we have no way to independently verify the data, those two avenues of investigation underscored that we have no reason to doubt the veracity of the data.

Additionally, although some of our friends at Gamasutra were worried about the survey potentially being overrun or trolled by those who use the "#GamerGate" hashtag on Twitter (as a previous Gamasutra developer survey had allegedly been corrupted recently by this loose affiliation of individuals apparently angry at that publication), we heard no rumblings of any ill will toward our survey on social media, and we felt it was unlikely that anyone would complete an entire 120-question survey just to corrupt the results.  We also felt that anyone attempting that kind of "trolling" would likely reveal themselves with snarky comments at the end of the survey, and we saw no comments whatsoever that appeared snarky, disingenuous, sarcastic, or otherwise likely to have come from anyone other than genuine game developers.  Therefore, we see no evidence that any such corruption occurred.

"Bitterness Bias"


Some on our team pointed out that there may have been pre-existing bias on the part of respondents to answer questions in a positive or negative way depending on the outcome.  In other words, participants who worked on a troubled project were probably more likely to feel lingering bitterness toward the project or the team and answer negatively -- especially if they were laid off or experienced significant stress on the team -- while respondents who had a positive experience would be more likely to answer positively.

We cannot deny that some level of bias is entirely possible, or even quite likely, and that this surely impacted the answers to some degree.

However, a large part of the point of our study was to identify which specific factors made the most difference in outcomes.  We would expect that when people felt emotional bias toward or against a past game project that skewed their answers away from what they might otherwise have answered, this bias would affect most, if not all, of their answers in that direction.  However, we should not expect that it would create the kinds of clear correlations that we actually see in the study, where some elements have far stronger correlations than others.


Why We Call it a Predictive Model


We refer to our linear regression models (both the one used in the charts in parts 1 and 2 of our article series, and the slightly different ones included with the Team Self-Reflection Tool) as "predictive models."

We justify this claim on the grounds that with every linear regression we've built, we've been able to predict the outcome scores from the input factors with a very high degree of accuracy (correlations of 0.6-0.82).  We've also maintained an "out-of-sample" set in each case, and we were able to show that the prediction performed just as well on the out-of-sample group as it did on the training set.

One can certainly argue that our correlations do not imply direct causal links, and there may be additional factors involved behind the factors we listed that are the actual causes of the correlations.  This may be true; however, this does not make it any less of a predictive model.

We know that the outcomes are caused either by the factors we listed or by other, unlisted factors that influenced both these factors and the related outcomes; we know for certain that the causality does not go the other way (i.e., the outcomes do not cause the inputs, since the outcomes came later in time).  So regardless of which case is true, it remains a predictive model.

Optional Questions - Text Entry


We also gave respondents an opportunity to optionally provide three forms of information via text entry boxes.  Roughly 5-10% of our respondents answered each of these.

  • We asked respondents which game they were reporting on.  This was primarily to identify cases where multiple respondents referenced the same game (only 1 game mentioned was shared between 2 respondents, and the respondents' answers were nearly identical).
  • We asked for suggestions for improving the survey in the future.  Some of these are listed in "Future Directions," below.
  • We asked respondents to share any interesting comments about their experiences on the team.  Some of these stories were truly amazing (or horrifying).  Where we can do so without violating privacy, we are sharing these anonymously on Twitter at a rate of 1-4 every week.  All of these are marked with the #GameOutcomes hashtag for easier searching.


Future Directions


We regard the first iteration of the Game Outcomes Project as a surprisingly successful experiment, but it has also given us an excellent guide for refining our questions in the future.

In future versions of the Game Outcomes Project, we expect to be able to narrow our list of over 100 questions down to 50 or so, and add a number of additional questions that we simply did not have room to ask in this version of the survey:
  • What was the working environment like?  Was it mostly comprised of cubicles, 1-person offices, multi-person offices, or open working space?  Did the working space facilitate team communication, and did it foster enough privacy when developers needed to hunker down, focus, and get work done?  Significant research indicates that cubicles, often championed in the industry for fostering communication, actually hinder both productivity and communication (link, link, link, link).  At the same time, there is some evidence to indicate that a moderately noisy environment can enhance creativity.
  • Was a significant amount of work on this project thrown away due to re-work?
  • How did the team hire new members?  Was there a formal, standardized testing process?  Did the team do any level of behavioral interviewing techniques?
  • How long had the team worked together?  A significant amount of research shows that teams working together for the first time are far more mistake-prone, while teams that have worked together longer are far more productive (Hackman).
  • To what extent was the team working on an innovative design as opposed to a clone or a direct sequel?
  • To what extent was the studio’s structure flat or hierarchical?
  • To what extent was customer focus a part of the development culture?
  • Did the game’s production have a discrete preproduction phase, and if so, how good a job did the team do of ironing out the risks during preproduction?
  • Did most team members exhibit professional humility, or were there many know-it-alls who always tried to prove themselves smarter than everyone else?
  • Did the studio have fixed "producer" roles, or were production tasks shared by other team members?
  • How did accountability work at the studio?  How did the company determine who to hold accountable, and for what, and in what way?  Was the management particularly obsessed with holding individuals accountable?
  • When code reviews or peer programming occurred, what form did they take?  Were they performed as team reviews, one-on-one reviews, peer programming sessions, or reviewed checkins?  How many developers were involved in each review, and how frequently were they performed?
  • Did the organization have performance reviews, and if so, how did they work?  Were they manager-driven reviews, "360-degree" reviews, or stack-ranking, a la Valve?  If stack-ranking was used, was it democratic or manager-driven?  Some surprising recent research indicates that performance reviews may not only be useless but counterproductive in their entirety, and there is some evidence that some forms of stack ranking (particularly those that require termination of the lowest-ranked N% of staff) are highly counterproductive.
  • We will likely ask respondents if they took the 2014 survey or read our articles on the results, so we can compare those who answered 'yes' or 'no' to each question and see whether this may have influenced their responses.
  • It has been noted that our focus on outcomes really only looks at one aspect of team effectiveness.  The other two aspects are individual development and well-being (i.e. did team members end up better off than they started, able to keep working, and with an improved skill set?) and team viability (is the team still intact, still working together well, and able to develop another game at least as good as the last one?).  This will allow us to answer not only the question of what made a team successful, but also what made it effective.

We also initially had questions about the development team’s gender and ethnic composition and geographic location in the survey, but we had to drop these due to space constraints and concerns about spurious correlations; we may bring them back in future versions of the survey.

A number of additional questions were directly or indirectly suggested by survey respondents themselves in our optional text entry boxes at the end of the survey:

  • Was the team's leadership mostly composed of individuals from an art, programming, design, production, or biz dev background, or a combination?
  • What percentage of the design decisions on the game was made by those in the trenches – the artists, programmers, or designers?
  • What percentage of the design decisions were made by people in leadership roles with no formal design authority, such as producers?
  • Were the team leads particularly heavy in one discipline (art, engineering, or design), or was there a mix, or was the leadership comprised mostly of producers or managers with little or no discipline-specific experience?
  • If the team disagreed with a decision made by the project’s leadership, was it able to get the decision changed?
  • What was the process for how new team members were trained or incorporated into the team? 
  • To what extent did the team use internally-developed vs. externally-developed tools?
  • How would developers judge their quality of life?
  • Did team members have a good sense of when a feature was complete?
  • Did the team spend a significant amount of time and resources creating demos for upper management or the marketing department?
  • The Joel Test contains a number of interesting questions (more directly related to software development teams) worth investigating further.
  • Was the team happy, and did the project’s leadership work to help facilitate happiness?  (There is significant research showing happiness causes higher productivity, not the other way around; Scott Crabtree of Happy Brain Science can tell you more about this.)
We are also considering modifying our outcome-related questions to allow respondents to rank the outcome factors in order of their importance to the team.  Although we suspect most teams will have return on investment (ROI) as the most important factor, we expect a good deal of variability in the ranking of the remaining factors.  This ordering could help us develop a more accurate aggregate outcome score that takes into account the actual importance of each outcome factor for each team by weighting the factors appropriately.
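As a purely hypothetical sketch of how such rank-based weighting might work (this is a possible future approach, not something we have implemented, and the rank-reciprocal weighting scheme is an assumption):

    def weighted_outcome_score(normalized_outcomes, importance_ranks):
        """normalized_outcomes: factor name -> value in [0, 1].
        importance_ranks: factor name -> rank (1 = most important to the team).
        Rank-reciprocal weights are rescaled so the score stays on a 0-100 scale."""
        raw_weights = {f: 1.0 / r for f, r in importance_ranks.items()}
        scale = len(raw_weights) / sum(raw_weights.values())
        return 25 * sum(normalized_outcomes[f] * raw_weights[f] * scale
                        for f in normalized_outcomes)

    # Example: a team that ranked ROI as its most important outcome.
    print(weighted_outcome_score(
        {"delays": 1.0, "roi": 0.5, "metacritic": 1.0, "internal": 1.0},
        {"roi": 1, "delays": 2, "metacritic": 3, "internal": 4}))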

We will also likely add a question related to overall marketing spend, not because we doubt the role of marketing in altering a project's outcome, but only so that we can subtract the effect that we expect to see here.  In other words, taking marketing budgets into account will help us more accurately estimate the effect of all the other factors.

We will likely use a lower p-value threshold (0.01) in order to further reduce any uncertainty about our results (although, again, using this lower threshold on our current data set affects fewer than 5% of our results).

Finally, we will ask participants whether they also took the 2014 version of the Game Outcomes Project survey and/or read its results.  This will allow us to detect any potential biases or differences in answers between those who did and did not participate in the previous survey, since awareness of our intentions with the survey may have influenced their responses.