The Intelligence Engine: 2014

Friday, December 19, 2014

The Game Outcomes Project, Part 1: The Best and the Rest

This article is the first in a 5-part series.

Part 1: The Best and the Rest is also available here: (Gamasutra) (BlogSpot) (in Chinese)
Part 2: Building Effective Teams is available here: (Gamasutra) (BlogSpot) (in Chinese)
Part 3: Game Development Factors is available here: (Gamasutra) (BlogSpot) (in Chinese)
Part 4: Crunch Makes Games Worse is available here: (Gamasutra) (BlogSpot) (in Chinese)
Part 5: What Great Teams Do is available here: (Gamasutra) (Blogspot) (in Chinese)
For extended notes on our survey methodology, see our Methodology blog page.
Our raw survey data (minus confidential info) is now available here if you'd like to verify our results or perform your own analysis.

The Game Outcomes Project team includes Paul Tozour, David Wegbreit, Lucien Parsons, Zhenghua “Z” Yang, NDark Teng, Eric Byron, Julianna Pillemer, Ben Weber, and Karen Buro.

What makes the best teams so effective?

Veteran developers who have worked on many different teams often remark that they see vast cultural differences between them. Some teams seem to run like clockwork, and are able to craft world-class games while apparently staying happy and well-rested. Other teams struggle mightily and work themselves to the bone in nightmarish overtime and crunch of 80-90 hour weeks for years at a time, or in the worst case, burn themselves out in a chaotic mess. Some teams are friendly, collaborative, focused, and supportive; others are unfocused and antagonistic. A few even seem to be hostile working environments or political minefields with enough sniping and backstabbing to put a game of Team Fortress 2 to shame.

What causes the differences between those teams? What factors separate the best from the rest?

As an industry, are we even trying to figure that out?

Are we even asking the right questions?

These are the kinds of questions that led to the development of the Game Outcomes Project. In October and November of 2014, our team conducted a large-scale survey of hundreds of game developers. The survey included roughly 120 questions on teamwork, culture, production, and project management. We suspected that we could learn more from a side-by-side comparison of many game projects than from any single project by itself, and we were convinced that finding out what great teams do that lesser teams don’t do – and vice versa – could help everyone raise their game.

Our survey was inspired by several of the classic works on team effectiveness. We began with the 5-factor team effectiveness model described in the book Leading Teams: Setting the Stage for Great Performances. We also incorporated the 5-factor team effectiveness model from the famous management book The Five Dysfunctions of a Team: A Leadership Fable and the 12-factor model from 12: The Elements of Great Managing, which is derived from aggregate Gallup data from 10 million employee and manager interviews. We felt certain that at least one of these three models would surely turn out to be relevant to game development in some way.

We also added several categories with questions specific to the game industry that we felt were likely to show interesting differences.

On the second page of the survey, we added a number of more generic background questions. These asked about team size, project duration, job role, game genre, target platform, financial incentives offered to the team, and the team’s production methodology.

We then faced the broader problem of how to quantitatively measure a game project’s outcome.

Ask any five game developers what constitutes “success,” and you’ll likely get five different answers. Some developers care only about the bottom line; others care far more about their game’s critical reception. Small indie developers may regard “success” as simply shipping their first game as designed regardless of revenues or critical reception, while developers working under government contract, free from any market pressures, might define “success” simply as getting it done on time (and we did receive a few such responses in our survey).

Lacking any objective way to define “success,” we decided to quantify the outcome through the lenses of four different kinds of outcomes. We asked the following four outcome questions, each with a 6-point or 7-point scale:

"To the best of your knowledge, what was the game's financial return on investment (ROI)? In other words, what kind of profit or loss did the company developing the game take as a result of publication?"
"For the game's primary target platform, was the project ever delayed from its original release date, or was it cancelled?"
"What level of critical success did the game achieve?"
"Finally, did the game meet its internal goals? In other words, to what extent did the team feel it achieved something at least as good as it was trying to create?"

We hoped that we could correlate the answers to these four outcome questions against all the other questions in the survey to see which input factors had the most actual influence over these four outcomes. We were somewhat concerned that all of the “noise” in project outcomes (fickle consumer tastes, the moods of game reviewers, the often unpredictable challenges inherent in creating high-quality games, and various acts of God) would make it difficult to find meaningful correlations. But with enough responses, perhaps the correlations would shine through the inevitable noise.

We then created an aggregate “outcome” value that combined the results of all four of the outcome questions as a broader representation of a game project’s level of success. This turned out to work nicely, as it correlated very strongly with the results of each of the individual outcome questions. Our Methodology blog page has a detailed description of how we calculated this aggregate score.

We worked carefully to refine the survey through many iterations, and we solicited responses through forum posts, Gamasutra posts, Twitter, and IGDA mailers. We received 771 responses, of which 302 were completed, and 273 were related to completed projects that were not cancelled or abandoned in development.

The Results

So what did we find?

In short, a gold mine. The results were staggering.

More than 85% of our 120 questions showed a statistically significant correlation with our aggregate outcome score, with a p-value under 0.05 (the p-value gives the probability of observing data like that in our sample if the variables were truly independent; a small p-value can therefore be interpreted as evidence against the assumption that the variables are independent). This correlation was moderate or strong in most cases (absolute value > 0.2), and most of the p-values were in fact well below 0.001. We were even able to develop a linear regression model that showed an astonishing 0.82 correlation with the combined outcome score (shown in Figure 1 below).

Figure 1. Our linear regression model (horizontal axis) plotted against the composite game outcome score (vertical axis). The black diagonal line is a best-fit trend line. 273 data points are shown.

To varying extents, all three of the team effectiveness models (Hackman's “Leading Teams” model, Lencioni's “Five Dysfunctions” model, and the Gallup “12” model) proved to correlate strongly with game project outcomes.

We can’t say for certain how many relevant questions we didn’t ask. There may well be many more questions waiting to be asked that would have shined an even stronger light on the differences between the best teams and the rest.

But the correlations and statistical significance we discovered are strong enough that it’s very clear that we have, at the very least, discovered an excellent partial answer to the question of what makes the best game development teams so successful.

The Game Outcomes Project Series

Due to space constraints, we’ll be releasing our analysis as a series of several articles, with the remaining 3 articles released at 1-week intervals beginning in January 2015. We’ll leave off detailed discussion of our three team effectiveness models until the second article in our series to allow these topics the thorough analysis they deserve.

This article will focus solely on introducing the survey and combing through the background questions asked on the second survey page. And although we found relatively few correlations in this part of the survey, the areas where we didn’t find a correlation are just as interesting as the areas where we did.

Project Genre and Platform Target(s)

First, we asked respondents to tell us what genre of game their team had worked on. Here, the results are all across the board.

Figure 2. Game genre (vertical axis) vs. composite game outcome score (horizontal axis). Higher data points (green dots) represent more successful projects, as determined by our composite game outcome score.

We see remarkably little correlation between game genre and outcome. In the few cases where a game genre appears to skew in one direction or another, the sample size is far too small to draw any conclusions, with all but a handful of genres having fewer than 30 responses.

(Note that Figure 2 uses a box-and-whisker plot, as described here).

We also asked a similar question regarding the product’s target platform(s), including responses for desktop (PC or Mac), console (Xbox/PlayStation), mobile, handheld, and/or web/Facebook. We found no statistically significant results for any of these platforms, nor for the total number of platforms a game targeted.

Project Duration and Team Size

We asked about the total months and years in development; based on this, we were able to calculate each project’s total development time in months:

Figure 3. Total months in development (horizontal axis) vs game outcome score (vertical). The black diagonal line is a trend line.

As you can see, there’s a small negative correlation (-0.229, using the Spearman correlation coefficient), and the p-value is 0.003. This negative correlation is not too surprising, as troubled projects are more likely to be delayed than projects that are going smoothly.
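For readers who want to see what a Spearman correlation and its p-value actually measure, here is a minimal sketch using only the Python standard library. The data is invented for illustration and is not taken from the survey. The p-value is estimated with a permutation test, which is an assumption on our part: the article does not say how its p-values were computed, and a permutation test is simply a direct way to ask how often shuffled (independent) data would look at least this correlated.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def permutation_p_value(xs, ys, n_shuffles=5000, seed=0):
    """Two-sided p-value: share of shuffles at least as correlated as observed."""
    rng = random.Random(seed)
    observed = abs(spearman(xs, ys))
    rank_x = average_ranks(xs)
    shuffled = average_ranks(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between the two variables
        if abs(pearson(rank_x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


# Invented toy data (NOT survey data): months in development vs. outcome score.
months = [6, 9, 12, 14, 18, 24, 30, 36, 48, 60]
scores = [72, 65, 70, 58, 61, 66, 50, 55, 47, 52]

rho = spearman(months, scores)
p_value = permutation_p_value(months, scores)
print(f"rho = {rho:.3f}, p = {p_value:.4f}")
```

Because Spearman works on ranks rather than raw values, it is not thrown off by a handful of very long projects, which makes it a reasonable choice for skewed quantities such as development time.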

We also asked about the size of the team, both in terms of the average team size and the final team size. Average team size ranged from 1 to 500, with a mean of 48.6; final team size ranged from 1 to 600, with a mean of 67.9. Both showed a slight positive correlation with project outcomes, as shown below, but in both cases the p-value is well over 0.1, so neither correlation is statistically significant or noteworthy.

Note that in both figures below, the horizontal axis is shown on a logarithmic scale, which makes the linear trend line appear curved.

Figure 4. Average team size correlated against game project outcome (vertical axis).

Figure 5. Final team size correlated against game project outcome (vertical axis).
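The curved appearance of the linear trend lines in Figures 4 and 5 follows directly from the logarithmic horizontal axis. The short sketch below illustrates why, using a hypothetical trend line whose intercept and slope are invented for illustration (they are not fitted to the survey data): points spaced evenly along a log axis are ten times further apart in actual team size at each step, so a constant slope produces ever larger rises.

```python
import math

# Hypothetical linear trend line: score = intercept + slope * team_size.
# Both coefficients are invented for illustration, not fitted to survey data.
intercept, slope = 55.0, 0.02


def trend(team_size):
    """Value of the (straight) trend line at a given team size."""
    return intercept + slope * team_size


# Team sizes spaced evenly on a log10 axis (each is 10x the previous one).
team_sizes = [1, 10, 100, 1000]
positions = [math.log10(size) for size in team_sizes]
values = [trend(size) for size in team_sizes]

# On a log axis these points sit one unit apart horizontally, but the
# vertical rise between neighbours grows tenfold at each step, so the
# straight line appears to bend upward on the chart.
rises = [b - a for a, b in zip(values, values[1:])]
for pos, size, value in zip(positions, team_sizes, values):
    print(f"log10(size) = {pos:.0f}  size = {size:>4}  trend = {value:.2f}")
print("rise per axis unit:", [round(r, 2) for r in rises])
```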

We also analyzed the ratio of average to final team size, but we found no meaningful correlations here.

Game Engines

We asked about the technology solution used: whether it was a new engine built from scratch; core technology from a previous version of a similar game or another game in the same series; an in-house / proprietary engine (such as EA Frostbite); or an externally-developed engine (such as Unity, Unreal, or CryEngine).

The results are as follows:

Figure 6. Game engine / core technology used (horizontal axis) vs game project outcome (vertical axis), using a box-and-whisker plot.

Engine / core technology	Average composite score	Standard Deviation	Number of responses
New engine/tech	53.3	18.3	41
Engine from previous version of same or similar game	64.8	15.8	58
Internal/proprietary engine / tech (such as EA Frostbite)	60.7	19.4	46
Licensed game engine (Unreal, Unity, etc.)	55.6	17.5	113
Other	55.5	19.5	15

The results here are less striking the more you look at them. The highest score was for projects that used an engine from a previous version of the same game or a similar one – but that’s exactly what one would expect to be the case, given that teams in this category clearly already had a head start in production, much of the technical risk had already been stamped out, and there was probably already a veteran team in place that knew how to make that type of game!

We analyzed these results using a Kruskal-Wallis one-way analysis of variance, and we found that this question was only statistically significant on account of that very option (engine from a previous version of the same game or similar), with a p-value of 0.006. Removing the data points related to this answer category caused the p-value for the remaining categories to shoot up above 0.3.
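To make the Kruskal-Wallis comparison above more concrete, here is a minimal standard-library sketch of the test. The scores are invented toy data (not the survey responses), the group names are shortened for illustration, and the p-value uses the usual chi-square approximation, which is our assumption about how such a test is typically evaluated rather than a description of the exact tooling used in the study.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution via the incomplete-gamma series."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2, x / 2
    term = total = 1.0
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    lower = math.exp(-half_x + a * math.log(half_x) - math.lgamma(a + 1)) * total
    return max(0.0, 1.0 - lower)


def kruskal_wallis(groups):
    """Return (H, p) for a list of samples; H is corrected for ties."""
    pooled = [value for group in groups for value in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n_total ** 3 - n_total)
    return h, chi2_sf(h, len(groups) - 1)


# Invented toy scores (NOT survey data) for three engine categories.
toy_scores = {
    "new engine": [41, 52, 48, 60, 55, 47],
    "previous version": [68, 74, 59, 81, 66, 72],
    "licensed": [50, 57, 62, 45, 53, 58],
}

h_all, p_all = kruskal_wallis(list(toy_scores.values()))
remaining = [v for name, v in toy_scores.items() if name != "previous version"]
h_rest, p_rest = kruskal_wallis(remaining)
print(f"all groups:       H = {h_all:.2f}, p = {p_all:.4f}")
print(f"without one group: H = {h_rest:.2f}, p = {p_rest:.4f}")
```

Running the test a second time with one category removed mirrors the check described above: if significance disappears, that single category was carrying the result.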

Our interpretation of the data is that the best option for the game engine depends entirely on the game being made and what options are available for it, and that any one of these options can be the “best” choice given the right set of circumstances. In other words, the most reasonable conclusion is that there is no universally “correct” answer separate from the actual game being made, the team making it, and the circumstances surrounding the game's development. That’s not to say the choice of engine isn’t terrifically important, but the data clearly shows that there are plenty of successes and failures in all categories with only minimal differences in outcomes between them, clearly indicating that each of these four options is entirely viable in some situations.

We also did not ask which specific technology solution a respondent’s dev team was using. Future versions of the study may include questions on the specific game engine being used (Unity, Unreal, CryEngine, etc.)

Team Experience

We also asked a question on this page regarding the team’s average experience level, along a scale from 1 to 5 (with a ‘1’ indicating less than 2 years of average development experience, and a ‘5’ indicating a team of grizzled game industry veterans with an average of 8 or more years of experience).

Figure 7. Team experience level ranking (horizontal axis, by category listed above) mapped against game outcome score (vertical axis)

Here, we see a correlation of 0.19 (and p-value under 0.001). Note in particular the complete absence of dots in the upper-left corner (which would indicate wildly successful teams with no experience) and the lower-right corner (which would indicate very experienced teams that failed catastrophically).

So our study clearly confirms the common knowledge in the industry that experienced teams are significantly more likely to succeed. This is not at all surprising, but it's reassuring that the data makes the point so clearly. And as much as we may all enjoy stories of random individuals with minimal game development experience becoming wildly successful with games developed in just a few days (as with Flappy Bird), our study shows clearly that such cases are extreme outliers.

Surprise #1: Incentives

This page of background questions also revealed two major surprises.

The first surprise was financial incentives. The survey included a question: “Was the team offered any financial incentives tied to the performance of the game, the team, or your performance as individuals? Select all that apply.” We offered multiple check boxes to say “yes” or “no” to any combination of financial incentives that were offered to the team.

The correlations are as follows:

Figure 8. Incentives (horizontal axis) plotted against game outcome score (vertical axis) for the five different types of financial incentives, using a box-and-whisker plot. From left to right: incentives based on individual performance, team performance, royalties, incentives based on game reviews/MetaCritic scores, and miscellaneous other incentives. For each category, we split all 273 data points into those excluding the incentive (left side of each box) and those including the incentive (right side of each box).

Of these five forms of incentives, only individual incentives showed statistical significance. Game projects offering individually-tailored compensation (64 out of the 273 responses) had an average score of 63.2 (standard deviation 18.6), while those that did not offer individual compensation had a mean game outcome score of 56.5 (standard deviation 17.7). A Wilcoxon rank-sum test for individual incentives gave a p-value of 0.017 for this comparison.
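For readers unfamiliar with the Wilcoxon rank-sum test, the following is a minimal standard-library sketch of the idea. It pools both groups, ranks every score, and asks whether one group's rank sum is further from its expected value than chance would explain. The scores are invented toy data (not the survey responses), and the p-value uses the normal approximation without a continuity correction, which is an assumption of this sketch rather than a detail stated in the article.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return (z, two-sided p) from the normal approximation, tie-corrected."""
    n_a, n_b = len(group_a), len(group_b)
    n_total = n_a + n_b
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    rank_sum_a = sum(ranks[:n_a])
    # Expected rank sum and variance if both groups share one distribution.
    expected = n_a * (n_total + 1) / 2
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n_a * n_b / 12 * ((n_total + 1) - ties / (n_total * (n_total - 1)))
    z = (rank_sum_a - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Invented toy scores (NOT survey data).
with_incentive = [71, 64, 58, 80, 67, 75]
without_incentive = [52, 61, 49, 66, 57, 55, 63, 45]

z_score, p_value = rank_sum_test(with_incentive, without_incentive)
print(f"z = {z_score:.2f}, two-sided p = {p_value:.4f}")
```

Like the Spearman coefficient, this test relies only on the ordering of the scores, so it does not assume the outcome scores are normally distributed.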

All the other forms of incentives – those based on team performance, based on royalties, based on reviews and/or MetaCritic ratings, and any miscellaneous “other” incentives – show p-values that indicate that there was no meaningful correlation with project outcomes (p-values 0.33, 0.77, 0.98, and 0.90, respectively, again using a Wilcoxon rank-sum test).

This is a very surprising finding. Incentives are usually offered under the assumption that they are a huge motivator for a team. However, our results indicate that only individual incentives seem to have the desired effect, and even then, to a much smaller degree than expected.

One possible explanation is that perhaps the psychological phenomenon popularized by Dan Pink may be playing itself out in the game industry – that financial rewards are (according to a great deal of recent research) usually a completely ineffective motivational tool, and actually backfire in many cases.

We also speculate that in the case of royalties and MetaCritic reviews in particular, the sense of helplessness that game developers can feel when dealing with factors beyond their control – such as design decisions they disagree with, or other team members falling down on the job – potentially offsets any motivating effect that incentives may have had. With individual incentives, on the other hand, individuals may feel that their individual efforts are more likely to be noticed and rewarded appropriately. However, without more data, this all remains pure speculation on our part.

Whatever the reason, our results indicate that individually tailored incentives, such as Pay For Performance (PFP) plans, seem to achieve meaningful results where royalties, team incentives, and other forms of financial incentives do not.

Surprise #2: Production Methodologies

Our second big surprise was in the area of production methodologies, a topic of frequent discussion in the game industry.

We asked what production methodology the team used – 0 (don’t know), 1 (waterfall), 2 (agile), 3 (agile using “Scrum”), and 4 (other/ad-hoc). We also provided a detailed description with each answer so that respondents could pick the closest match according to the description even if they didn’t know the exact name of the production methodology. The results were shocking.

Figure 9. Production methodology vs game outcome score.

Here's a more detailed breakdown showing the mean and standard deviation for each category, along with the number of responses in each:

Production methodology	Average composite score	Standard Deviation	Number of responses
Unknown	50.6	17.4	7
Waterfall	55.4	17.9	53
Agile	59.1	19.4	94
Agile using Scrum	59.7	16.9	75
Other / Ad-hoc	57.6	17.6	44

What’s remarkable is just how tiny these differences are. They almost don’t even exist.

Furthermore, a Kruskal-Wallis H test indicates a very high p-value of 0.46 for this category, meaning that we truly can’t infer any relationship between production methodology and game outcome. Further testing of the production methodology against each of the four game project outcome factors individually gives identical results.

Given that production methodologies seem to be a game development holy grail for some, one would expect to see major differences, and that Scrum in particular would be far out in the lead. But these differences are tiny, with a huge amount of variation in each category, and the correlations between the production methodology and the score have a p-value too high for us to reject the assumption that the two are independent. Scrum, agile, and “other” in particular are essentially indistinguishable from one another. “Unknown” is far higher than one would expect, while “Other/ad-hoc” is also remarkably high, indicating that there are effective production methodologies available that aren’t on our list (interestingly, we asked those in the “other” category for more detail, and the Cerny method was listed as the production methodology for the top-scoring game project in that category).

Also, unlike our question regarding game engines, we can't simply write this off as some methodologies being more appropriate for certain kinds of teams. Production methodologies are generally intended to be universally useful, and our results show no meaningful correlations between the methodology and the game genre, team size, experience level, or any other factors.

This raises the question: where’s the payoff?

We’ve seen several significant correlations in this article, and we will describe many more throughout our study. Articles 2 and 3 in particular will illustrate many remarkable correlations between many different cultural factors and game outcomes, with more than 85% of our questions showing a statistically significant correlation.

So it’s very clear that where there were significant drivers of project outcomes, they stood out very clearly. Our results were not shy. And if the specific production methodology a team uses is really vitally important, we would expect that it absolutely should have shown up in the outcome correlations as well.

But it’s simply not there.

It seems that in spite of all the attention paid to the subject, the particular type of production methodology a team uses is not terribly important, and it is not a significant driver of outcomes. Even the much-maligned “Waterfall” approach can apparently be made to work well.

Our third article will detail a number of additional questions we asked around production that give some hints as to what aspects of production actually impact project outcomes regardless of the specific methodology the team uses -- although these correlations are still significantly weaker on average than any of our other categories concerning culture.

Conclusions

We are beginning to crack open the differences that separate the best teams from the rest.

We have seen that four factors – total project duration, team experience level, financial incentives based on individual performance, and re-use of an existing game engine from a similar game – have clear correlations with game project outcomes.

Our study found several surprises, including a complete lack of correlation between project outcomes and several factors that one would assume should have a large impact, such as team size, game genre, target platforms, the production methodology the team used, or any additional financial incentives the team was offered beyond individual performance compensation.

In the second article in the series, we discuss the three team effectiveness models that inspired our study in detail and illustrate their correlations with the aggregate outcome score and each of the individual outcome questions. We see far stronger correlations than anything presented in this article.

Following that, the third article will explore additional findings around many other factors specific to game development, including technology risk management, design risk management, crunch / overtime, team stability, project planning, communication, outsourcing, respect, collaboration / helpfulness, team focus, and organizational perceptions of failure. We will also summarize our findings and provide a self-reflection tool that teams can use for postmortems and self-analysis.

Finally, our fourth article will bring our data to bear on the controversial issue of crunch and draw unambiguous conclusions.

The Game Outcomes Project team would like to thank the hundreds of current and former game developers who made this study possible through their participation in the survey. We would also like to thank IGDA Production SIG members Clinton Keith and Chuck Hoover for their assistance with question design; Kate Edwards of the IGDA for assistance with promotion; and Christian Nutt and the Gamasutra editorial team for their assistance in promoting the survey.

For announcements regarding our project, follow us on Twitter at @GameOutcomes

Thursday, November 13, 2014

Game Outcomes Project Methodology

This page is designed to explain the technical details of the Game Outcomes Project, as the technical appendix to a 5-part series.

In October and early November of 2014, the Game Outcomes Project team ran a survey targeting game developers asking roughly 120 questions about each respondent's most recent team-based game development effort. Questions centered on development culture, teamwork, project management, and the project's outcome.

We received 771 responses, of which 302 were completed, and 273 referred to projects that were neither cancelled nor abandoned.

We chose to exclude responses related to cancelled and abandoned projects for reasons we explain at the end of this post. This blog post explains our survey design methodology and our analytical methodology for the remaining 273 responses for those who wish to delve into the statistical and mathematical details.

In other words, this is the page for all of the gory technical details we didn't want to put in the other articles.

You can download the survey questions here: (PDF).

We will also make our full data set available shortly after publication of the final article.

Survey Design Approach

Our survey was designed as follows:

Page 1 was merely a qualifier page. In order to ensure that we received responses only from our target demographic, we asked respondents to only complete the survey for projects with known outcomes within the last 3 years which had a team size of 3 or more and on which they had served in a development role. Although we posed these questions as four check boxes, this was merely to dissuade casual survey-takers or inappropriate respondents, and the answers to these questions were ignored in our analysis.
Page 2 contained background questions such as team size, production methodology, financial incentives offered to the team, and project lifetime.
Pages 3 and 4 contained a little over 100 questions around teamwork, culture, and various other factors relating to game development using a 7-point Likert scale.
Page 5 asked only four questions, asking respondents to rate the outcome of the project along a 6- or 7- point Likert scale for four different dimensions: project delays or cancellation, return on investment, aggregate review scores (MetaCritic or GameRankings), and the team's internal satisfaction with whether the project had achieved its goals.

In designing the survey, we took several steps to help reduce the risk of bias. Although some survey respondents have complained about these aspects of the survey, they were entirely intentional:

We occasionally asked important questions twice, once with a positive frame (“I rarely had to work overtime”) and once with a negative frame (“I had to work a lot of overtime”). We did this as we were concerned that if we asked all questions with the same frame, respondents who felt positively or negatively about a project would be tempted to simply scroll through the list and click somewhere along the “agree” or “disagree” part of the spectrum. We felt that intentionally disrupting this behavior by occasionally shifting the frame of the questions would force the respondent to pay attention and consider each question individually.

We sometimes asked slightly different versions of the same question in an attempt to tease out the cause of some phenomenon. For example, we had five questions related to “crunch” and overtime, and one of those questions asked about voluntary overtime while another asked about mandatory, imposed overtime. We felt we could use these subtle distinctions to tease out deeper causal factors (spoiler: we could, and we did).

We deliberately removed the section names and enabled randomization of the question ordering via SurveyMonkey. Although this led to a large wall of questions that was off-putting to some respondents, we felt that announcing the section names openly might tip our hand as to what we were looking for in each section, and allowing the answers in each section to remain clustered together would likely have a similar effect and allow respondents to simply use the same answer for all questions in that group. By randomizing the ordering of all the questions on a given page, we would greatly reduce the likelihood of these sorts of phenomena.

We added a qualification page at the beginning of the survey asking respondents to continue only if they had worked on a team project in a development role within the last 3 years that had some sort of known outcome. We deliberately wanted to avoid continuously-developed projects as the outcomes of these types of efforts are much more difficult to quantify.
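The positively and negatively framed question pairs described above have to be put on a common scale before they can be compared. This post does not say how that was done, so the following is only a sketch of one common approach, reverse-scoring, under the assumption that answers are coded 1 (strongly disagree) to 7 (strongly agree). The variable names and the example answers are hypothetical.

```python
SCALE_POINTS = 7  # the culture questions used a 7-point Likert scale


def reverse_score(answer, scale_points=SCALE_POINTS):
    """Flip a 1..scale_points answer so strong agreement becomes strong disagreement."""
    if not 1 <= answer <= scale_points:
        raise ValueError(f"answer must be between 1 and {scale_points}")
    return scale_points + 1 - answer


# Hypothetical coding: 1 = strongly disagree ... 7 = strongly agree.
rarely_overtime = 6    # answer to "I rarely had to work overtime"
lots_of_overtime = 2   # answer to "I had to work a lot of overtime"

# After flipping the negatively framed item, both answers point the same way.
aligned = reverse_score(lots_of_overtime)
inconsistency = abs(rarely_overtime - aligned)
print(f"aligned answer = {aligned}, inconsistency = {inconsistency}")
```

A large gap between the two aligned answers would suggest a respondent who clicked through the page without reading each question, which is exactly the behavior the mixed framing was meant to discourage.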

Participants

We recruited participants from several different sources. Of the 771 responses we received, our best guess as to the distribution (based on the timings of the various surges we observed in survey completion) is as follows:

~100 responses: posts on TheChaosEngine.com (internal, private game industry forums)
~120 responses: announcement on Gamasutra.com
~400 responses: direct IGDA mailer exclusively to IGDA members
~150 responses: Twitter announcements, various other forum posts, and other/unknown sources

Given the diversity of sources both here and in our final responses, we feel comfortable asserting that our results represent many different teams of many different sizes (though with a moderate bent toward AAA development over indie development, based on the "final team size" results).

However, we have no way to track completion rates, so it's impossible for us to determine which of the final 273 responses (all responses which were fully completed and referred to non-cancelled projects) derived from each source.

The Aggregate Game Outcome Score

Lacking any way to objectively define “success” or "failure," we decided that the best way to quantify the outcome was through the lenses of four different kinds of outcomes – critical reception, return on investment (ROI), delays or cancellation, and the team’s own perception of its success or failure – and later combine them into a single “outcome” score during the post-survey analysis phase. This led to four questions, each of which allowed answers along a 6- or 7-point scale.

Delays: “For the game's primary target platform, was the project ever delayed from its original release date, or was it cancelled?”
ROI: “To the best of your knowledge, what was the game's financial return on investment (ROI)? In other words, what kind of profit or loss did the company developing the game take as a result of publication?”
MetaCritic: “To the best of your knowledge, was the game a critical success?”
Internal: “Finally, did the game meet its internal goals? In other words, was the team happy with the game they created, and was it at least as good as the game you were trying to make?”

We base the decision to combine the four outcome values into a single score on two factors. First, all four questions address different aspects of the same project’s outcome, so it seems intuitively reasonable that they are related. All four of these aspects must also have come after the end of development, and so had to have been caused by the development cycle itself (or by other factors, such as consumer tastes and marketing spend) rather than by each other.

Secondly, all four outcomes are strongly positively correlated to one another, as shown in the scatterplots below.

Figure 1. Animated GIF of cross-correlations between all four game project outcome factors (on a 4-second delay).

Note that this image is an animated GIF with a 4-second delay; if you don't see it changing, wait a bit longer for it to finish loading. Also note that all data has been randomly "jittered" slightly for these charts to make coincident data points more visible. Note also that all four of these output dimensions have been "normalized" on a 0-1 scale with a "1" being the best possible outcome (the game shipped on time, critics loved it, it made huge piles of cash, or the team was thrilled with the fruits of their own efforts), and lower values being quantized equally along the 0-1 scale depending on the number of gradations in the question.

Each of these correlations has a p-value (statistical significance) under 0.05. The p-value gives the probability of observing a correlation at least as strong as the one in our sample if the variables were truly independent; a small p-value can therefore be interpreted as evidence against the assumption that the variables are independent. This makes it very clear that the four aspects of game project outcomes are interrelated.

We eventually settled on a simple non-weighted sum for the aggregate outcome score. Although we were tempted to give each outcome value a coefficient, there is no objective basis for determining the coefficients.

We assigned the best possible outcome for each factor (terrific reviews, makes lots of money, no delays, team couldn't be happier with it) a value of 1.0, and we gave a worse outcome a correspondingly lower score (closer to 0) along a linear scale depending on the number of gradations in the questions asked (some of the outcome questions were asked on a 6-point scale, others on a 7-point scale). We then added them together.

Score = 25 * ((Delays) + (ROI) + (MetaCritic) + (Internal))

Note that the multiplication by 25 effectively converts the score to a 0-100 range, since each of the 4 outcome values is between 0 and 1.
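As a minimal sketch of the scoring just described, the Python snippet below normalizes each answer onto a 0-1 scale and applies the formula. The function names, the assumption that answers are coded as integers from 1 (worst) to the number of gradations (best), and the example values are illustrative only; they are not taken from the spreadsheet used in the analysis.

```python
def normalize(answer: int, gradations: int) -> float:
    """Map an answer coded 1 (worst) .. gradations (best) onto 0.0 .. 1.0."""
    if not 1 <= answer <= gradations:
        raise ValueError("answer outside the question's scale")
    return (answer - 1) / (gradations - 1)


def outcome_score(delays: float, roi: float, metacritic: float, internal: float) -> float:
    """Unweighted sum of the four normalized outcomes, scaled to 0-100."""
    return 25 * (delays + roi + metacritic + internal)


# Hypothetical response; which questions use 6 vs. 7 gradations is assumed here.
example = outcome_score(
    normalize(6, 6),  # best answer on a 6-point scale -> 1.0
    normalize(4, 7),  # middle answer on a 7-point scale -> 0.5
    normalize(7, 7),  # best answer on a 7-point scale -> 1.0
    normalize(1, 6),  # worst answer on a 6-point scale -> 0.0
)
print(example)  # 62.5
```

Because every term carries the same weight, a project that is perfect on two outcomes and poor on the other two lands near the middle of the range.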

We also experimented with exponents for each factor which we tuned in Solver to try to maximize the cross-correlations between the outcome factors, and with multiplying them as a probability value instead of simply adding them. However, we found that simply adding the four outcome factors, in addition to being simplest, achieved the highest correlation, and we could not justify the additional complexity of any other approach.

Missing Data Handling

Roughly 5% of the data in our survey was missing, as we allowed respondents to leave a small number of questions blank on pages 3-5 of the survey.

For the majority of the responses, we simply averaged the non-blank data for each question using the AVERAGEIF() function in Excel, and then used this to fill missing data for that question.

For the four outcome questions, given the critical nature of these values, we felt a more exhaustive approach was required. Here, we used the mean of two values: the average value of all non-empty responses to that question, and the average of all other non-empty outcome values for that response.
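The two fill rules above can be sketched in Python as follows. This is an illustration rather than the Excel workbook itself: the row layout (one list per response, with None for a blank), the function names, and the fallback used when a respondent left every other outcome blank are all assumptions.

```python
from statistics import mean


def column_mean(rows, col):
    """Average of the non-blank answers to one question."""
    return mean(r[col] for r in rows if r[col] is not None)


def fill_with_column_mean(rows, col):
    """Ordinary questions: replace blanks with the question's average."""
    avg = column_mean(rows, col)
    return [avg if r[col] is None else r[col] for r in rows]


def fill_outcome(rows, col, outcome_cols):
    """Outcome questions: mean of (a) the question's average and
    (b) the average of the respondent's other non-blank outcomes."""
    avg = column_mean(rows, col)
    filled = []
    for r in rows:
        if r[col] is not None:
            filled.append(r[col])
            continue
        others = [r[c] for c in outcome_cols if c != col and r[c] is not None]
        # Assumption: fall back to the question average if no other outcome exists.
        filled.append(mean([avg, mean(others)]) if others else avg)
    return filled


# Hypothetical normalized outcome data: two outcome columns, three responses.
rows = [[1.0, 0.5], [0.5, None], [None, 1.0]]
print(fill_with_column_mean(rows, 1))   # [0.5, 0.75, 1.0]
print(fill_outcome(rows, 0, (0, 1)))    # [1.0, 0.5, 0.875]
```

The second rule pulls a missing outcome toward the respondent's own other outcomes, so a project that did well elsewhere is not simply assigned the survey-wide average.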

Correlations and p-Values

As nearly all of our data used a Likert scale, the Spearman correlation was a more appropriate measure than the more commonly-used Pearson correlation coefficient. This required us to use the SCORREL() function from the Real Statistics Resource Pack available from real-statistics.com rather than the built-in CORREL() function in Excel.

In practice, we found there was little difference between the two -- typically, a difference of less than 0.02 for nearly all of our correlations, though occasionally (in less than 2% of cases) the difference was as large as 0.07. However, despite these nearly-identical results, we felt it was essential to go the extra mile and use the more-accurate Spearman correlation coefficient.
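To make the distinction concrete, here is a small standard-library Python sketch of both coefficients. It is not the Real Statistics implementation; the function names and sample data are hypothetical. Spearman's coefficient is simply Pearson's coefficient computed on the ranks of the data, with tied values sharing the average of their ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical 5-point Likert answers paired with outcome scores.
answers = [1, 2, 2, 3, 4, 4, 5, 5]
scores = [30, 42, 38, 55, 61, 70, 68, 90]
print(round(pearson(answers, scores), 3), round(spearman(answers, scores), 3))
```

Likert answers produce many ties, which is why the ranking step uses average ranks rather than arbitrary tie-breaking.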

We used a p-value threshold of 0.05 for each factor in our survey; however, only 4 questions had p-values between 0.01 and 0.05, so had we used a lower p-value threshold of 0.01, this would have only invalidated 4 of our 120 questions, which would not materially change our results.

In cases where we compared a binary variable to the combined outcome score, we used the Wilcoxon Rank Sum Test to determine p-values (via the WTEST() Excel function provided by the Real Statistics Resource Pack). This includes the various types of financial incentives discussed in article 1.
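For readers who want to see the shape of such a test, the following Python sketch computes a two-sided Wilcoxon rank-sum test using the large-sample normal approximation. It is an assumption-laden illustration, not the WTEST() implementation: the function name and sample data are hypothetical, and the variance is not corrected for ties.

```python
from math import erfc, sqrt


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p). Tied values receive average ranks; the variance is
    not tie-corrected, so p-values are slightly conservative with many ties.
    """
    pooled = sorted(list(group_a) + list(group_b))
    rank_of = {}  # average 1-based rank for each distinct value
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        i = j + 1

    n_a, n_b = len(group_a), len(group_b)
    w = sum(rank_of[v] for v in group_a)           # rank sum of group A
    mean_w = n_a * (n_a + n_b + 1) / 2             # expected value under H0
    sd_w = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # standard deviation under H0
    z = (w - mean_w) / sd_w
    p = erfc(abs(z) / sqrt(2))                     # two-sided p-value
    return z, p


# Hypothetical outcome scores for projects with and without some incentive.
with_incentive = [72, 65, 80, 58, 90, 77, 69, 84]
without_incentive = [55, 61, 48, 70, 52, 63, 45, 59]
z, p = rank_sum_test(with_incentive, without_incentive)
print(round(z, 2), round(p, 4))
```

The test only uses the ordering of the scores, which suits an aggregate score built from ordinal survey answers.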

In cases where we compared a variable with several discrete, independent values to the combined outcome score (such as which production methodology or game engine was used, as discussed in the first article), we used the Kruskal-Wallis test to determine p-values (via the KTEST() function provided by the Real Statistics Resource Pack).

Cancelled and Abandoned Projects

We decided to ignore responses that turned out to be for cancelled or abandoned projects. This was a tough decision, but the fundamental problem is that we have no good way to assign an outcome value to a game project that was cancelled or abandoned before completion – the “outcome” has to include its critical reception and ROI for a real direct comparison, and since it was cancelled before completion, these will never be known.

Initially, we felt a 0 was a proper score for a cancelled project. This makes intuitive sense, as surely a cancelled project is the worst possible outcome and has no value, right?

But this isn’t necessarily the case. There’s a world of difference between a team abandoning what could have been a great game 3 months into development because they decided that working on some other, even greater game project would be a better use of their time, and a team slogging through a multi-year death march, impairing their health with extended overtime, and ending up with divorces, only to see their game cancelled at the last moment. Those two games should score very differently in terms of their “outcome,” and for cancelled or abandoned projects, that data does not exist.

There’s also the simple fact that many times, cancellations and abandonment are caused by factors outside the team’s control. Perhaps a key employee ran into health issues, or perhaps despite a team being terrific and working on a very promising game, the parent company ran out of money and had to close up shop. These kinds of stories happen all the time, and of course there would be no way for our survey to detect these things.

That’s not to say that cancellation and abandonment are entirely random. However, we found that the correlations with cancellation were generally far lower, and only a handful of variables correlated reasonably well with this outcome. We hope to discuss the cancellation issue further in a future article, but for the main part of our series, we focus solely on non-cancelled game projects.

Predictive Modeling

We looked at a number of different ways of building predictive models that would use all the inputs to predict the aggregate outcome score. We imported the data into Weka and tried the following models:

Linear Regression Full: 0.82
Linear Regression 10-fold: 0.51
M5 Prime Full: 0.89
M5 Prime 10-fold: 0.59
Additive Regression (20 learners) Full: 0.81
Additive Regression (20 learners) 10-fold: 0.62

We also built two linear regression models in Excel, limiting ourselves only to inputs which exhibited statistically significant correlations (p-value < 0.05) with the aggregate outcome score (this excluded only roughly 30 of the ~120 survey questions). The full linear regression achieved a correlation of 0.82, identical to the Weka linear regression above.

However, to avoid overfitting, we later constrained the linear regression so that the regression coefficients had to have the same signs as the correlations of those underlying inputs. This gave us a correlation of 0.73 -- still an excellent correlation.

We also ran cross-validation with separate parts of the data set (excluding 20 data points at a time, roughly 10% of the data set) against this linear regression, with identical results.
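The hold-out procedure can be sketched in Python as follows. This is a deliberately simplified illustration: it fits a line on a single hypothetical input rather than the full set of survey inputs, and the function names, data, and fold size are assumptions. Each block of points is predicted by a model fitted without that block, and the predictions are then correlated with the actual values.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def held_out_predictions(xs, ys, fold_size):
    """Predict every point from a model fitted without that point's fold."""
    predictions = [0.0] * len(xs)
    for start in range(0, len(xs), fold_size):
        held = range(start, min(start + fold_size, len(xs)))
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        slope, intercept = fit_line(train_x, train_y)
        for i in held:
            predictions[i] = slope * xs[i] + intercept
    return predictions


def correlation(a, b):
    """Pearson correlation between predictions and actual values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


# Hypothetical data: one normalized input and an outcome score per project.
xs = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.3, 0.6, 1.0]
ys = [35, 88, 52, 70, 45, 79, 61, 40, 66, 92]
preds = held_out_predictions(xs, ys, fold_size=2)
print(round(correlation(preds, ys), 2))
```

The folds here are taken in row order, so this sketch assumes the rows are not sorted in any meaningful way; shuffling first would be the safer choice on real data.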

We ultimately used these linear regression coefficients to help us identify the most useful and relevant independently predictive variables in the Self-Reflection Tool and to construct the linear regression model provided in that tool.

Data Verification

We asked respondents to subjectively grade nearly everything in our survey. Therefore, we cannot independently verify the accuracy of the responses, as we have not worked on the game development teams the respondents report on, and in most cases, we don't even know what specific projects they relate to and have no way to find out.

However, we did ask an optional question at the end regarding the name of the project in question. Roughly 10% of our respondents answered this question. This allowed us to do two things:

For those that did answer the question, we looked at MetaCritic scores of those game projects, and were able to verify that the question regarding MetaCritic scores had indeed been answered accurately.
We had hoped that there would be several cases where different people on the same team reported on their project. However, there is only one case in our data where two respondents reported on the same project AND supplied the name of the project in this optional answer field. We compared these two results and found that the answers were quite similar, with the answers to most questions differing by 1-2 gradations at most.

Therefore, although we have no way to independently verify the data, those two avenues of investigation underscored that we have no reason to doubt the veracity of the data.

Additionally, some of our friends at Gamasutra were worried about the survey potentially being overrun or trolled by those who use the "#GamerGate" hashtag on Twitter (as a previous Gamasutra developer survey had allegedly been recently corrupted by this loose affiliation of individuals apparently angry at that publication). However, we heard no rumblings of any ill will toward our survey on social media, and we felt it was unlikely that anyone would complete an entire 120-question poll just to try to bastardize a survey. We also felt that anyone attempting that kind of "trolling" would likely reveal themselves with snarky comments at the end of the survey, and we saw no comments whatsoever that appeared snarky, disingenuous, sarcastic, or otherwise likely to have come from anyone other than genuine game developers. Therefore, we have no evidence that any such corruption occurred.

"Bitterness Bias"

Some on our team pointed out that there may have been pre-existing bias on the part of respondents to answer questions in a positive or negative way depending on the outcome. In other words, participants who worked on a troubled project were probably more likely to feel lingering bitterness toward the project or the team and answer negatively -- especially if they were laid off or experienced significant stress on the team -- while respondents who had a positive experience would be more likely to answer positively.

We cannot deny that some minimum level of bias is entirely possible, or even quite likely, and that this surely impacted the answers to some degree.

However, a large part of the point of our study was to identify which specific factors made the most difference in outcomes. We would expect that when people felt emotional bias toward or against a past game project that skewed their answers away from what they might otherwise have answered, this bias would affect most, if not all, of their answers in that direction. However, we should not expect that it would create the kinds of clear correlations that we actually see in the study, where some elements have far stronger correlations than others.

Why We Call it a Predictive Model

We refer to our linear regression models (both the one used in the charts in parts 1 and 2 of our article series, and the slightly different ones included with the Team Self-Reflection Tool) as "predictive models."

We justify this claim because with every linear regression we've built, we've been able to predict the outcome scores from the input factors with a very high degree of accuracy (correlations of 0.6-0.82). We've also maintained an "out-of-sample" set in each case, and we were able to show that the prediction performed just as well on the out-of-sample group as it did on the training set.

One can certainly argue that our correlations do not imply direct causal links, and there may be additional factors involved behind the factors we listed that are the actual causes of the correlations. This may be true; however, this does not make it any less of a predictive model.

We know that the outcomes are caused either by the factors we listed or by other factors not listed which influenced both these factors and the related outcomes; we know for certain that the causality does not go the other way (i.e., the outcomes do not cause the inputs, since they came later in time). So regardless of which case is true, it remains a predictive model.

Optional Questions - Text Entry

We also gave respondents an opportunity to optionally provide three forms of information via text entry boxes. Roughly 5-10% of our respondents answered each of these.

We asked respondents what game they were replying about. This was primarily to identify cases where multiple respondents referenced the same game (only 1 game mentioned was shared between 2 respondents, and the respondents' answers were nearly identical).
We asked for suggestions for improving the survey in the future. Some of these are listed in "Future Directions," below.
We asked respondents to share any interesting comments about their experiences on the team. Some of these stories were truly amazing (or horrifying). Where we can do so without violating privacy, we are sharing these anonymously on Twitter at a rate of 1-4 every week. All of these are marked with the #GameOutcomes hashtag for easier searching.

Future Directions

We regard the first iteration of the Game Outcomes project as a surprisingly successful experiment, but it has also given us an excellent guide for refining our questions in the future.

In future versions of the Game Outcomes Project, we expect to be able to narrow our list of over 100 questions down to 50 or so, and add a number of additional questions that we simply did not have room to ask in this version of the survey:

What was the working environment like? Was it composed mostly of cubicles, 1-person offices, multi-person offices, or open working space? Did the working space facilitate team communication, and did it foster enough privacy when developers needed to hunker down, focus, and get work done? Significant research indicates that cubicles, often championed in the industry for fostering communication, actually hinder both productivity and communication (link link link link). At the same time, there is some evidence to indicate that a moderately noisy environment can enhance creativity.
Was a significant amount of work on this project thrown away due to re-work?
How did the team hire new members? Was there a formal, standardized testing process? Did the team do any level of behavioral interviewing techniques?
How long had the team worked together? A significant amount of research shows that teams working together for the first time are far more mistake-prone, while teams that have worked together longer are far more productive (Hackman).
To what extent was the team working on an innovative design as opposed to a clone or a direct sequel?
To what extent was the studio’s structure flat or hierarchical?
To what extent was customer focus a part of the development culture?
Did the game’s production have a discrete preproduction phase, and if so, how good a job did the team do of ironing out the risks during preproduction?
Did most team members exhibit professional humility, or were there many know-it-alls who always tried to prove themselves smarter than everyone else?
Did the studio have fixed "producer" roles, or were production tasks shared by other team members?
How did accountability work at the studio? How did the company determine who to hold accountable, and for what, and in what way? Was the management particularly obsessed with holding individuals accountable?
When code reviews or peer programming occurred, what form did they take? Were they performed as team reviews, one-on-one reviews, peer programming sessions, or reviewed checkins? How many developers were involved in each review, and how frequently were they performed?
Did the organization have performance reviews, and if so, how did they work? Were they manager-driven reviews, "360-degree" reviews, or stack-ranking, a la Valve? If stack-ranking was used, was it democratic, or manager-driven? Some surprising recent research indicates that performance reviews may actually not only be useless, but counterproductive in their entirety ... and some evidence that some forms of stack ranking (particularly those that require termination of the lowest-ranked N% of staff) are highly counterproductive.
We will likely ask respondents if they took the 2014 survey or read our articles on the results, so we can compare those who answered 'yes' or 'no' to each question and see whether this may have influenced their responses.
It has been noted that our focus on outcomes really only looks at one aspect of team effectiveness. The other two aspects are individual development and well-being (i.e. did team members end up better off than they started, able to keep working, and with an improved skill set?) and team viability (is the team still intact, still working together well, and able to develop another game at least as good as the last one)? This will allow us to not only answer the question of what made a team successful, but what made it effective.

We had also initially included questions about the development team’s gender and ethnic composition and geographic location in the survey, but we had to drop these due to space constraints and concerns about spurious correlations; we may bring them back in future versions of the survey.

A number of additional questions were directly or indirectly suggested by survey respondents themselves in our optional text entry boxes at the end of the survey:

Was the team's leadership mostly composed of individuals from an art, programming, design, production, or biz dev background, or a combination?
What percentage of the design decisions on the game was made by those in the trenches – the artists, programmers, or designers?
What percentage of the design decisions were made by people in leadership roles with no formal design authority, such as producers?
Were the team leads particularly heavy in one discipline (art, engineering, or design), or was there a mix, or was the leadership comprised mostly of producers or managers with little or no discipline-specific experience?
If the team disagreed with a decision made by the project’s leadership, was it able to get the decision changed?
What was the process for how new team members were trained or incorporated into the team?
To what extent did the team use internally-developed vs. externally-developed tools?
How would developers judge their quality of life?
Did team members have a good sense of when a feature is complete?
Did the team spend a significant amount of time and resources creating demos for upper management or the marketing department?
The Joel Test contains a number of interesting questions (more directly related to software development teams) worth investigating further.
Was the team happy, and did the project’s leadership work to help facilitate happiness? (There is significant research showing happiness causes higher productivity, not the other way around; Scott Crabtree of Happy Brain Science can tell you more about this)

We are also considering modifying our outcome-related questions to allow respondents to select the ranking of the outcome factors in order of their importance to the team. Although we suspect most teams will have return-on-investment (ROI) as the most important factor, we suspect a good deal of variability in the ranking of the remaining factors. This ordering could help us develop a more accurate aggregate outcome score that took into account the actual importance of each outcome factor for each team by weighting them appropriately.

We will also likely add a question related to overall marketing spend, not because we doubt the role of marketing in altering a project's outcome, but only so that we can subtract the effect that we expect to see here. In other words, taking marketing budgets into account will help us more accurately estimate the effect of all the other factors.

We will likely use a lower p-value (0.01) in order to further reduce any uncertainty about our results (although, again, using this lower p-value on our current data set affects fewer than 5% of our results).

Finally, we will ask participants if they also took the 2014 version of the Game Outcomes Project survey and/or read its results, as this will allow us to detect any potential biases or differences in answers between those who did and did not participate in the previous survey, and whose answers may have been influenced by some level of bias due to awareness of our intentions with the survey.