Agile metrics and why team surveys fail

Most leaders and organisations understand the importance of metrics, including in agile project management and product management.

Many team surveys fall flat for a variety of reasons, most of which have to do with survey methodology. As a world-leading company in agile metrics, we’ve compiled the most common mistakes made when measuring agile through surveys, and what we’ve done to fix these problems for you.

Mistake #1: Self-reporting is biased

It’s tempting to throw together an agile project management metrics survey and ask teams to complete it.

Self-reporting by teams is one of the most common approaches for gathering metrics about agile teams. This method requires participants to respond to a set of questions without interference from managers. Examples of self-reporting include questionnaires, surveys, or interviews. However, relative to other sources of information, such as expert review, self-reported metrics are often unreliable and threatened by self-reporting bias.

Bias typically arises from social desirability, recall period, sampling approach, and selective recall.

Social desirability bias

When leaders use a survey, questionnaire, or interview to collect data, the questions asked often pit one team against another: who has the highest team happiness, who has the highest velocity, who has the best attendance at the Daily Scrum? Many surveys allow individual results to be compared. And when data isn’t aggregated, no one wants their individual responses to be overtly assessed. These factors result in people responding in the way that they feel is socially desirable: they underestimate and under-report failings, and they over-report successes and wins.

Recall bias

Recall bias is a common error for survey situations, when the respondent doesn’t remember things accurately. It’s not about bad or good memory – people have selective memory by default. Essentially, after a few days, certain things stay and other things fade.

This type of bias often occurs where people are required to evaluate their behaviours and actions retrospectively.

Preventing recall bias requires:

Verification of information by review of pre-existing metrics, data and other types of empirical evidence.
Using standardised data collection protocols, such as running surveys or questionnaires at a Sprint or Program Increment (PI) boundary, administered by the same person each time.

Mistake #2: Survey design is flawed

Questions don’t measure what you think they measure

The power of a survey lies in its ability to measure what it is intended to measure. Does asking about team happiness actually reflect how agile a team is? The scientific literature on teams and work shows that happiness derives from collective goal setting, self-management, mastery, connectedness, and purpose. If you wanted to make a team happy, just tell them they don’t have to do agile any more.

To understand the power of your survey, you need to analyse the responses statistically over time and compare them with independent agile metrics, such as improved quality, delivery rates and cost savings.
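
One simple starting point for that comparison is the correlation between each team’s overall survey score and an independent delivery metric. The sketch below is illustrative only: the team scores and defect counts are made-up numbers, and `pearson_r` is a hypothetical helper, not part of any particular tool.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sample has no variance")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one value per team (assumed, for illustration only)
survey_scores = [3.1, 4.2, 2.8, 5.0, 3.9, 4.6]   # overall survey score
escaped_defects = [14, 9, 16, 4, 10, 6]          # independent quality metric

r = pearson_r(survey_scores, escaped_defects)
print(f"r = {r:.2f}")
```

A strong correlation across a handful of teams is only suggestive. It says nothing about cause, and with so few data points it can easily appear by chance.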

Question design

Many surveys are flawed because of the way they ask their questions. Questions are often overly complex and don’t use the right type of scale or measurement. And while open-ended questions are useful for gaining insight into context, their interpretation is open to bias unless you’re an expert in content analysis and can assess the frequency distribution of themes in responses objectively.

In designing and analysing question responses, be prepared to:

Use samples. Don’t get everyone to answer the survey. Work out how you will get a sample from the larger population.
Pilot the questions. Deliver the questionnaire and analyse the results using statistics.
Use statistical methods to understand emerging themes and statistical relevance.
Count how many times those themes emerge from questions.
Throw out questions that don’t give insights into those themes over time.
Report on the themes, not on the questions themselves.
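
Two of the steps above, drawing a sample and counting themes, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the respondent names, the theme labels and the helper functions `draw_sample` and `count_themes` are all hypothetical, and the theme codes are assumed to have been assigned to each open-ended answer by a reviewer beforehand.

```python
import random
from collections import Counter

def draw_sample(population, sample_size, seed=None):
    """Simple random sample without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, sample_size)

def count_themes(coded_responses):
    """Count how many responses mention each theme (once per response)."""
    counts = Counter()
    for themes in coded_responses:
        counts.update(set(themes))  # set() ignores repeats within one answer
    return counts

# Hypothetical population of 40 respondents
population = [f"person-{i:02d}" for i in range(1, 41)]
sample = draw_sample(population, sample_size=10, seed=7)
print(sample)

# Hypothetical theme codes for four open-ended answers
coded_responses = [
    ["unclear goals", "interruptions"],
    ["interruptions"],
    ["unclear goals", "tooling"],
    ["interruptions", "interruptions"],
]
theme_counts = count_themes(coded_responses)
for theme, n in theme_counts.most_common():
    print(f"{theme}: {n}")
```

Counting each theme once per response keeps one talkative respondent from inflating a theme’s frequency.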

Mistake #3: The results aren't repeatable

Different from chance? The Pepsi/Coke Challenge

If you took a taste test for flavour preference of Coke vs Pepsi, would you be able to tell the difference? Would a clear winner emerge? While the ads might point to Pepsi winning, the company doesn’t tell you that the responses were no different from chance. If Pepsi was tastier than Coke, you’d see a greater than 60/40 split in choosing Pepsi over Coke. The same goes for surveys.
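
Whether a given split is different from chance also depends on how many people took the test. The sketch below uses an exact two-sided binomial test against a 50/50 null hypothesis to check a 60/40 split at three sample sizes. The sample sizes and the helper `binomial_two_sided_p` are assumptions for illustration, not data from any real taste test.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided binomial test against a null probability p_null."""
    def pmf(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # Sum the probability of every outcome that is no more likely than the
    # observed one; the small tolerance guards against floating-point ties.
    tail = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)

p_values = {}
for tasters in (20, 100, 500):  # hypothetical numbers of tasters
    chose_pepsi = (tasters * 6) // 10  # a 60/40 split
    p_values[tasters] = binomial_two_sided_p(chose_pepsi, tasters)
    print(f"{chose_pepsi}/{tasters} chose Pepsi: p = {p_values[tasters]:.4f}")
```

In this sketch, a 60/40 split among 20 tasters is well within chance, among 100 it sits just above the conventional 0.05 threshold, and among 500 it is very unlikely to be chance. The split alone isn’t enough; the number of respondents matters too.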

Repeatable?

If one team was agile and another team wasn’t, would there be a difference in their responses? If the team responds in the morning compared to the afternoon, are the results more or less the same? Is there a statistical difference?

Inter-rater reliability

If the Scrum Master completes the survey for a team and then the Agile Coach for the Agile Release Train also does the survey, do you get the same result? If the survey is designed well, then there should be no statistical difference between the raters’ scores.
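
One common way to quantify agreement between two raters is Cohen’s kappa, which corrects raw agreement for the agreement you’d expect by chance. The sketch below is a minimal, unweighted version. The two sets of ratings are invented for illustration, and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the raters gave the identical score
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score frequencies
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical score throughout
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of the same 10 statements on a 1-6 scale
scrum_master = [5, 4, 4, 2, 6, 3, 5, 2, 4, 5]
agile_coach = [5, 4, 3, 2, 6, 3, 4, 2, 4, 5]

kappa = cohens_kappa(scrum_master, agile_coach)
print(f"kappa = {kappa:.2f}")
```

Unweighted kappa treats a 4 versus a 3 as the same kind of disagreement as a 6 versus a 1. For ordered scales, a weighted kappa or an intraclass correlation is often a better fit.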

Tip #1: Focus less on question-to-question assessment

A survey or questionnaire should provide input into understanding factors for agility (the output). Don’t compare individual questions over time. Compare instead the overall scores to independent variables like time to market, ability to pivot and cost savings.

This is where a data model is important. A data model has inputs (your questions) and understands the relationship between the inputs and outputs. This is what Agile IQ is intended to do.

Tip #2: Get the survey methodology right

Know your limitations in creating a survey tool for agile. It takes years for psychology experts to learn how to make and assess great statistically valid tools.

One of the most effective ways to design a questionnaire is:

Use a statement. Don’t ask a question.
Ask people to indicate the strength of their agreement with the statement on a scale.
Use a Likert scale. 1-6 is often useful. 1-10 might have too many data points. 1-3 is likely to provide too little differentiation between scores.
Know when to use ‘Not Applicable’ as a response option. When the person doing the assessment says a statement doesn’t apply, it could actually mean they’re too embarrassed to answer the question.
If you’re going to ask about a behaviour, the scale should reflect that the behaviour occurs or it doesn’t occur.

Tip #3: Learn how to analyse the results

If you’re just doing a ‘quick and dirty’ survey, be prepared for the results to be little more than sentiment metrics. It will tell you that a score against one question went up, or down, but unless there’s modelling involved, it won’t measure how agile your teams are or if their agility has improved over time.

Human behaviour is complex, so there are many behaviours that provide input into understanding whether teams are getting stronger in their agile practices and mindset. Principal components analysis and factor analysis are the most common ways to understand what a behavioural model tells you about the questions you’ve asked. If the factors are likely to be correlated, you’ll need to do an oblique rotation.
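
To give a feel for what principal components analysis does, the sketch below extracts only the first principal component from a small table of responses, using power iteration on the correlation matrix. It is a teaching sketch under stated assumptions: the responses are invented, Likert scores are treated as interval data, and the function names are hypothetical. It does not perform factor analysis or any rotation, which are best left to a statistics package.

```python
import math

def correlation_matrix(rows):
    """Correlation matrix of the columns (items) of a respondents-by-items table."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    # Sample standard deviations; an item with identical answers from
    # everyone (zero spread) would fail here and should be dropped first.
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1))
        for j in range(m)
    ]
    # Standardise each item so that items with wider spreads don't dominate.
    z = [[(row[j] - means[j]) / sds[j] for j in range(m)] for row in rows]
    return [
        [sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(m)]
        for a in range(m)
    ]

def first_principal_component(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    m = len(matrix)
    # Non-uniform start vector; power iteration fails only if the start is
    # exactly orthogonal to the dominant eigenvector.
    vec = [float(i + 1) for i in range(m)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # The Rayleigh quotient gives the eigenvalue for the converged unit vector.
    av = [sum(matrix[i][j] * vec[j] for j in range(m)) for i in range(m)]
    eigenvalue = sum(v * a for v, a in zip(vec, av))
    return eigenvalue, vec

# Hypothetical responses: 6 respondents x 4 statements on a 1-6 scale
responses = [
    [5, 6, 2, 1],
    [4, 5, 3, 3],
    [2, 1, 5, 4],
    [6, 5, 2, 2],
    [3, 3, 4, 5],
    [1, 2, 6, 6],
]

corr = correlation_matrix(responses)
eigenvalue, weights = first_principal_component(corr)
# The trace of a correlation matrix equals the number of items, so this
# ratio is the share of total variance captured by the first component.
explained = eigenvalue / len(corr)
print(f"variance explained by first component: {explained:.0%}")
print("item weights:", [round(w, 2) for w in weights])
```

Statements whose weights share a sign move together in the responses, and statements with opposite signs move in opposite directions. That grouping is the kind of structure a fuller factor analysis would examine.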

Agile IQ and its data model for agile metrics

Agile IQ’s assessments are comprehensive. They look at every aspect of the agile mindset, from actions to behaviours and values.

How does the assessment work?

Agile IQ uses a 6-point Likert scale with statements about actions and behaviours of agile teams that should be visible to you. The agree/disagree scale doesn’t reflect any value judgement on a particular behaviour or practice. The scale just asks if it happens or not and the strength of that behaviour.

Full-baseline assessment

A full-baseline assessment is done to establish the strength of all the behaviours across all of the potential areas of agile mindset and practices.

Do a full-baseline assessment when you first create a team in Agile IQ. This assessment shouldn’t take longer than 20 minutes.

Quick assessment

A quick assessment takes a small random selection from the baseline questions that have the highest correlation with Agile IQ’s factors, and uses those as the input into the data model.

A quick assessment takes only 5 minutes.

Tips on getting the most out of an Agile IQ assessment

Why is there no “not applicable”?

Because Agile IQ focuses on observed behaviours, there’s no “not applicable”: either the behaviour happens or it doesn’t. Either you’ve seen the behaviour (or you can infer it) or you haven’t; if you haven’t, leave the question as “strongly disagree”.

What if I don’t understand the question?

If you don’t understand the question, just leave it and move on.

I’ve definitely seen the behaviour!

If you know you’ve seen the behaviour in a question, then rate the question as follows:

Strongly agree – Strong empirical evidence (not hearsay) that it always happens, every time, all the time, and there’s good strong evidence that they’ve been doing it for quite a while. This rating really only applies to behaviours that are very long-lived (around a year).
Agree – Some evidence that it happens more often than not.
Slightly agree – Very little evidence. This behaviour is something that they are probably just starting to do. This rating applies most to teams who are just starting out.

I don’t think I’ve seen the team do that

If you haven’t seen the behaviour in question:

Strongly disagree – Absolutely certain that it never happens. It’s not something the team feels is important or even registers as valuable at this time. For teams that are just starting out, there’s going to be a lot of behaviours they just aren’t yet doing. Don’t be afraid to give a new team a “strongly disagree”.
Disagree – Certain it doesn’t happen. No evidence visible, not even anecdotal. The team are aware that it’s something they should consider.
Slightly disagree – Probably doesn’t happen. No evidence visible. The team might say they do it, but you’ve not seen it happen. If the team does a behaviour once in a while or rarely, then “slightly disagree” is a good option to choose.

Conclusions

Many agile project management metrics are based solely around activity metrics and efficiency. Team surveys attempt to provide a qualitative picture and draw connections between delivery metrics and team sentiment. Most, as a result, aren’t statistically powerful or reliable. When reporting to executives about the repeatability and scalability of your agile capability, surveys don’t provide sufficient evidence to understand:

Why some teams are more agile than others.
What strategies and tactics will make agility scalable and repeatable.

Human behaviour is complex and so it needs predictive analytics, statistical models, and rules engines, not team surveys and sentiment analysis. This is where Agile IQ shines.

Agile IQ® is a robust statistical model and tool that empowers teams and managers to understand the strength of their behaviours and promote the metrics that really measure agility.

About the author

Matthew Hodgson

Matt is the CEO of Zen Ex Machina, a Professional Scrum Trainer (PST) and SAFe SPC5. He's an author, keynote speaker, and a regular presenter at international conferences across Australia, the USA, Asia, and Europe.
