Commentary

Statistical Analysis for User Research: When Is It Meaningful?

This article is re-posted from User Centric’s blog.

Consider the following proposition:

We can derive meaningful statistics from the smaller study samples typical of user research, and we shouldn’t be afraid to use these statistics to confidently guide business decisions!

A lot of eyebrows probably inch upward with the mention of “statistics” and “small sample” in the same breath. But Jeff Sauro, a recognized leader in stats in our field, has made it his mission to squeeze as much information as possible out of small samples to guide big decisions. By day, Sauro works at Oracle. By night, he runs http://www.measuringusability.com/, a website about statistics for the usability practitioner. Jeff transforms wonk-speak into something consumable for all. When he talks, he just makes sense.

Sauro recently visited User Centric (now GfK’s User Experience team) to conduct a workshop on statistical analysis for usability research, and he stirred up a lot of good debate around when and how statistics are applicable and meaningful in our work. Below are some UX consultants’ take-aways from the workshop.

Challenge your statistical senses!

For most who learn statistics, numbers like n=30 are hammered home as though they are absolutes. Of course, they’re not, but we end up being more cautious—perhaps overly so—about using statistics when conducting user research. Jeff’s workshop really helped us appreciate how statistics can be used, even with small samples. Conducting statistical analyses allows us to better understand the data that we do get, insofar as we can talk about our confidence in the numbers.

Statistical analysis is most useful when evaluated alongside other data.

It’s not that we can’t get meaningful data from small data sets. It’s just that we should never look at statistical significance in isolation from other qualitative data. Jeff’s practical view is that there are no absolute thresholds that make quantitative data valid or invalid, just levels of confidence that the sample data represent real population parameters. In business, taking action on these numbers translates to taking risks and placing bets.

We often see a single participant do something notable in a usability test involving 25 participants. Statistically, we can say that as many as 20 percent of the entire user population might make the same mistake (though the best estimate is closer to 4 percent).
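To make that concrete, here is a minimal Python sketch using the Wilson score interval (one common choice for small-sample proportions; the workshop’s exact method isn’t specified here) showing how a single observation out of 25 carries an upper bound of roughly 20 percent:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion; behaves
    reasonably even with the small samples typical of usability tests."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# One participant out of 25 made the mistake: the point estimate is 4%,
# but the upper end of the interval is close to 20%.
low, high = wilson_interval(1, 25)
print(f"point estimate: {1 / 25:.0%}, 95% CI: {low:.1%} to {high:.1%}")
```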

The math, on its own, is not meaningful. But in the context of a client’s problem or objective, one can make an educated decision about what level of risk they are or aren’t willing to take. When choosing between designs for a basic mobile app for an e-commerce website, the stakes are not likely to be as high as when validating a drug delivery device, which, if used improperly, could result in an accidental death.

In statistics, “liberal” doesn’t have to be a dirty word.

Some of Jeff’s philosophies contradict much of what the more “conservative” camp of statistical thinkers holds sacred, e.g., being mindful of the family-wise error rate when conducting multiple paired-sample t-tests, not violating the assumptions of the ANOVA, and properly using post-hoc tests when needed. This isn’t to say that the more liberal philosophy is wrong, but the two schools of thought exist for a reason. Each camp understands statistical theory; they simply disagree about which rules are flexible and how far they can be bent. We may, for example, choose to run an ANOVA, and a client may point out that the data violated the homogeneity-of-variance assumption. In that case, we should be able to competently explain why the analysis was conducted and assure the client that this test is actually fairly robust to violations of the homogeneity assumption. The lesson is that, whichever camp we fall into, we should be aware that our clients may hold a different view and be prepared to explain the relevance of our analysis method.
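As a rough illustration of that conversation, the sketch below (with made-up task times and group names) checks the homogeneity-of-variance assumption with Levene’s test, runs the classic one-way ANOVA anyway, and shows a Welch-style fallback that drops the equal-variance assumption:

```python
from scipy import stats

# Hypothetical task times (in seconds) for three design variants.
design_a = [42, 55, 48, 60, 39, 51, 47]
design_b = [65, 72, 58, 80, 70, 61, 69]
design_c = [50, 49, 62, 55, 58, 53, 60]

# Levene's test: a non-significant result means no strong evidence
# against the equal-variance (homogeneity) assumption.
lev_stat, lev_p = stats.levene(design_a, design_b, design_c)
print(f"Levene's test p = {lev_p:.3f}")

# Classic one-way ANOVA: formally assumes homogeneity of variance but is
# fairly robust to moderate violations, especially with equal group sizes.
f_stat, anova_p = stats.f_oneway(design_a, design_b, design_c)
print(f"One-way ANOVA p = {anova_p:.3f}")

# A two-group follow-up that drops the equal-variance assumption entirely:
# Welch's t-test.
t_stat, welch_p = stats.ttest_ind(design_a, design_b, equal_var=False)
print(f"Welch's t-test (A vs. B) p = {welch_p:.3f}")
```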

Whether liberal or conservative, we should always be mindful of tools that let us quickly and easily conduct most types of basic statistical analysis. It doesn’t matter whether the tool is SPSS, R, Matlab, or something else; what matters is that the user understands why they are choosing a particular analysis. It’s easy to follow a quick guide in a software program that steers you toward a certain test, but sometimes (as Admiral Ackbar would say) “It’s a trap!” Whether it’s a t-test, ANOVA, MANOVA, linear regression, or any other test, there are caveats that dictate when one should or shouldn’t use it. Mistakes happen when statistics becomes a matter of plugging numbers into a program without thinking about the experimental structure or the relationships among variables.

Planning is key to good statistical analysis.

We would rarely decide to do stats after the fact. Discussing whether a client wants or could benefit from statistical results (beyond confidence intervals) in advance of the study is key to designing an appropriate study. With large validation studies, it may be appropriate and possible to dig a little deeper, to go beyond qualitatively explaining failures with medical devices. If we know qualitatively that there’s some obvious effect going on, we can gather data about it. If the difference or the relationship is big, we’ll see it, even with a small sample. So there are some big opportunities for our clients with statistical analysis.

One opportunity is comparing against competitors and looking for differences. Being able to say that something is statistically significant can be the icing on the cake that helps the client feel confident that their product or process really differs from the competitor’s. In medical device studies, we can also look at the variables that predict success.
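One way to look at the variables that predict success is a simple logistic regression. The sketch below is only a shape-of-the-analysis illustration: the participant data, predictors, and sample size are invented.

```python
import numpy as np
import statsmodels.api as sm

# Invented data: task success (1/0), prior experience with similar
# devices (1/0), and age in years for 20 participants.
success    = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
experience = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0])
age        = np.array([34, 52, 61, 45, 58, 33, 41, 48, 38, 63,
                       29, 44, 57, 36, 40, 66, 31, 50, 46, 55])

# Logistic regression: model the log-odds of success as a function of
# experience and age. With n = 20 the estimates are noisy -- exactly the
# kind of caveat worth flagging to a client.
X = sm.add_constant(np.column_stack([experience, age]))
result = sm.Logit(success, X).fit(disp=False)
for name, coef, p in zip(["intercept", "experience", "age"],
                         result.params, result.pvalues):
    print(f"{name:>10}: coefficient = {coef:+.2f}, p = {p:.3f}")
```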

Whether we do stats or not really comes down to what the client needs to know, and how important it is for them to know. Is it enough for them to know that there appears to be a qualitative difference? Or do they need to know that the difference is statistically significant? Stats don’t take that much time if you know what you are going to do in advance, and if you build your data sheet accordingly.

Statistical analysis juice isn’t always worth the squeeze.

There is a time-investment issue to be sensitive about. Good statistical analysis can be time consuming. How much analysis is worth it to aid business decisions? The fact is that for some business questions, statistical analysis may not be an efficient use of time and resources, so we need to be mindful of the time we spend doing it, and we need to be sure it is appropriate. We aren’t going to spend time doing stats when it really isn’t necessary, and we aren’t going to waste budget on an overly complex analysis when we don’t need one. Sure, someone can program a mobile phone in a perfect lab setting, and we should design to make it as easy as possible. But are there other designs that would actually be easier in less-than-ideal conditions (such as walking and typing)? In those instances, when you know the context of use is going to be varied, running stats on something observed only in the lab and trusting that as your sole source of data is not going to give you good insight.

Qualitative data is ultimately what helps us improve the user experience.

Case in point: Statistics can tell us that an average rating of 2.9 out of 5 on overall user satisfaction could, with 95 percent confidence, range from 1.8 to 4.0 in the “true” population of users. Statistically speaking, with a larger sample, we could home in on the average rating. For example, with a larger sample, we could be 95 percent confident that the “real” average rating lies somewhere between 2.2 and 3.7. To a product manager, the difference between somewhere around 2 and somewhere near 4 may seem huge, but no matter what the true population average is, it will be valuable for the client to know why some users gave it a 1 or 2.
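As an illustration of how that interval tightens, the sketch below computes t-based confidence intervals for a mean rating of 2.9; the standard deviation (1.5) and the sample sizes are assumptions chosen to roughly reproduce the ranges above:

```python
from math import sqrt
from scipy import stats

def mean_ci(mean, sd, n, confidence=0.95):
    """Two-sided t-based confidence interval for a mean."""
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    half_width = t_crit * sd / sqrt(n)
    return mean - half_width, mean + half_width

# A mean satisfaction rating of 2.9 with an assumed standard deviation of 1.5:
# roughly 10 participants give the wide 1.8-to-4.0 interval, while larger
# samples narrow it considerably.
for n in (10, 18, 30):
    low, high = mean_ci(2.9, 1.5, n)
    print(f"n = {n:2d}: 95% CI {low:.1f} to {high:.1f}")
```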

The point is that we can run more participants and be more precise with the range of the mean, but I would argue that the true value to the product manager is not in the numbers themselves, but in understanding why participants gave low ratings and how to improve the product.

This isn’t to diminish the role of statistics, especially when comparing groups or products, and even with small samples. It’s safe to bet, however, that most of our energies will continue to be spent on analyzing qualitative data to help our clients improve the usability of their products and enhance the overall user experience.

What sample size can “buy.”

Practically speaking, one of the most useful frameworks that Jeff highlighted is a method for thinking about sample size in terms of “bang for your buck.” When we are designing a study with a client, we can begin with a basic question: Are we looking to detect frequency of problems, measure the overall size of a metric, or compare one thing to another? If we are looking to detect problems, the less frequent the problem, the more users we need to test to uncover it.

A general rule of thumb is that there is a good probability of detecting problems that affect 30 percent or more of your user population by testing only seven users. If we want to measure a population parameter like time on task, then we want a high degree of precision. Jeff’s 20/20 rule is that if you test 20 participants, your margin of error averages around 20 percent. If we are comparing things, we need to think about whether the difference we are looking to detect is going to be large (very obvious) or small (very subtle) and select a sample size with an appropriate degree of power (that is, confidence that a difference will be detected if one exists).
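The problem-detection rule of thumb comes from the formula 1 - (1 - p)^n, the chance of seeing a problem at least once when it affects a proportion p of users and n users are tested. The sketch below also shows one way to arrive at the rough 20/20 figure, using a Wilson-style binomial interval at an assumed 50 percent completion rate:

```python
from math import sqrt

def p_detect(p, n):
    """Chance that a problem affecting a proportion p of users is seen
    at least once when n users are tested: 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n

# Problems affecting 30% of users: seven participants already give a high
# chance of observing the problem at least once.
for n in (5, 7, 10):
    print(f"n = {n:2d}: P(see a 30% problem at least once) = {p_detect(0.30, n):.0%}")

# One reading of the 20/20 rule: with 20 participants and a completion rate
# near 50%, the half-width of a Wilson-style binomial interval is about 20 points.
z, p_hat, n = 1.96, 0.5, 20
denom = 1 + z**2 / n
half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
print(f"n = 20, p = 50%: margin of error is about {half_width:.0%}")
```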

Don’t ask if it’s true, ask how it compares.

The practical side of power is that sample size costs money. It’s smart to err on the side of fewer participants, because if you do find significance with a small sample, you can know the effect is really there; it would have had to be big in order to be detectable. Statistics is not about proving things; it’s about deduction. Clients will sometimes ask us, “Does an 85 percent success rate generalize to the population?” The real question should be: compared to what? Is there a difference between a 60 percent success rate (widget A) and an 85 percent success rate (widget B)? This is the difference between precision and hypothesis testing. The practical application of precision is repeatability, e.g., if you do the study again and again, will you get the same score? With hypothesis testing, we don’t care whether we get the same scores on repeat tests. The scores could shift entirely on another study, but if the difference between A’s and B’s scores is still there, we can logically deduce that there is in fact a real difference between the two, apart from the 5 percent possibility that we happened to find a difference by chance alone.
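To make the “compared to what?” question concrete, here is a sketch of a classic two-proportion z-test on hypothetical counts behind those 60 and 85 percent success rates (the counts of 24/40 and 34/40 are invented; small-sample variants such as the N-1 two-proportion test adjust the statistic slightly):

```python
from math import sqrt
from scipy.stats import norm

def two_prop_z_test(x1, n1, x2, n2):
    """Classic two-proportion z-test with a pooled standard error.
    Returns the z statistic and a two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * (1 - norm.cdf(abs(z)))

# Hypothetical counts: widget B succeeds 34 times out of 40 (85%),
# widget A succeeds 24 times out of 40 (60%).
z, p = two_prop_z_test(34, 40, 24, 40)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
# Compare p against your chosen alpha (e.g., 0.05). Small-sample variants
# such as the N-1 two-proportion test scale z by sqrt((N - 1) / N).
```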

These are the important points that should remain top of mind whenever we think about incorporating statistical analysis or even discussion of statistical analysis with clients. It’s all about discussing the tradeoffs in order to reach the most reasonable approach, and finding the right mix of cost and benefit when adding stats to the analysis.

Kirsten Jerch is a Senior UX Lead Specialist, User Experience, at GfK with an extensive background in social science research methodology, user-centered design and human factors validation testing. To reach Kirsten directly, email kirsten.jerch@gfk.com.