Comparing More Than Two Means:
One-Way ANOVA
Copyright © 2009–2023 by Stan Brown, BrownMath.com
When you have several means to compare, it’s not valid just to compare all possible pairs with t tests. Instead, you follow a two-stage process: first, an ANOVA test of whether all the means are equal; then, if they aren’t, a post-hoc analysis to find which means differ.
The characteristic that varies between samples is called the factor. (Every once in a while things are easy.) The r different values or levels of the factor are called the treatments. Here the factor is the choice of fat and the treatments are the four fats, so r = 4.
The computations to test the means for equality are called a 1-way ANOVA or 1-factor ANOVA.
g Fat Absorbed in Batch:

| Fat | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Batch 6 | x̅ | s |
|---|---|---|---|---|---|---|---|---|
| Fat 1 | 64 | 72 | 68 | 77 | 56 | 95 | 72 | 13.34 |
| Fat 2 | 78 | 91 | 97 | 82 | 85 | 77 | 85 | 7.77 |
| Fat 3 | 75 | 93 | 78 | 71 | 63 | 76 | 76 | 9.88 |
| Fat 4 | 55 | 66 | 49 | 64 | 70 | 68 | 62 | 8.22 |

source: Snedecor 1989 [full citation in “References”, below] pp 217–218
Hoping to produce a donut that could be marketed to health-conscious consumers, a company tried four different fats to see which one was least absorbed by the donuts during the deep frying process. Each fat was used for six batches of two dozen donuts each, and the table shows the grams of fat absorbed by each batch of donuts.
It looks like donuts absorb the most of Fat 2 and the least of Fat 4, with intermediate amounts of Fat 1 and Fat 3. But there’s a lot of overlap, too: for instance, even though the mean for Fat 2 is much higher than for Fat 1, one sample of Fat 1, 95 g, is higher than five of the six samples of Fat 2.
Nevertheless, the sample means do look different. But what about the population means? In other words, would the four fats be absorbed in different amounts if you made a whole lot of batches of donuts — do the statistics justify choosing one fat over another? This is the basic question of a hypothesis test or significance test: is the difference great enough that you can rule out chance?
If Fats 2 and 4 were the only ones you had data for, you’d do a good old 2-sample t test. So why can’t you do that anyway? Because that would greatly increase your chances of a Type I error. The reasons are given in the Appendix.
By the way, though usually you are interested in the differences between population means with various treatments, you can also estimate the individual means. If you’re interested, see Estimating Individual Treatment Means in the Appendix.
The ANOVA procedure tests these hypotheses:
H0: μ1 = μ2 = … = μr, all the means are the same
H1: two or more means are different from the others
Let’s test these hypotheses at the α = 0.05 significance level.
You might wonder why you do analysis of variance to test means, but this actually makes sense. The question, remember, is whether the observed difference in means is too large to be the result of random selection. How do you decide whether the difference is too large? You look at the absolute difference of means between treatments (samples), but you also consider the variability within each treatment. Intuitively, if the difference between treatments is a lot bigger than the difference within treatments, you conclude that it’s not due to random chance and there is a real effect.
And this is just how ANOVA works: comparing the variation between groups to the variation within groups. Hence, analysis of variance.
Miller 1986 [full citation in “References”, below] (pages 90–91) is more cautious. When sample sizes are equal but standard deviations are not, the actual p-value will be slightly larger than what you find in the tables. But when sample sizes are unequal, and the smaller samples have the larger standard deviations, the actual p-value “can increase dramatically above” what the tables say, even “without too much disparity” in the standard deviations. “Falsely reporting significant results when the small samples have the larger variances is a serious worry. The lesson to be learned is to balance the experiment [equal sample sizes] if at all possible.”
A 1-way ANOVA tests whether the means of all groups are equal for different levels of one factor, using some fairly lengthy calculations. You could do all the computations by hand as shown in the Appendix, but no one ever does. Here are some alternatives:
On a TI-83/84: press [STAT] [◄] [▲] to select ANOVA(, then enter the list names separated by commas.

When you use a calculator or computer program to do ANOVA, you get an ANOVA table that looks something like this:
| | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Between groups (or “Factor”) | 1636.5 | 3 | 545.5 | 5.41 | 0.0069 |
| Within groups (or “Error”) | 2018.0 | 20 | 100.9 | | |
| Total | 3654.5 | 23 | | | |
Note that the mean square between treatments, 545.5, is much larger than the mean square within treatments, 100.9. That ratio, between-groups mean square over within-groups mean square, is called an F statistic (F = MSB/MSW = 5.41 in this example). It tells you how much more variability there is between treatment groups than within treatment groups. The larger that ratio, the more confident you feel in rejecting the null hypothesis, which was that all means are equal and there is no treatment effect.
But what you care about is the p-value of 0.0069, obtained from the F distribution. The p-value has the usual interpretation: the probability of the between-treatments MS being ≥5.41 times the within-treatments MS, if the null hypothesis is true, is p = 0.0069.
The p-value is below your significance level of 0.05: it would be quite unlikely to have MSB/MSW this large if there were no real difference among the means. Therefore you reject H0 and accept H1, concluding that the mean absorption of all the fats is not the same.
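If you’d rather verify this with software than a calculator, here’s a minimal sketch in Python using SciPy’s f_oneway function (the use of SciPy is my assumption, not part of the original procedure; any statistics package will do):

```python
# One-way ANOVA on the donut data above; a sketch assuming SciPy is installed.
from scipy.stats import f_oneway

fat1 = [64, 72, 68, 77, 56, 95]
fat2 = [78, 91, 97, 82, 85, 77]
fat3 = [75, 93, 78, 71, 63, 76]
fat4 = [55, 66, 49, 64, 70, 68]

F, p = f_oneway(fat1, fat2, fat3, fat4)
print(F, p)  # F ≈ 5.41, p ≈ 0.0069, matching the ANOVA table
```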
An interesting extra parameter can be derived from the ANOVA table; see η²: Strength of Association in the Appendix below.
Now that you know that it does make a difference which fat is used, you naturally want to know which fats are significantly different. This is post-hoc analysis. There are several different post-hoc analyses, and no one is superior on all points, but the most common choice is the Tukey HSD.
If your ANOVA test shows that the means aren’t all equal, your next step is to determine which means are different, to your level of significance. You can’t just perform a series of t tests, because that would greatly increase your likelihood of a Type I error. So what do you do?
John Tukey gave one answer to this question, the HSD (Honestly Significant Difference) test. You compute something analogous to a t score for each pair of means, but you don’t compare it to the Student’s t distribution. Instead, you use a new distribution called the studentized range or q distribution.
Caution: Perform post-hoc analysis only if the ANOVA test shows a p-value less than your α. If p>α, you don’t know whether the means are all the same or not, and you can’t go fishing for unequal means.
You generally want to know not just which means differ, but by how much they differ (the effect size). The easiest thing is to compute the confidence interval first, and then interpret it for a significant difference in means (or no significant difference). You’ve already seen this relationship between a test of significance at the α level and a 1−α confidence interval: the difference is significant at the α level exactly when the 1−α confidence interval for the difference doesn’t contain zero.
You compute that confidence interval similarly to the confidence interval for the difference of two means, but using the q distribution, which avoids the problem of inflating α:

(x̅i − x̅j) ± q(α, r, dfW) · √( (MSW/2) · (1/ni + 1/nj) )

where x̅i and x̅j are the two sample means, ni and nj are the two sample sizes, MSW is the within-groups mean square from the ANOVA table, and q is the critical value of the studentized range for α, the number of treatments or samples r, and the within-groups degrees of freedom dfW. The square-root term is called the standardized error (as opposed to standard error).
Using the studentized range, developed by Tukey, overcomes the problem of inflated significance level that I talked about earlier. If sample sizes are equal, the risk of a Type I error is exactly α, and if sample sizes are unequal it’s less than α: the procedure is conservative. In terms of confidence intervals, if the sample sizes are equal then the confidence level is the stated 1−α, but if the sample sizes are unequal then the actual confidence level is greater than 1−α (NIST 2012 [full citation in “References”, below] section 7.4.7.1).
Usually the comparisons are presented in a table, like this one for the example with frying donuts:
| | x̅i−x̅j | Critical q, q(α, r, dfW) | Standardized error | 95% Conf Interval for μi−μj | Signif at 0.05? |
|---|---|---|---|---|---|
| Fat 1 − Fat 2 | −13 | 3.9597 | 4.1008 | (−29.2, 3.2) | |
| Fat 1 − Fat 3 | −4 | 3.9597 | 4.1008 | (−20.2, 12.2) | |
| Fat 1 − Fat 4 | 10 | 3.9597 | 4.1008 | (−6.2, 26.2) | |
| Fat 2 − Fat 3 | 9 | 3.9597 | 4.1008 | (−7.2, 25.2) | |
| Fat 2 − Fat 4 | 23 | 3.9597 | 4.1008 | (6.8, 39.2) | YES |
| Fat 3 − Fat 4 | 14 | 3.9597 | 4.1008 | (−2.2, 30.2) | |
How do you read the table, and how was it constructed? Look first at the rows. Each row compares one pair of treatments.
If you have r treatments, there will be r(r−1)/2 pairs of means. The “/2” part comes because there’s no need to compare Fat 1 to Fat 2 and then Fat 2 to Fat 1. If Fat 1 is absorbed less than Fat 2, then Fat 2 is absorbed more than Fat 1 and by the same amount.
Now look at the columns. I’ll work through all the columns of the first row with you, and you can interpret the others in the same way.
For this experiment, we had four treatments and dfW from the ANOVA table was 20, so we need q(0.05, 4, 20). Your textbook may have a table of critical values for the studentized range, or you can look up q in an online table such as the one at the end of Abdi and Williams 2010 [full citation in “References”, below], or find it with an online calculator like Lowry 2001a [full citation in “References”, below]. (Most textbooks don’t have a table of q, and the TI calculators can’t compute it.)

Different sources give slightly different critical values of q, I suspect because q is extremely difficult to compute. One value I found was q(0.05, 4, 20) = 3.9597.
In an experiment with unequal sample sizes, the standardized error would vary for comparing different pairs of treatments. But in this experiment, every treatment has six data points, and so the standardized error is the same for every pair of means:
√( (MSW/2) · (1/6+1/6) ) = √( (100.9/2) · (2/6) ) = 4.1008
Interpretation: You’re 95% confident that, on average, a batch of 24 donuts absorbs between 29.2 g less and 3.2 g more of Fat 1 than Fat 2.
The confidence interval for the difference between Fat 1 and Fat 2 goes from a negative to a positive, so it does include zero. That means the two fats might have the same or different absorption, so you can’t say whether there’s a difference.
Caution: It’s generally best not to say that there is no significant difference. Even though that’s literally true, it’s easily misinterpreted to mean that the absorption of the two fats is the same, and you don’t know that. It might be, and it might not be. Stick to neutral language.
On the other hand, when the endpoints of the confidence interval are both positive or both negative, then 0 is not in the interval and we reject the null hypothesis of equality. In this table, only Fats 2 and 4 have a significant difference.
Interpretation: Fats 2 and 4 are not equally absorbed in frying donuts, and we’re 95% confident that a batch of 24 donuts absorbs 6.8 g to 30.2 g more of Fat 2 than Fat 4.
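If you have Python and SciPy available (my assumption, not part of the original procedure), the critical q, the standardized error, and any interval in the table can be reproduced in a few lines; SciPy 1.7 and later provides the studentized range distribution:

```python
# Tukey HSD interval for Fat 1 − Fat 2; a sketch assuming SciPy 1.7+.
from math import sqrt
from scipy.stats import studentized_range

alpha, r, dfW, MSW = 0.05, 4, 20, 100.9
q = studentized_range.ppf(1 - alpha, r, dfW)  # critical q ≈ 3.96
se = sqrt((MSW / 2) * (1/6 + 1/6))            # standardized error ≈ 4.10
diff = 72 - 85                                # x̅1 − x̅2 = −13
print(diff - q * se, diff + q * se)           # ≈ (−29.2, 3.2)
```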
It’s possible to make more complicated comparisons. For instance, with a control group and two treatments you might compare the mean of the control group to the average of the means of the two treatments. Any kind of linear comparison can be done using a procedure developed by Henry Scheffé. A good brief explanation of Scheffé’s method is at NIST 2012 [full citation in “References”, below] section 7.4.7.2.
Tukey’s method is best when you are simultaneously comparing all pairs of means. If you have pre-selected a subset of means to compare, the Bonferroni method (NIST 2012 [full citation in “References”, below] section 7.4.7.3) may be better.
5-year Rates of Return:

| | Financial | Energy | Utilities |
|---|---|---|---|
| | 10.76 | 12.72 | 11.88 |
| | 15.05 | 13.91 | 5.86 |
| | 17.01 | 6.43 | 13.46 |
| | 5.07 | 11.19 | 9.90 |
| | 19.50 | 18.79 | 3.95 |
| | 8.16 | 20.73 | 3.44 |
| | 10.38 | 9.60 | 7.11 |
| | 6.75 | 17.40 | 15.70 |
| x̅ | 11.585 | 13.846 | 8.913 |
| s | 5.124 | 4.867 | 4.530 |

source: morningstar.com via Sullivan 2011 [full citation at https://BrownMath.com/swt/sources.htm#so_Sullivan2011] page C–30 (on CD)
A stock analyst randomly selected eight stocks in each of three industries and compiled the five-year rate of return for each stock. The analyst would like to know whether any of the industries have a different rate of return from the others, at the 0.05 significance level.
Solution: The hypotheses are
H0: μF = μE = μU, all three industries have the same average rate of return
H1: the industries don’t all have the same average rate of return
You can use a normal probability plot to assess normality for each sample; see MATH200A Program part 4. The standard deviations of the three samples are fairly close together, so the requirements are met.
Here is the ANOVA table:
| | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Between groups (or “Factor”) | 97.5931 | 2 | 48.7965 | 2.08 | 0.1502 |
| Within groups (or “Error”) | 493.2577 | 21 | 23.4885 | | |
| Total | 590.8508 | 23 | | | |
The F statistic is only 2.08, so the variation between groups is only about double the variation within groups. The high p-value means you fail to reject H0, and you cannot reach a conclusion about differences among the average rates of return for the three industries.
Since you failed to reject H0 in the initial ANOVA test, you can’t do any sort of post-hoc analysis and look for differences between any particular pairs of means. (Well, you can, but you know in advance that all of the intervals will include zero, meaning that you don’t know whether any particular sector has a different return from any other sector or not.)
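As a quick software check (a sketch assuming SciPy, as before), the same F and p fall out of f_oneway:

```python
# Verify the stock-return ANOVA; a sketch assuming SciPy is installed.
from scipy.stats import f_oneway

financial = [10.76, 15.05, 17.01, 5.07, 19.50, 8.16, 10.38, 6.75]
energy    = [12.72, 13.91, 6.43, 11.19, 18.79, 20.73, 9.60, 17.40]
utilities = [11.88, 5.86, 13.46, 9.90, 3.95, 3.44, 7.11, 15.70]

F, p = f_oneway(financial, energy, utilities)
print(F, p)  # F ≈ 2.08, p ≈ 0.15, so you fail to reject H0
```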
| | Lifetime, hr | x̅ | s |
|---|---|---|---|
| Type A | 407 411 409 | 409 | 2.0 |
| Type B | 404 406 408 405 402 | 405 | 2.2 |
| Type C | 410 408 406 408 | 408 | 1.6 |

source: Spiegel and Stephens 1999 [full citation in “References”, below], pp 378–379
A company makes three types of high-performance CRTs. A random sample finds lifetimes shown in the table above. At the 0.05 level, is there a difference in the average lifetimes of the three types?
Solution: Your hypotheses are
H0: μA = μB = μC, the three types have equal mean lifetime
H1: the three types don’t all have the same mean lifetime
Excel or the TI-83/84 gives you this ANOVA table:
| | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Between groups (or “Factor”) | 36 | 2 | 18 | 4.50 | 0.0442 |
| Within groups (or “Error”) | 36 | 9 | 4 | | |
| Total | 72 | 11 | | | |
p<α, so you reject H0 and accept H1, concluding that the three types don’t all have the same mean lifetime.
Since you were able to reject the null hypothesis, you can proceed with post-hoc analysis to determine which means are different and the size of the difference. Here is the table:
| | x̅i−x̅j | Critical q, q(α, r, dfW) | Standardized error | 95% Conf Interval for μi−μj | Signif at 0.05? |
|---|---|---|---|---|---|
| Type A − Type B | 4 | 3.9508 | 1.0328 | (−0.1, 8.1) | |
| Type A − Type C | 1 | 3.9508 | 1.0801 | (−3.3, 5.3) | |
| Type B − Type C | −3 | 3.9508 | 0.9487 | (−6.7, 0.7) | |
This result might surprise you: although the three means aren’t all equal, you can’t say that any two of the means are unequal. But when you look more closely at the numbers, this doesn’t seem quite so unreasonable.
First, look at the p-value in the ANOVA table: 0.0442 is below 0.05, yes, but it’s not very far below. There’s almost a 4½% chance that we’re committing a Type I error in rejecting H0. Next, look at the confidence interval for μA−μB. While the interval does include 0, it’s extremely lopsided and almost doesn’t include 0.
Though we’re used to thinking of significance as “either it is or it isn’t”, there are cases where the decision is a close one, and this is one of those cases. And the confidence intervals are computed by a different method than the significance test, using a different distribution. Here again, the decision is a close one. So what we have is two close decisions, based on different computations, one falling slightly on one side of the line and the other falling slightly on the other side of the line. It’s a good reminder that in statistics we’re dealing with probabilities, not certainties.
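Because the three sample sizes differ, the standardized error differs from pair to pair, unlike in the donut example. Here’s a sketch (again assuming SciPy 1.7+) that reproduces all three intervals in the table:

```python
# Tukey HSD intervals for the CRT example, with unequal sample sizes;
# a sketch assuming SciPy 1.7+ for studentized_range.
from math import sqrt
from itertools import combinations
from scipy.stats import studentized_range

groups = {"Type A": [407, 411, 409],
          "Type B": [404, 406, 408, 405, 402],
          "Type C": [410, 408, 406, 408]}
MSW, dfW, r, alpha = 4, 9, 3, 0.05
q = studentized_range.ppf(1 - alpha, r, dfW)   # ≈ 3.95

for (gi, xi), (gj, xj) in combinations(groups.items(), 2):
    diff = sum(xi)/len(xi) - sum(xj)/len(xj)   # x̅i − x̅j
    se = sqrt((MSW / 2) * (1/len(xi) + 1/len(xj)))
    print(gi, gj, round(diff - q*se, 1), round(diff + q*se, 1))
```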
The following sections are for students who want to know more than just the bare bones of how to do a 1-way ANOVA test.
Remember that you have to set up hypotheses before you know the data. Before you’ve actually fried the donuts, you have no reason to expect any particular outcome. Specifically, until you have the data you have no reason to think Fats 2 and 4 are any more different than Fats 1 and 4, or any other pair.
Why can’t you collect the data and then select your hypotheses? Because that can put significance on a chance event. For example, a golfer hits a ball and it lands on a particular tuft of grass. The probability of landing on that particular tuft is extremely small, so there’s something different about that particular tuft, right? Obviously not! It’s a logical fallacy to decide what to test after you already have the data.
So if you want to do a 2-sample t test on differences among four fats, you would have to test every pair of fats: 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, 3 and 4. That’s six hypotheses in all.
Well, why not do a 0.05 significance test on each pair of means? Remember what a 0.05 significance level means: you’re willing to accept a 5% chance of a Type I error, rejecting H0 when it’s actually true. But if you test six 0.05 hypotheses on the same set of data, you’re much more likely to commit a Type I error. How much more likely? Well, for each hypothesis there’s a 95% chance of escaping a Type I error, but the probability of escaping a Type I error six times in a row is 0.95^6 = 0.7351. 1−0.7351 = 0.2649, so if you test all six pairs at the 0.05 level, there’s better than one chance in four of getting a false positive, finding a difference between two fats when there’s actually no difference.
Prob. of Type I Error:

| groups | pairs | α = 0.05 | α = 0.01 |
|---|---|---|---|
| 3 | 3 | 0.1426 | 0.0297 |
| 4 | 6 | 0.2649 | 0.0585 |
| 5 | 10 | 0.4013 | 0.0956 |
| 6 | 15 | 0.5367 | 0.1399 |
In general, if you have r treatments, there are r(r−1)/2 pairs of means to compare. If you test each pair at significance level α, the overall probability of a Type I error is 1 − (1−α)^(r(r−1)/2). The table above shows the effective α for various numbers of treatments when the nominal α is 0.05 or 0.01. You can see that testing multiple hypotheses increases your α dramatically. Even with just three treatments, the effective α is almost three times the nominal α. This is clearly unacceptable.
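If you want to check that formula yourself, a short Python loop (nothing beyond the standard language) reproduces the table:

```python
# Effective probability of a Type I error: 1 − (1−α)^(r(r−1)/2)
for r in (3, 4, 5, 6):
    pairs = r * (r - 1) // 2
    print(r, pairs, round(1 - 0.95**pairs, 4), round(1 - 0.99**pairs, 4))
# 3  3 0.1426 0.0297
# 4  6 0.2649 0.0585
# 5 10 0.4013 0.0956
# 6 15 0.5367 0.1399
```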
Why not just lower your alpha? Because as you lower your α you increase your β, the chance of a Type II error. β represents the probability of a false negative, failing to find a difference in fats when there actually is a difference. This, too, is unacceptable.
So you have to find a way to test all the pairs of means at the same time, in one test. The solution is an extension of the t test to multiple samples, and it’s called ANOVA. (If you have only two treatments, ANOVA computes the same p-value as a two-sample t test, but at the cost of extra effort.)
How does the ANOVA procedure compute a p-value? This section shows you the formulas and carries through the computations for the example with fat for frying donuts.
Remember, long ago in a galaxy called Descriptive Statistics, how the variance was defined: find the mean, then for each data point take the square of its difference from the mean. Add up all those squares, and you have SS(x), the sum of squared deviations in x. The variance was SS(x) divided by the degrees of freedom n−1, so it was a kind of average or mean squared deviation. You probably learned the shortcut computational formulas:
SS(x) = ∑x² − (∑x)²/n  or  SS(x) = ∑x² − nx̅²

and then

s² = MS(x) = SS(x)/df  where df = n−1
In 1-way ANOVA, we extend those concepts a bit. First you partition SS(x) into between-treatments and within-treatments parts, SSB and SSW. Then you compute the mean square deviations MSB = SSB/dfB and MSW = SSW/dfW, where the degrees of freedom are dfB = r−1 and dfW = N−r.
Finally you divide the two to obtain your test statistic, F = MSB/MSW, and you look up the p-value in a table of the F distribution.
(The F distribution is named after “the celebrated R.A. Fisher” (Kuzma & Bohnenblust 2005 [full citation at https://BrownMath.com/swt/sources.htm#so_Kuzma2005], 176). You may have already seen the F distribution in computing a different ratio of variances, as part of testing the variances of two populations for equality.)
There are several ways to compute the variability, but they all come up with the same answers and this method in Spiegel and Stephens 1999 [full citation in “References”, below] pages 367–368 is as easy as any:
| | SS | df | MS | F |
|---|---|---|---|---|
| Between groups (or “Factor”) | SSB = ∑njx̅j² − Nx̅² | dfB = r−1 | MSB = SSB/dfB | F = MSB/MSW |
| Within groups (or “Error”)* | SSW = SStot − SSB | dfW = N−r | MSW = SSW/dfW | |
| Total* | SStot = ∑x² − Nx̅² | dftot = N−1 | | |

\* Or, if you know the standard deviations of the samples, SSW = ∑(nj−1)sj² and SStot = SSB + SSW.

where x̅ = ∑njx̅j/N is the overall mean.
You begin with the treatment means x̅j={72, 85, 76, 62} and the overall mean x̅=73.75, then compute
SSB = (6×72² + 6×85² + 6×76² + 6×62²) − 24×73.75² = 1636.5

MSB = 1636.5 / 3 = 545.5
The next step depends on whether you know the standard deviations sj of the samples. If you don’t, then you jump to the third row of the table to compute the overall sum of squares:
∑x² = 64² + 72² + 68² + … + 70² + 68² = 134192

SStot = ∑x² − Nx̅² = 134192 − 24×73.75² = 3654.5
Then you find SSW by subtracting the “between” sum of squares SSB from the overall sum of squares SStot:
SSW = SStot−SSB = 3654.5−1636.5 = 2018.0
MSW = 2018.0 / 20 = 100.9
Now you’re almost there. You want to know whether the variability between treatments, MSB, is greater than the variability within treatments, MSW. If it’s enough greater, then you conclude that there is a real difference between at least some of the treatment means and therefore that the factor has a real effect. To determine this, divide MSB by MSW:
F = MSB/MSW = 5.41
This is the F statistic. The F distribution is a one-tailed distribution that depends on both degrees of freedom, dfB and dfW.
At long last, you look up F=5.41 with 3 and 20 degrees of freedom, and you find a p-value of 0.0069. The interpretation is the usual one: there’s only a 0.0069 chance of getting an F statistic greater than 5.41 (or higher variability between treatments relative to the variability within treatments) if there is actually no difference between treatments. Since the p-value is less than α, you conclude that there is a difference.
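The whole hand computation of this section translates directly into code. Here’s a sketch in Python (SciPy assumed only for the F-distribution lookup at the end):

```python
# The hand computation above, in code; SciPy supplies the F p-value.
from scipy.stats import f

samples = [[64, 72, 68, 77, 56, 95], [78, 91, 97, 82, 85, 77],
           [75, 93, 78, 71, 63, 76], [55, 66, 49, 64, 70, 68]]
r = len(samples)                                  # 4 treatments
N = sum(len(s) for s in samples)                  # 24 data points
grand = sum(sum(s) for s in samples) / N          # x̅ = 73.75

SSB = sum(len(s) * (sum(s)/len(s))**2 for s in samples) - N * grand**2
SStot = sum(x**2 for s in samples for x in s) - N * grand**2
SSW = SStot - SSB                                 # 2018.0

MSB, MSW = SSB / (r - 1), SSW / (N - r)           # 545.5 and 100.9
F_stat = MSB / MSW                                # ≈ 5.41
print(F_stat, f.sf(F_stat, r - 1, N - r))         # p ≈ 0.0069
```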
Usually you’re interested in the contrast between two treatments, but you can also estimate the population mean for an individual treatment. You do use a t interval, as you would when you have only one sample, but the standard error and degrees of freedom are different (NIST 2012 [full citation in “References”, below] section 7.4.3.6).
To compute a confidence interval on an individual mean for the jth treatment, use
df = dfW
standard error = √(MSW/nj)
Therefore the margin of error, which is the half-width of the confidence interval, is
E = t(α/2,dfW) · √(MSW/nj)
Example: Refer back to the fats for frying donuts. Estimate the population mean for Fat 2 with 95% confidence. In other words, if you fried a great many batches of donuts in Fat 2, how much fat per batch would be absorbed, on average?
Solution: First, marshal your data:
sample mean for Fat 2: x̅2 = 85
sample size: n2 = 6
degrees of freedom: dfW = 20 (from the ANOVA table)
MSW = 100.9 (also from the table)
1−α = 0.95
TI-83 or TI-84 users, please see an easy procedure below.
Begin by finding the critical t. Since 1−α = 0.95, α/2 = 0.025. You therefore need t(0.025,20). You can find this from a table:
t(0.025,20) = 2.0860
Next, find the standard error. This is
standard error = √(MSW/nj) = √(100.9/6) = 4.1008
Now you’re ready to finish the confidence interval. The margin of error is
E = t(α/2,df) · √(MSW/nj) = 2.0860×4.1008 = 8.5541
Therefore the confidence interval is
μ2 = 85 ± 8.6 g (95% confidence)
or
76.4 g ≤ μ2 ≤ 93.6 g (95% confidence)
Conclusion: You’re 95% confident that the true mean amount of fat absorbed by a batch of donuts fried in Fat 2 is between 76.4 g and 93.6 g.
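In Python (SciPy assumed, as in the earlier sketches), the same interval takes only a few lines:

```python
# 95% CI for the Fat 2 treatment mean, using dfW and MSW from the
# ANOVA table; a sketch assuming SciPy is installed.
from math import sqrt
from scipy.stats import t

xbar2, n2, MSW, dfW, alpha = 85, 6, 100.9, 20, 0.05
tcrit = t.ppf(1 - alpha/2, dfW)   # t(0.025, 20) ≈ 2.0860
E = tcrit * sqrt(MSW / n2)        # margin of error ≈ 8.55
print(xbar2 - E, xbar2 + E)       # ≈ (76.4, 93.6)
```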
Your TI calculator is set up to do the necessary calculations, but there’s one glitch because the degrees of freedom is not based on the size of the individual sample, as it is in a regular t interval. So you have to “spoof” the calculator as follows.
Press [STAT
] [◄
] [8
] to bring up the TInterval
screen. First I’ll tell you what to enter; then I’ll
explain why.
Now, what’s up with n and Sx? Well, the calculator uses n to compute degrees of freedom for critical t as n−1. You want degrees of freedom to be dfW, so you lie to the calculator and enter the value of n as dfW+1 (20+1 = 21).
But that creates a new problem. The calculator also divides s by √n to come up with the standard error. But you want it to use nj (6) and not your fake n (21). So you have to multiply MSW by dfW+1 and divide by nj to trick the calculator into using the value you actually want.
By the way, why is MSW inside the square root sign? Because the calculator wants a standard deviation, but MSW is a variance. As you know, standard deviation is the square root of variance.
All this fakery achieves the desired result: the confidence interval matches the one that you would have if you computed it by hand.
Lowry 1988 [full citation in “References”, below] chapter 14 part 2 mentions a measure that is usually neglected in ANOVA: η². (η is the Greek letter eta, which rhymes with beta.)
η² = SSB/SStot, the ratio of sum of squares between groups to total sum of squares. For the donut-frying example,

η² = SSB/SStot = 1636.5 / 3654.5 = 0.45

What does this tell you? η² measures how much of the total variability in the dependent variable is associated with the variation in treatments. For the donut example, η² = 0.45 tells you that 45% of the variability in fat absorption among the batches is associated with the choice of fat.
Updates and new info: https://BrownMath.com/stat/