In medical research papers, the selection of appropriate statistical methods is one of the pivotal premises for ensuring the quality of papers and the credibility of their results [1,2,3]. To perform the statistical analysis of quantitative data correctly, two key points should be considered. One is to identify the type of experimental design correctly, and the other is to check whether the data meet the assumptions of parametric tests [2,3,4]. Otherwise, the method may be misused, and different or even opposite conclusions may be drawn from the same data.
As one of the most commonly used statistical methods in medical research papers, the t-test can be divided into the one-sample t-test and the two-sample t-test [3, 4]. Because each form compares at most two means, the t-test is inappropriate for comparing means among multiple groups (three or more). Concretely, the one-sample t-test is used to compare one group’s average value to a single number (a known population mean, for example, the norm). The two-sample t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups. Furthermore, there are two types of two-sample t-test [3, 4]. One is the independent sample t-test (group t-test), which is performed when the samples are drawn from independent populations. The other is the paired (or correlated) sample t-test, which is used when each observation in one group is paired with a related observation in the other group, i.e., the samples consist of matched pairs of similar units, or when there are repeated measures.
Note that the t-test is a parametric test. The assumptions of parametric tests, including independence, normality, and homogeneity of variance, must be met to ensure the correct use of the t-test [3, 4]. In addition, according to the theoretical derivation of the t-test, it can only be applied to quantitative data from a single factor design, so it is inappropriate to perform the t-test for a multifactor design. For example, the t-test is unsuitable when there are multiple independent variables/factors (such as gender and different types and dosages of drugs) and the groups must be compared after controlling for the simple effects of each independent variable.
As journal editors and reviewers, we often find that some authors blindly use the t-test, especially the independent sample t-test (group t-test), to process quantitative data without checking its prerequisites or considering the type of experimental design. To improve the quality of statistical analysis in medical research papers, we summarize the five most common misuses of the t-test found in the process of reviewing manuscripts and analyze them with examples. We hope that this summary provides practical help in improving data analysis ability.
It is particularly noted that all the examples herein are artificially constructed for the purpose of illustration and do not represent actual clinical design and data. They are only for reference in the selection of statistical analysis methods.
Normality tests, including the Shapiro-Wilk test for small sample sizes (n ≤ 50) and the Kolmogorov-Smirnov test for large sample sizes (n > 50), usually require the original data. However, there is a common and concise way to judge whether the data follow a normal distribution, namely comparing the mean of the data with the corresponding standard deviation (SD). If the mean is much smaller than its standard deviation, the data may not follow a normal distribution, so the t-test may be inappropriate. In this case, it is better to perform the t-test after an appropriate variable transformation (such as a logarithmic or rank transformation) or to apply a nonparametric test to the original data.
A researcher adopts the independent sample t-test to compare the demographic data (age) between the experimental group and the control group. Table 1 provides the statistical results (see Additional file 1 for the original data). Is this appropriate?
Table 1 Statistical results of age between experimental group and control group

The data are quantitative data for two independent samples under a single factor design. However, Table 1 shows that the standard deviation is larger than the mean in the control group. Thus, age in the control group may not follow a normal distribution. As a result, it may be inappropriate to analyze these data with the independent sample t-test directly.
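The mean-versus-SD screen described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the age values below are hypothetical (they are not the data of Table 1 or Additional file 1), and the helper names `sd_exceeds_mean` and `implied_fraction_below_zero` are invented for this sketch. The second helper shows why the screen is informative for a nonnegative variable such as age: it reports how much of a normal distribution with the sample's mean and SD would fall below zero.

```python
from statistics import NormalDist, mean, stdev


def sd_exceeds_mean(values):
    """Return True when the sample SD is larger than the sample mean."""
    return stdev(values) > mean(values)


def implied_fraction_below_zero(values):
    """Share of a normal distribution with the sample's mean and SD lying below 0."""
    return NormalDist(mean(values), stdev(values)).cdf(0)


# Hypothetical ages in years, invented for illustration only.
control_age = [1, 2, 2, 3, 5, 8, 40, 75]
experimental_age = [31, 35, 38, 40, 42, 45, 47, 50]

for name, ages in [("control", control_age), ("experimental", experimental_age)]:
    flagged = sd_exceeds_mean(ages)
    below_zero = implied_fraction_below_zero(ages)
    print(f"{name}: SD > mean? {flagged}; implied share below 0: {below_zero:.3f}")
```

When the SD equals the mean, a normal distribution would place about 16% of its mass below zero, which is impossible for age, so a flagged group is a reason to inspect the raw data with a formal normality test rather than a final verdict.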
In medical research, a before-after study in the same patients is often used to assess the effect of a treatment factor (such as a drug or an operation). This is a typical self-matching experimental design, which does not meet the independence assumption of the independent sample t-test. In this case, the paired sample t-test is more suitable if the paired differences are normally distributed. Otherwise, the nonparametric test for two related samples (Wilcoxon signed rank test) is recommended.
To explore the effect of a certain treatment scheme on the scars of burn patients, the scar area of the patients is measured 1 day before and 1 week after treatment. The independent sample t-test is then used to compare the scar area of the patients before and after treatment. Table 2 shows the statistical results (see Additional file 2 for the original data). Is this appropriate?
Table 2 Evaluation of scar area of burn patients before and after treatment

Clearly, the independence assumption of the independent sample t-test is not satisfied under this study protocol, so the independent sample t-test is inappropriate for these data.
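To illustrate how much the choice between the two tests can matter for before-after data, the following minimal sketch computes both t statistics on the same measurements. The scar areas are hypothetical values invented for this example (not the data of Table 2), and the function names `paired_t` and `pooled_t` are assumptions of the sketch. Only the test statistics and degrees of freedom are computed; p-values would come from a statistical package.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(before, after):
    """Paired sample t statistic and degrees of freedom, based on within-patient differences."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


def pooled_t(x, y):
    """Independent sample (pooled-variance) t statistic and degrees of freedom."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny)), nx + ny - 2


# Hypothetical scar areas (cm^2) for five patients, invented for illustration only.
before = [20.0, 35.0, 50.0, 65.0, 80.0]
after = [18.0, 32.0, 48.0, 61.0, 77.0]

t_paired, df_paired = paired_t(before, after)
t_pooled, df_pooled = pooled_t(before, after)
print(f"paired:      t = {t_paired:.2f}, df = {df_paired}")
print(f"independent: t = {t_pooled:.2f}, df = {df_pooled}")
```

In this constructed example every patient improves by a few square centimeters, but patients differ greatly from one another. The paired statistic removes the between-patient variation, whereas the independent sample statistic leaves it in the error term and so hides a consistent within-patient change.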
The single factor k-level (k ≥ 3) independent sample design is widely used in medical experiments. For example, to investigate how a physiological index differs across disease types, we measure the index in patients with k (k ≥ 3) disease types. In this case, we need to compare the means of k independent samples and determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. Because repeated use of the independent sample t-test increases the probability of a type I error, one-way analysis of variance (ANOVA) is more suitable here. If the one-way ANOVA returns a statistically significant result, we reject the null hypothesis in favor of the alternative hypothesis, which is that at least two group means are statistically significantly different from each other. To determine which specific groups differ from each other, we further need to perform post hoc multiple comparisons. If we want to compare each group with the control group, Dunnett’s test is recommended.
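The inflation of the type I error mentioned above can be made concrete with a short calculation. The sketch below assumes, for simplicity, that the individual tests are independent and each is run at the same significance level, in which case the probability of at least one false positive among m tests is 1 − (1 − α)^m. Pairwise t-tests that share groups are not strictly independent, so the figures are approximations. The function names are invented for this example.

```python
from math import comb


def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one type I error among n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


def n_pairwise_comparisons(k):
    """Number of distinct pairs among k groups."""
    return comb(k, 2)


for k in (3, 4, 5):
    m = n_pairwise_comparisons(k)
    rate = familywise_error_rate(m)
    print(f"{k} groups: {m} pairwise t-tests, familywise error rate about {rate:.3f}")
```

With three groups and all three pairwise t-tests at α = 0.05, the approximate familywise error rate is already about 0.14 rather than 0.05.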
For a new antihypertensive drug, we hope to compare the antihypertensive effect of high- and low-dose groups with that of the placebo group. The independent sample t-test is adopted to compare the low-dose group with the placebo group and the high-dose group with the placebo group, respectively. The statistical results are presented in Table 3 (see Additional file 3 for the original data). Is this appropriate?
Table 3 Statistical comparison of antihypertensive effects

These data are typical quantitative data of multigroup independent sample design, also known as the single factor design with multiple levels, and the number of levels is 3. Thus, it is not appropriate to perform the independent sample t-test directly for comparisons with control group.
According to the study design, selecting “Analyze➔Compare Means➔One-Way ANOVA…” and ticking “Dunnett” in the “Post Hoc Multiple Comparisons” dialog box in SPSS, we perform one-way ANOVA and Dunnett’s post hoc test to compare each dose group with the placebo group. The results indicate that there is a statistically significant difference between groups as determined by one-way ANOVA (F = 24.728, p < 0.001). The results of multiple comparisons show that the difference between low-dose group and placebo group is not statistically significant (p = 0.069), which is completely contrary to the results of the independent sample t-test (Table 3). The difference between high-dose group and placebo group is still statistically significant (p < 0.001).
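For readers who want to see what the omnibus test computes, the F statistic of a one-way ANOVA can be sketched directly from its definition (between-group mean square divided by within-group mean square). The blood pressure reductions below are hypothetical values invented for illustration and are unrelated to Table 3 or to the F value reported above. The function name `one_way_anova_f` is an assumption of this sketch, and the p-value and Dunnett's post hoc comparisons are left to a statistical package such as SPSS.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of independent groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical reductions in blood pressure (mmHg), invented for illustration only.
placebo = [2, 4, 6]
low_dose = [5, 7, 9]
high_dose = [10, 12, 14]

f_stat, df_between, df_within = one_way_anova_f([placebo, low_dose, high_dose])
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A single F test on all groups keeps the overall type I error at the nominal level, and the post hoc procedure then controls the error rate across the comparisons with the control group.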
To understand the effect of two or more independent variables upon a single dependent variable, completely randomized factorial design is often used in medical experiments or clinical trials. A factor is a variable that is controlled and varied during the course of an experiment. In a factorial design, there are two or more factors with multiple levels that are crossed, e.g., two dose levels of drug A and two levels of drug B can be crossed to yield a total of four treatment combinations. Factorial designs offer certain advantages over conventional designs. The design can examine not only the differences among the levels of each factor, but also the interactions among the factors. For quantitative data of factorial design, direct multiple use of independent sample t-test will not only increase the probability of type I error, but also lead to wrong conclusions when there is interaction between various factors. A more appropriate method at this point is to perform ANOVA of factorial design. Taking two factors of independent samples as an example, it is also called the two-way ANOVA of independent samples.
To study the difference in pain score between patients with different disease types (burn, trauma, and arthritis) after receiving two treatment schemes (named scheme A and scheme B), ten patients were recruited for each type of disease and randomly assigned to the possible treatment schemes with equal probability. For the measured pain scores, independent sample t-tests are performed repeatedly to compare the difference between disease types and treatment schemes. Table 4 shows the statistical results (see Additional file 4 for the original data). Is this appropriate?
Table 4 Comparison of pain scores of patients with three disease types and two treatment schemes

This study involves two factors. One is treatment factor with two levels, scheme A and scheme B, while the other is disease type factor with three levels, burns, trauma, and arthritis. Since the patients in each level combination are different, the samples are independent. Therefore, this study belongs to the 2 × 3 factorial design, and the ANOVA of factorial design should be performed for comparative analysis. Firstly, the interaction effect between the factors should be tested. If the interaction effect is not statistically significant, the main effect of each factor can be analyzed. Otherwise, the individual effect of each factor needs to be analyzed separately.
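The order of testing described above (interaction first, then main effects) relies on the sums-of-squares decomposition of a two-way ANOVA. The following minimal sketch computes the three F statistics for a balanced design with independent samples. The pain scores are hypothetical values invented for this example (two patients per cell, not the data of Table 4), and the function name `two_way_anova_f` and the dictionary layout are assumptions of the sketch. It handles only balanced designs and reports F statistics, not p-values.

```python
from statistics import mean


def two_way_anova_f(cells):
    """F statistics for a balanced two-factor design with independent samples.

    cells maps (level_a, level_b) to a list of observations of equal length.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # observations per cell
    if any(len(v) != n for v in cells.values()):
        raise ValueError("this sketch only handles balanced designs")
    grand = mean([y for v in cells.values() for y in v])
    mean_a = {a: mean([y for b in levels_b for y in cells[(a, b)]]) for a in levels_a}
    mean_b = {b: mean([y for a in levels_a for y in cells[(a, b)]]) for b in levels_b}
    ss_a = len(levels_b) * n * sum((m - grand) ** 2 for m in mean_a.values())
    ss_b = len(levels_a) * n * sum((m - grand) ** 2 for m in mean_b.values())
    ss_cells = n * sum((mean(v) - grand) ** 2 for v in cells.values())
    ss_ab = ss_cells - ss_a - ss_b  # interaction sum of squares
    ss_error = sum((y - mean(v)) ** 2 for v in cells.values() for y in v)
    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    df_ab, df_error = df_a * df_b, len(cells) * (n - 1)
    ms_error = ss_error / df_error
    return {
        "A": ss_a / df_a / ms_error,
        "B": ss_b / df_b / ms_error,
        "AxB": ss_ab / df_ab / ms_error,
    }


# Hypothetical pain scores, invented for illustration only.
scores = {
    ("scheme A", "burn"): [4, 6],
    ("scheme A", "trauma"): [5, 7],
    ("scheme A", "arthritis"): [2, 4],
    ("scheme B", "burn"): [2, 4],
    ("scheme B", "trauma"): [5, 7],
    ("scheme B", "arthritis"): [6, 8],
}

f_values = two_way_anova_f(scores)
for effect, f_stat in f_values.items():
    print(f"F for {effect}: {f_stat:.2f}")
```

In this constructed example, scheme A gives lower scores for arthritis but higher scores for burns. The two scheme means are therefore similar and the main effect of treatment is small, while the interaction term captures the pattern that repeated t-tests on pooled groups would miss.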
Repeated measurement designs are commonly used in longitudinal studies in medical research, for example to follow the dynamic changes over time of temperature, blood pressure, and other indicators. The purpose is usually to detect whether the indicator values differ significantly between time points. In practice, many authors calculate the mean and standard deviation at each time point and then carry out the independent sample t-test repeatedly between time points. However, a repeated measures design uses the same subjects under every condition of the research, including the control. Thus, the measurements at different time points are correlated with each other, that is, the samples at different time points are not independent. In this case, the appropriate analysis method is repeated measures ANOVA. If there is another factor with independent samples, two-way ANOVA with mixed samples is recommended.
To study the difference in a certain indicator at different postoperative time points, 10 patients (5 males and 5 females) are enrolled in the study, and the indicator is measured in each of them at 1, 2, 4, and 8 weeks after the operation. The researchers use the independent sample t-test to analyze the difference in this indicator between time points. The statistical results are presented in Table 5 (see Additional file 5 for the original data). Is this appropriate?
Table 5 Comparison of a certain indicator at different postoperative time points

According to the experimental process of this study, the indicator of each patient is measured repeatedly at 1 week, 2 weeks, 4 weeks, and 8 weeks after the surgery, so the postoperative time serves as a repeated measurement factor with four levels. In addition, gender is another factor, with independent samples at each level. Thus, analyzing pairs of time points with the independent sample t-test breaks up the overall design and fails to take into account the fact that the data on the same subject at different time points are not independent.
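The consequence of ignoring the correlation between time points can be illustrated with a small sketch that computes the F statistic for the time effect in two ways: once with the repeated measures error term (subject effects removed) and once as if the measurements at each time point came from independent groups. The data are hypothetical values for three subjects invented for this example, not the data of Table 5. The function name `time_effect_f` is an assumption of the sketch, and the between-subject factor (gender) of the mixed design is deliberately left out to keep the example short.

```python
from statistics import mean


def time_effect_f(data):
    """Return (F_repeated_measures, F_independent) for the time effect.

    data[i][j] is the measurement of subject i at time point j (no missing values).
    """
    n, t = len(data), len(data[0])
    grand = mean([y for row in data for y in row])
    subject_mean = [mean(row) for row in data]
    time_mean = [mean([row[j] for row in data]) for j in range(t)]
    ss_time = n * sum((m - grand) ** 2 for m in time_mean)
    # Residual after removing both subject and time-point effects.
    ss_error_rm = sum(
        (data[i][j] - subject_mean[i] - time_mean[j] + grand) ** 2
        for i in range(n)
        for j in range(t)
    )
    # Error term when the time points are (wrongly) treated as independent groups.
    ss_error_ind = sum(
        (data[i][j] - time_mean[j]) ** 2 for i in range(n) for j in range(t)
    )
    ms_time = ss_time / (t - 1)
    f_rm = ms_time / (ss_error_rm / ((t - 1) * (n - 1)))
    f_ind = ms_time / (ss_error_ind / (n * t - t))
    return f_rm, f_ind


# Hypothetical indicator values at 1, 2, 4, and 8 weeks, invented for illustration only.
measurements = [
    [10, 11, 13, 14],  # subject 1
    [20, 22, 23, 25],  # subject 2
    [30, 31, 32, 35],  # subject 3
]

f_rm, f_ind = time_effect_f(measurements)
print(f"repeated measures F = {f_rm:.2f}, independent-groups F = {f_ind:.2f}")
```

In this constructed example every subject rises steadily over time, but the subjects start from very different levels. Treating the time points as independent groups leaves the large between-subject variation in the error term, so a consistent time trend goes undetected.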
In summary, to effectively reduce the misuse of statistical methods and improve the credibility of statistical results, it is necessary to carefully consider the type of experimental design, the distribution characteristics of the data, and other relevant factors. Concretely, we should meticulously review the preconditions of each statistical analysis technique and select the appropriate method before analyzing quantitative data. In this paper, the five most common misuses of the t-test are summarized, the causes of each misuse are analyzed, and more appropriate statistical methods, with their implementation in SPSS, are offered. We believe that this paper can be helpful to the writing and editing of biomedical research papers.
All artificially constructed data are presented in the tables and additional files.