Focus on Alternative and Complementary Therapies
www.pharmpress.com/fact
Focus Alternat Complement Ther © 2005 Pharmaceutical Press
Focus Altern Complement Ther 2005; 10: 275–7
One important decision to be made in any CAM research is the number and type of people you need. Indeed, research ethics forms ask for information about this. Typically there will be a question such as ‘Has the size of the study been informed by a formal statistical power calculation?’ Moreover, in the peer review process many journal editors ask for evidence of an a priori sample size calculation. Confidence intervals (CIs) are equally important, because they allow the clinical interpretation of quantitative results. This article discusses why these issues are important.
There is very good reason to ask for justification of your intended sample size. Any quantitative study should have a clear research objective and associated research question. It is unethical to enter people into a study that has little chance of answering the research question. You might therefore think that you should recruit as many people as possible within the resource constraints of your project. This is not good enough. Firstly, you would have no guarantee that this number was sufficiently large to have a good chance of answering your research question. Secondly, it can be unethical to run a study that is too large. Imagine that you are running a randomised controlled trial comparing reflexology with foot massage for anxiety in menopausal women. Before running the trial you should have sufficient belief that reflexology is (or at least may be) superior to massage, but insufficient evidence to demonstrate this scientifically. No person entering a study should knowingly be given inferior treatment; in fact you would hope that some of them would receive superior treatment (in this case reflexology). The people you recruit should understand this. Suppose that reflexology was indeed superior to foot massage and that you ran a study that had twice as many people as was needed. It is highly likely that the second half of your study would be unethical. An analysis of the data from the first half of your study should have been adequate to answer the question and then you would no longer offer the massage when there was evidence of the superiority of reflexology. It is also an unethical waste of resources to run an over-sized study. However, over-sized studies are rare in medical research and, owing to a scarcity of resources, virtually unheard of in CAM research. You therefore need a study that is big enough, but not too big.
The philosophy behind the analysis of such a trial is something like this. You assume that there is no difference on average in the responses of the two groups and you will believe this unless there is ‘compelling’ evidence against it. You measure anxiety in each group after the courses of treatments. You are hoping that the average response in the reflexology group will be better than that in the massage group, but such a difference could happen by chance. You need to decide whether an observed average difference between the two groups is evidence of a real treatment difference or whether actually there is no real difference and the apparent effect is a result of chance. You do this with a suitable statistical test. After doing the test you will decide either that there is evidence of a difference between the two treatments or that there is not. You cannot be sure! If you want to be conservative and only present a positive finding if the evidence is very strong then you risk the mistake of incorrectly presenting a negative finding. On the other hand, if you are less cautious you run the risk of mistakenly deciding that there is evidence of a difference when there is not. The risks of these two mistakes are expressed as probabilities. It is usual to set the chance of making the mistake of deciding that there is a difference when there is not at 5%, i.e. 1 in 20. This means that if 20 people ran a study like yours, and there was no difference on average between the effects of reflexology and foot massage, you would expect, on average, one of the 20 to conclude that one treatment was superior to the other. This value (e.g. 5%) is called the level of the test and determines the interpretation of the P value associated with the statistical test. Most people get this part right in the analysis. Fewer people consider the power of a study.
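The ‘1 in 20’ figure can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that the 20 studies are independent, that each uses a 5% level and that there is truly no difference between the treatments. The function names are invented for this example.

```python
def expected_false_positives(n_studies: int, level: float = 0.05) -> float:
    """Average number of studies that wrongly report a difference
    when the treatments really are the same."""
    return n_studies * level


def chance_of_any_false_positive(n_studies: int, level: float = 0.05) -> float:
    """Chance that at least one of n independent studies wrongly
    reports a difference when the treatments really are the same."""
    return 1.0 - (1.0 - level) ** n_studies


if __name__ == "__main__":
    # On average one study in 20 gives a false positive result.
    print(expected_false_positives(20))
    # The chance that at least one of the 20 does so is about 0.64.
    print(round(chance_of_any_false_positive(20), 2))
```

So although each study has only a 5% risk, it is more likely than not that at least one of 20 such studies would report a spurious difference.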
The power is the chance that you will find an effect (or difference) that is really there. Clearly, you want this to be a high probability. A commonly recommended minimum value is 80%; studies with a power of less than 60% are of very poor quality. Some ethics committees now ask for 90% power. You cannot retrospectively get the power right – you have to consider this early on in the design process.
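To give a feel for what an a priori calculation involves, here is a minimal sketch for comparing two group means. It uses the standard normal-approximation formula with equal group sizes and a two-sided test; exact methods based on the t distribution give slightly larger numbers. The 8-point difference and the standard deviation of 10 are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(difference: float, sd: float,
                          level: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of people needed in EACH of two groups to
    detect a given difference in means (two-sided test, equal groups,
    normal approximation)."""
    if difference <= 0 or sd <= 0:
        raise ValueError("difference and sd must be positive")
    z_level = NormalDist().inv_cdf(1 - level / 2)   # about 1.96 for a 5% level
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = 2 * ((z_level + z_power) * sd / difference) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Assumed values: smallest difference worth detecting = 8, SD = 10.
    print(sample_size_per_group(8, 10, power=0.80))  # 25 per group
    print(sample_size_per_group(8, 10, power=0.90))  # 33 per group
```

Note how moving from 80% to 90% power raises the requirement by about a third, and how halving the difference you wish to detect would roughly quadruple it.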
If your result is negative, i.e. you do not find compelling evidence of a difference, you cannot conclude that there is no difference. ‘Absence of evidence is not evidence of absence’. All you can say is that you have failed to find one. If your study had low power then this is very possibly the reason; you have failed to find a difference because your study was not designed to find it. Put like that it sounds foolish, and so it is.
Some researchers acknowledge that their study is too small (i.e. under-powered) and try to justify this by labelling it as a ‘pilot’ study. The practice of piloting a questionnaire before carrying out a full sample survey is quite well understood. The purpose is to detect any problems in the design and to rectify them before undertaking the large study. The literature is much less clear about what a pilot study is in general. Any study (including a pilot study) should have clear aims and research questions, and the design and size should both be compatible with these. Sadly, this is not always the case for studies labelled as ‘pilot’. Moreover, researchers, even for pilot studies, rely far too much on P values and do not always interpret results correctly.
While P values are important, they are of limited clinical value in interpreting results. Suppose that the trial of reflexology gives a statistically significant result, i.e. ‘there is a significant difference (P < 5%)’. This information does not tell you very much at all. What you really want to know is the nature of the difference: which treatment was more effective and by how much. This is best done with a CI. Suppose that in our trial of reflexology the average anxiety score in the reflexology group is 26 and the average in the foot massage group is 34. The difference is 8 points. So our best estimate of the average difference in general (in the population) is 8. But a CI gives us more information. Suppose the 95% CI for this difference is 4 to 12. This means that we are 95% sure that in the population of women with anxiety in menopause on average they would score less following reflexology by an amount between 4 and 12 on that scale. We do not know where in this range the true value is likely to be: it could be as low as 4 or as high as 12. (It could be outside this range, but this is unlikely.)
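For readers who want to see where such an interval comes from, the sketch below computes a 95% CI from an observed difference and its standard error, using the normal approximation. The standard error of 2.04 is not taken from any trial; it is an assumed value chosen so that the interval comes out at roughly 4 to 12. The function name is hypothetical.

```python
from statistics import NormalDist


def ci_for_difference(difference: float, standard_error: float,
                      confidence: float = 0.95) -> tuple[float, float]:
    """Confidence interval for a difference in means
    (normal approximation: estimate plus or minus z times SE)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * standard_error
    return difference - margin, difference + margin


if __name__ == "__main__":
    # Observed difference of 8 points; SE of 2.04 is an assumption.
    lower, upper = ci_for_difference(8.0, 2.04)
    print(round(lower, 1), round(upper, 1))  # 4.0 12.0
```

A larger study would have a smaller standard error and therefore a narrower interval around the same estimate.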
CIs are extremely important in the interpretation of studies. In fact, without loss of useful information we could scrap reporting of P values if CIs were presented, but the converse is certainly not true. All papers should really present CIs. It is the CI that allows clinical interpretation of the findings. The P value merely allows you to say that the result is, or is not, statistically significant.
In order to illustrate this consider Figure 1. The five vertical lines represent 95% CIs for the treatment effects from five studies, A, B, C, D and E. The solid horizontal line is zero – no difference. The dashed horizontal line represents the size of the difference deemed to be clinically important. Trials A, B and C all give a statistically significant difference at the 5% level because they do not include the value zero. However, the interpretations are different. For trial A the 95% CI lies entirely above the dashed line, so we can be reasonably confident that the population value is clinically important. The evidence from trial B is slightly less compelling because the 95% CI contains some values that are too small to be clinically important. On the other hand, the 95% CI for trial C lies entirely below the dashed line and although the difference is statistically significant the actual value may be too small to be of clinical importance. The 95% CIs for trials D and E both contain zero, so their P values will be greater than 5%. However, they too represent different scenarios. The CI for trial D contains no values deemed to be of clinical importance, so this is probably a true negative finding. However, the situation for trial E is quite different. In this case the CI contains zero, some negative values and some values large enough to be clinically important. This is the sort of CI that is obtained from an under-powered study and the result is largely inconclusive. Sadly, such results are still reported: worse still, they are sometimes interpreted incorrectly.
Figure 1. 95% confidence intervals for the treatment effect from five different trials. Differences for A, B and C are statistically significant at the 5% level; D and E are not. However, the interpretation of the results is different in each case.
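The reasoning applied to the five trials can be written as a simple set of rules. The sketch below is one possible rendering, not a standard procedure. It assumes that positive values mean benefit, that a harmful difference of the same size would also matter clinically, and it uses made-up intervals and a made-up clinically important difference of 5, because the figure gives no numerical values.

```python
def interpret_ci(lower: float, upper: float, important: float) -> str:
    """Interpret a 95% CI for a treatment difference against zero and
    against the smallest clinically important difference."""
    if lower > upper or important <= 0:
        raise ValueError("need lower <= upper and important > 0")
    if upper < 0:
        return "significant, favours the comparison treatment"
    if lower > 0:  # interval excludes zero, so P < 5%
        if lower >= important:
            return "significant and clinically important"            # like A
        if upper < important:
            return "significant but too small to be important"       # like C
        return "significant, clinical importance uncertain"          # like B
    # interval contains zero, so P > 5%
    if upper < important and lower > -important:
        return "not significant, probably a true negative"           # like D
    return "not significant, inconclusive"                           # like E


if __name__ == "__main__":
    # Hypothetical intervals arranged to mimic trials A to E.
    trials = {"A": (6, 12), "B": (2, 9), "C": (1, 4),
              "D": (-2, 3), "E": (-4, 10)}
    for name, (lo, hi) in trials.items():
        print(name, interpret_ci(lo, hi, important=5))
```

The same P value threshold separates A, B and C from D and E, but only the position of the whole interval distinguishes the five interpretations.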
CIs are essential in equivalence studies where the aim is to show no difference rather than to show a difference. As more and more treatments become available these designs become more important. In CAM you may wish to show that a therapy is ‘equivalent’ in outcome to a more conventional treatment. Such studies need very careful design, but in order to show no difference the 95% CI must contain no clinically important values: in fact it must contain only values of no importance. The cut-off value for no importance should be lower than that for clinical importance, but you are aiming for CIs like those for studies C and D. Equivalence studies generally require larger samples than superiority trials. For these designs it is most certainly not sufficient to report a P value greater than 5%.
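The requirement for an equivalence study can be expressed as a single check: the whole 95% CI must lie inside a prespecified range of differences that are of no importance. The sketch below assumes a symmetric range and uses invented numbers; the margin of 3 and the function name are illustrative only, and real equivalence designs need more careful specification than this.

```python
def shows_equivalence(lower: float, upper: float, margin: float) -> bool:
    """True if the whole CI lies strictly inside (-margin, +margin),
    i.e. it contains only differences regarded as unimportant."""
    if lower > upper or margin <= 0:
        raise ValueError("need lower <= upper and margin > 0")
    return -margin < lower and upper < margin


if __name__ == "__main__":
    margin = 3  # assumed cut-off for 'no importance'
    print(shows_equivalence(-1.5, 2.0, margin))  # True: narrow CI around zero
    # Wide CI from an under-powered study: P > 5%, yet not equivalent.
    print(shows_equivalence(-4.0, 10.0, margin))  # False
```

The second example has a P value greater than 5% but fails the check, which is why a non-significant result alone cannot be taken as evidence of equivalence.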
The discussion here has focused on trials, but the principles apply to all quantitative studies. Moreover, they apply to all outcomes in studies. Even in a trial there will be some outcomes that are deemed more important than others. While a trial may have adequate power for these primary outcomes it may not have adequate power for secondary, less important outcomes. The prespecification of these primary outcome measures is important, otherwise researchers can ‘cherry-pick’ the results that have lower P values, and these could well be just chance results. When evaluating a paper the following should always be considered:
In summary, a quantitative study without adequate scientific justification of sample size is usually of poor quality. Any results that are interpreted solely on the basis of P values are relatively uninformative and also unreliable. If we are to make progress in evaluating evidence about CAM, researchers must be rigorous in their use of statistical methods.
The following books contain more detail on this subject: