By Max Korbmacher
If you want to join Max as he reads through this blogpost, check out this YouTube video.
There are studies of varying quality. Obviously, you would want to take your information only from the high-quality studies. But how do you tell them apart? Here are some tips on what to look out for when assessing what’s a good study (and what’s a bad one).
As a psychology student, sooner or later, you will hear about the replication crisis (see e.g., Schimmack, 2020). To sum up some core evidence, we begin where it hurts: the Open Science Collaboration (OSC, 2015) found a pooled replication rate of 36% when looking at 100 social and cognitive psychology studies. Effect sizes of the replicated findings were reduced by 51%. When differentiating between psychological fields in the OSC data and using p<.05 in the replication study as the criterion for successful replication, 25% of the social and 50% of the cognitive psychology findings replicated. Additionally, Schimmack (2020) estimated the replicability of social psychology to be between 20% and 45%.
Figure 1. Replicability in Psychology: gathered evidence from the last decade.
Combining these 100 OSC replications with 207 other replications from various projects conducted over the past decade, the picture looks a little better: 64% of the studies replicated, but effect sizes were reduced by 32% (Nosek et al., 2021).
The outstanding question is: which findings can be trusted? This is important, because we would like to do research that produces reliable knowledge and interventions that actually work. For example, it would be nice to be sure that concentrated efforts to reduce racism are actually making a difference.
There are many different ways to assess the quality of a given study and a one-size-fits-all guide would be too simplistic. However, as a start, here is a list with some useful points to look out for.
Inspect the Study’s Basic Idea
Most articles are built upon one or several basic ideas. Those might be causal effects, possibly articulated in pre-existing theory, or simply curiosity to observe an expected phenomenon. Many ideas which were thought of as weird have led to scientific breakthroughs (pick and choose from the Ig Nobel Prizes). Others may just be weird ideas. If the idea is too strange to be true, you might be right to get suspicious. For example, seeing into the future is presumably a highly unlikely skill. Still, studies have been published (e.g., Bem, 2011) that seemed to provide empirical evidence for psychic skills in university students. Luckily, some researchers got suspicious and questioned the published evidence (Francis, 2012; Schimmack, 2012). The results of investigating Bem’s findings were saddening, as they suggest the use of questionable research practices (QRPs). Bem simply reported too many findings just below the magical p=.05 line, which is statistically extremely improbable (Schimmack, 2012). It is great to work with bold claims, but some things are too good to be true. Indeed, recent evidence suggests that the more surprising the original study’s findings, the less likely they are to replicate (Hoogeveen, Sarafoglou & Wagenmakers, 2020).
A good article should give a fair representation of the literature and allow for diverging ideas and discussion. More data is often better for getting a broader understanding of the field, but sometimes authors fail to report important studies. That might give a skewed representation of the field, which will also influence the perception of the study’s basic idea (point 1) and the following points. This makes clear that it is always better to read broadly when evaluating a study than to blindly trust the single study at hand.
Clearly, there are phenomena which are new, meaning there might not yet be much literature on the specific matter. In that case, it is even more important to clarify concepts, definitions and operationalization and then discuss the context of the study. A perfect example of where this did not work out is a paper which became known as the “Hot-Crazy Matrix paper”. Here, the authors based their idea on a YouTube video which propagates sorting women into the hot-crazy matrix and men into the cute-rich matrix (see figure 2).
Figure 2. Sexist matrices. Left panel: the “hot-crazy matrix”. Right panel: the “cute-rich matrix”.
This categorisation into the matrices was presented as a universal phenomenon. As Bik (2021) notes, besides other issues, this particular representation gives the impression that the hot-crazy matrix is scientific consensus, although there is no scientific evidence supporting the matrices. Hence, checking the literature in which the paper at hand is embedded can pay off. However, just because there are many papers supporting a certain claim does not mean that they are of high quality. Replications of an effect in multiple studies can, however, give some more certainty if the studies’ methods are robust. In this sense, it can be good to check who replicated the effect, and it is preferable to have different labs produce replications of the same effect. The reason is that self-replications are problematic when the same QRPs are applied in the replication as in the original study (here an example).
The hypotheses are incredibly important, as they tell you about the researchers’ intentions of what they wanted to test. Badly set-up hypotheses should definitely be a warning sign. Bad hypotheses are unclear or cannot be measured (e.g., “Sometimes Ryan and Cloe differ in liking eating eggs.”). A good hypothesis is specific and testable (e.g., “In the morning, Ryan likes eating eggs better than Cloe, and in the evening, Cloe likes eating eggs more than Ryan.”).
Another important point is that hypotheses must not be developed from the same data on which they are tested (so: develop the hypothesis >> then test it). Anything else qualifies as a QRP. A hypothesis formulated in advance is what allows you to be genuinely surprised if the data suggest rejecting your null hypothesis (p<.05); a hypothesis set up after the results are known cannot surprise anyone. Feynman (1998) explained this nicely when describing a conversation with a colleague about exactly this problem:
[I]t’s a general principle of psychologists that in these tests they arrange so that the odds that the things that happen by chance is small, in fact, less than one in twenty… And then he ran to me, and he said, “Calculate the probability for me that they should alternate, so that I can see if it is less than one in twenty.” I said, “It probably is less than one in twenty, but it doesn’t count.” He said, “Why?” I said, “Because it doesn’t make any sense to calculate after the event. You see, you found the peculiarity, and so you selected the peculiar case.” . . . If he wants to test this hypothesis, one in twenty, he cannot do it from the same data that gave him the clue. He must do another experiment all over again and then see if they alternate. He did, and it didn’t work. (Feynman, 1998 cited in Gigerenzer, 2004, p.17)
To be sure that the hypotheses were not set up after the results were known, check whether the study was pre-registered (here more on pre-registration). Of course, plenty of studies that are not pre-registered did not apply QRPs, and pre-registration does not necessarily protect against QRPs. However, QRPs seem to be less prevalent in pre-registered studies (Schimmack, 2020) and pre-registered studies seem to produce better estimates of effects than non-pre-registered studies (e.g., Schäfer & Schwarz, 2019; Strømland, 2019).
The methods section usually starts with describing the participants. Just by looking at the participants’ demographics, you will likely be able to draw some conclusions about the study’s generalisability. If the authors look for an effect which is highly influenced, for instance, by culture, it might be a good idea to conduct cross-cultural studies, etc.
As the reader of a finished article, you are unlikely to spot strong ethical violations in the procedures. This is what ethics committees are for. However, there is a range of factors which can mess up the experiment and be unethical, for example, making participants uncomfortable with insensitive questions when the consent form did not inform them that such questions could be asked.
Another problem is data collection methods which deliver inaccurate results. For example, recording heart rate at a low sampling rate might produce uninterpretable results. The same applies to questionnaires which are not validated; it will be difficult to interpret the results as one cannot be sure what has been measured. A research finding will only tell you as much as the measures allow (figure 5). Hence, make sure that reliability statistics are reported, such as McDonald’s omega or Cronbach’s alpha, not only for some of the scales, but ideally for all of them, including sub-scales. Those give good information about measurement error (table 1).
Table 1. Lower Reliability leads to a higher error rate
Taking Cronbach’s alpha as an example, alpha = .8 lies within the range of values which have been labelled acceptable (.7 to .95) in the literature, yet there is still .36 random error (see Tavakol & Dennick, 2011 for the calculation: .8^2=.64; 1-.64=.36). For the lower bound of this ‘acceptability range’, alpha=.7, the random error will already be as high as .51. In other words, low reliability measures add noise to the data. A p<.05 might not tell much when measures are unreliable and the sample size is low, as even small effect sizes can become significant when testing on noisy data (Loken & Gelman, 2017).
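To make the arithmetic above easy to repeat for other alpha values, here is a minimal Python sketch. It only applies the calculation cited from Tavakol & Dennick (2011), random error = 1 - alpha^2; the function name is my own and not taken from any library.

```python
def random_error_share(alpha):
    """Share of random error implied by a reliability coefficient.

    Uses the calculation cited in the text (Tavakol & Dennick, 2011):
    square alpha and subtract the result from 1.
    """
    return 1 - alpha ** 2


# Print the error share across the 'acceptable' range of alpha values.
for alpha in (0.95, 0.9, 0.8, 0.7):
    print(f"alpha = {alpha:.2f} -> random error = {random_error_share(alpha):.2f}")
```

Even at the top of the acceptable range, some random error remains, and it grows quickly as alpha drops.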
Finally, confounding variables can be a real headache, as they can reduce the predictive power of a model. When the model presented in a study does not fit well, ask yourself whether you can come up with a plausible confound. A great example is the positive correlation between ice cream sales and homicide rates (I first heard of it in the Quantitude Podcast). Eating ice cream does not increase violence (watch out, it’s a correlation!); heat, however, seems to do so to a certain degree (Anderson, 2001).
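The ice cream example can be illustrated with a small simulation. This is a sketch on made-up data, not a reanalysis of any real study: heat is assumed to drive both variables with arbitrary strength, and the helper functions are written here for illustration only.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partial_r(x, y, z):
    """Correlation between x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Assumption: heat influences both outcomes; they do not influence each other.
heat = [rng.gauss(0, 1) for _ in range(5000)]
ice_cream = [h + rng.gauss(0, 1) for h in heat]
homicide = [h + rng.gauss(0, 1) for h in heat]

raw = pearson_r(ice_cream, homicide)
adjusted = partial_r(ice_cream, homicide, heat)
print(f"raw correlation: {raw:.2f}")
print(f"controlling for heat: {adjusted:.2f}")
```

In this setup the raw correlation is clearly positive, while the correlation controlling for heat is close to zero, which is the pattern you would expect from a pure confound.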
1 Were the right statistical procedures used?
To be able to evaluate the findings of a study it is important to understand the statistical procedures. It can certainly happen that inappropriate statistical tests were applied, or that crucial steps were not considered, for example, correcting for multiple testing (if you run enough tests, some will come out significant by chance, even when there is no true effect).
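How quickly chance findings pile up can be shown with a few lines of Python. This is a minimal sketch that assumes all tests are independent and that there is no true effect in any of them; the Bonferroni correction is shown as one common (and conservative) remedy, not as the only option.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha level after Bonferroni correction."""
    return alpha / n_tests


for n_tests in (1, 5, 20):
    fwer = familywise_error_rate(0.05, n_tests)
    corrected = bonferroni_alpha(0.05, n_tests)
    print(f"{n_tests:>2} tests: chance of a false positive = {fwer:.2f}, "
          f"Bonferroni alpha per test = {corrected:.4f}")
```

With 20 uncorrected tests at alpha = .05, the chance of at least one false positive is already about 64%.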
2 Does the data processing make sense or were the data ‘beaten’ until they gave in for a p<.05?
What is much more difficult to evaluate is whether QRPs were used. One cannot be sure whether all tests were reported, and all data used, if the study was not pre-registered. There should be a logical rationale for data processing and exclusion of outliers. Inconsistencies, such as excluding participants without apparent reason, should be a warning sign.
3 Are the statistical tests sufficiently powered?
Still, psychology’s biggest problem child is the power of statistical tests. Power refers to the probability of detecting an effect if a true effect is present. Most tests in psychology are applied to small samples. Depending on the design, this can lead to low power. For example, the mentioned difference in replication rates in the OSC (2015) study between social and cognitive psychology is partly due to better powered designs of the selected cognitive psychology studies (here an introduction video to power).
Figure 3. Sensitivity analyses for simple within and between groups designs, assuming normally distributed data using two-tailed t-tests in G*Power. The top figure is a within subjects (paired-samples) t-test, the bottom figure is a between subjects (independent samples) t-test.
Assume we test 100 students and use common parameters in social psychology testing: 80% power and an alpha level of .05. Running an independent samples t-test with equally sized groups, we will only be able to detect effects with a minimum size of d=.57 (figure 3, right side), which is larger than the expected average effect in social psychology of d=.4. Testing the same 100 participants twice and running a paired samples t-test with the same assumptions allows us to detect smaller effects, down to a minimum of d=.28 (figure 3, left side).
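If you do not have G*Power at hand, the numbers above can be approximated in Python. The sketch below uses a normal approximation instead of the exact t-distribution calculation, so it gives slightly smaller values than G*Power (about .56 instead of .57 for the between-subjects case). The function name and the "between"/"within" labels are my own.

```python
from statistics import NormalDist


def min_detectable_d(n_total, design, alpha=0.05, power=0.80):
    """Approximate smallest effect size detectable in a two-tailed t-test.

    Normal approximation: d = (z_(1 - alpha/2) + z_power) * standard error.
    design = "between": n_total is split into two equally sized groups.
    design = "within":  all n_total participants are measured twice
                        (effect size of the paired differences).
    """
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    if design == "between":
        n_per_group = n_total / 2
        return z_sum * (2 / n_per_group) ** 0.5
    if design == "within":
        return z_sum / n_total ** 0.5
    raise ValueError("design must be 'between' or 'within'")


print(f"between subjects: d = {min_detectable_d(100, 'between'):.2f}")
print(f"within subjects:  d = {min_detectable_d(100, 'within'):.2f}")
```

The same function can be used to check the sample size reported in a paper: plug in the number of participants and the design, and compare the result with the effect sizes you would realistically expect in that field.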
While many social psychology experiments compare between subjects, within-subjects designs seem to be more common in cognitive psychology. And that is not necessarily surprising, because, for example, priming the same person twice in the same session rarely makes sense. In contrast, testing how many digits participants can remember (e.g., 4, 5 and 6) can be done with the same person.
Additionally, when testing not only for differences (main effects) between groups, but also differences between differences (interaction effects), suddenly a multiple of the sample size is required to achieve appropriate power (Gelman, 2018).
Furthermore, low power does not allow detecting effect sizes below a certain threshold. Problematically, most journals strongly prefer to publish positive findings. As a result, negative results disappear into the file drawer, never to see the light of day again. The result is a biased literature, or ‘publication bias’. Low-powered studies combined with reporting only significant findings can lead to a collection of inflated effects in the literature, which is even more problematic when QRPs are applied (Ioannidis, 2008). And this is why most replication efforts produce smaller or no effects compared to the original studies. Now, what to make of this when looking at a paper? You can run a sensitivity analysis to calculate the minimum detectable effect size based on the parameters in the paper in different programs, such as G*Power. If the design could only have detected effects much larger than those that are plausible for the field, a significant result is likely to be inflated. For more on sample size justification and power see Lakens (2021).
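The interplay of low power and publication bias can be made visible with a small simulation. This is a sketch under assumptions I chose for illustration: a true effect of d = .2, 20 participants per group, and a literature in which only significant results in the predicted direction get published.

```python
import random
import statistics


def simulate_study(true_d, n_per_group, rng):
    """Simulate one two-group study; return observed Cohen's d and t-value."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    pooled_sd = ((statistics.variance(control)
                  + statistics.variance(treated)) / 2) ** 0.5
    d = (statistics.mean(treated) - statistics.mean(control)) / pooled_sd
    t = d * (n_per_group / 2) ** 0.5  # t-statistic for equal group sizes
    return d, t


rng = random.Random(1)  # fixed seed so the sketch is reproducible
TRUE_D = 0.2            # assumed true effect
N_PER_GROUP = 20        # assumed (small) sample size per group
T_CRIT = 2.024          # two-tailed critical t for alpha = .05, df = 38

studies = [simulate_study(TRUE_D, N_PER_GROUP, rng) for _ in range(5000)]
all_d = [d for d, _ in studies]
published_d = [d for d, t in studies if t > T_CRIT]  # the 'file drawer' filter

mean_all = statistics.mean(all_d)
mean_published = statistics.mean(published_d)
print(f"share of studies published: {len(published_d) / len(all_d):.2f}")
print(f"mean effect, all studies:       {mean_all:.2f}")
print(f"mean effect, published studies: {mean_published:.2f}")
```

Across all simulated studies the average effect stays close to the true value, but the published subset shows a much larger average effect, simply because only large observed effects can cross the significance threshold at this sample size.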
4 Are there loads of p-values just below .05?
If a paper reports p-values from several studies, you can test whether their distribution is plausible by using the test of insufficient variance (TIVA), although it is recommended to use it together with other variance tests (Schimmack, 2014). If there are sufficient results present in the literature, and you want to be thorough, you could run a p-curve or z-curve analysis. These plot the distribution of p- or z-values and model the expected distribution of findings, taking publication bias into account. By doing so, you can often notice a spike in the distribution just below the golden p=.05 line. Such bias in the literature suggests the application of QRPs (see figure 4).
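To give an idea of how TIVA works, here is a minimal Python sketch of the logic as I understand it: two-sided p-values are converted to z-scores, and the variance of these z-scores is compared with the value of 1 expected for independent tests, using a left-tailed chi-square test. The p-values in the example are invented, and the function names are my own; for real analyses, use an established implementation.

```python
import math
from statistics import NormalDist, variance


def chi2_cdf(x, df):
    """Chi-square CDF via the series expansion of the regularized
    lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= half_x / (a + n)
        total += term
        n += 1
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))


def tiva(p_values):
    """Return the variance of the z-scores and the left-tailed p-value
    for the hypothesis that this variance is at least 1."""
    z_scores = [NormalDist().inv_cdf(1 - p / 2) for p in p_values]
    k = len(z_scores)
    var_z = variance(z_scores)
    p_left = chi2_cdf(var_z * (k - 1), k - 1)
    return var_z, p_left


# Hypothetical paper reporting five studies, all just below p = .05.
var_z, p_left = tiva([0.04, 0.03, 0.045, 0.02, 0.049])
print(f"variance of z-scores: {var_z:.3f}")
print(f"probability of so little variance by chance: {p_left:.4f}")
```

A variance far below 1 combined with a small probability indicates that the results are more similar to each other than sampling error should allow.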
Figure 4. A questionable literature
Left panel: A) Black line shows distribution of p-values when there is no evidential value and the red line shows how p-hacking influences this distribution. B) Black line shows distribution of p-values when there is evidential value and the red line shows how p-hacking influences this distribution. Tests for p-hacking often compare the number of p-values in two adjacent bins just below .05.
The right panel displays the distribution of z-values (higher z-values >> lower p-values). Top: Z-curve of 120 Psychology Journals 2010-2020, see Schimmack’s (2020) blog post for an overview. Bottom: Z-curve of results published in the Journal of Consumer Research
Note: Red dotted lines in the right panel illustrations mark the location of p=.05; most p-values fall just below this threshold. Source: https://twitter.com/R__INDEX/status/1361352109501935619/photo/1
The most common mistake in papers is that correlation is interpreted as causation. Probably the most prominent recent example is the now retracted article “The association between early career informal mentorship in academic collaborations and junior author performance”, which repeatedly used causal language to describe correlations.
Yet, this is only one of many examples where verbal claims do not logically follow from the presented inferential statistics. Hence, Yarkoni (2019) has stated that psychological claims are often not generalisable and that the field is in a generalisability crisis. So, double-check whether hypotheses and statistical tests align in a meaningful way, and whether the conclusions drawn from both make sense (figure 5). Often enough, authors oversell their claims, which gets the paper published, so watch out for that one too. An example is the overestimated influence of political bubbles or echo chambers (see Dubois & Blank, 2018; Eady et al., 2019).
Figure 5. Research is only as good as its measures.
In this article, I presented a list of red flags and strategies which can be helpful to think about when assessing a paper’s credibility, namely the 1) basic idea, 2) presented context, 3) hypotheses, 4) methods, 5) analyses, and 6) inferences. It is clear that many factors play into the quality of a research paper, including information beyond the single paper. The procedure of assessing a paper’s credibility can hence hardly be generalised, and there is always more to be learned when assessing papers. Nevertheless, with this article, I hope to provide a starting point for a critical reading of the literature.
- The CONSORT website offers a checklist and a flow chart to assess experimental trials.
- The Critical Appraisal Tools website from the University of Adelaide provides different checklists for different types of studies and texts.
- The CASP Checklists are in-depth checklists as they add a qualitative (open question) part.
- The APA journal article reporting standards show what should be included in a journal article following the standards of the American Psychological Association.