The very phrase suggesting that expediency and sleight of hand, rather than scientific method, guide the process, p-hacking refers to the widespread scientific practice of evaluating many associations in the data generated but reporting only those found to be statistically significant, specifically those with a p-value < 0.05.
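To make the arithmetic concrete, here is a minimal simulation; the dataset dimensions and seed are my own illustrative assumptions, not taken from any cited study. Testing 20 noise-only associations per dataset and reporting only the smallest p-value produces a ‘finding’ in roughly two-thirds of datasets despite the nominal 5% error rate.

```python
# Minimal p-hacking simulation; dimensions are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_datasets, n_assoc, n = 1000, 20, 30  # datasets, associations tested, samples
hits = 0
for _ in range(n_datasets):
    # The null is true everywhere: outcome and predictors are pure noise.
    y = rng.normal(size=n)
    pvals = [stats.pearsonr(rng.normal(size=n), y)[1] for _ in range(n_assoc)]
    if min(pvals) < 0.05:  # report only the best-looking association
        hits += 1

# Expected rate of at least one "finding": 1 - 0.95**20, roughly 0.64.
print(f"datasets with a 'significant' result: {hits / n_datasets:.2f}")
```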
A symptom of the intense pressure in academia to publish, p-hacking is but one tool in the tool-kit of selective reporting, which is itself part of the far more pervasive and far more consequential publication bias, the well-recognized and pronounced tendency for positive, rather than negative or inconclusive, results to get published.
Obviously, such selective publication of data thoroughly distorts understanding (3),
‘If white swans remain unpublished, reports of black swans cannot be used to infer on general swan color. In the worst case, publication bias means according to Rosenthal (1979) that the 95% of studies that correctly yield nonsignificant results may be vanishing in file drawers, while journals may be filled with the 5% of studies committing the alpha error by claiming to have found a significant effect when in reality the null hypothesis is true.’
As a result, studies found to be irreproducible only pile up over time, making this a problem defined by its chronicity.
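A toy sketch of Rosenthal’s file-drawer scenario follows; the study counts and group sizes are made-up assumptions. Every simulated study compares two identical distributions, so any ‘published’ result below p < 0.05 is, by construction, an alpha error.

```python
# Toy file-drawer simulation: every study tests a true null, so each
# "published" result is an alpha error by construction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
published, file_drawer = 0, 0
for _ in range(10_000):
    a = rng.normal(size=25)  # group A, no true effect
    b = rng.normal(size=25)  # group B, same distribution as A
    if stats.ttest_ind(a, b).pvalue < 0.05:
        published += 1       # false positive that reaches the journals
    else:
        file_drawer += 1     # correct nonsignificant result, unpublished

print(f"published (all false positives): {published}")
print(f"file drawer (correct nulls): {file_drawer}")
```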
p-hacking and other data dredging efforts have become commonplace in biological research because
- Inherently flawed thinking has come to dominate biological research, specifically a preoccupation with statistical rather than biological significance, and the notion that test results are or should be binary: either significant (accepted) or nonsignificant (rejected).
- Even though the original intent was that they would aid scientists in decision making and/or risk analysis, p-values have come to occupy a central position in biological research because they lend themselves to such artificial dichotomy.
‘Fisher offered the idea of p-values as a means of protecting researchers from declaring truth based on patterns in noise. In an ironic twist, p-values are now often manipulated to lend credence to noisy claims based on small samples’ (Gelman & Loken, 2014). ‘And the manipulation can happen completely unintentionally, without the researcher performing any conscious procedure of fishing through the data’
‘In my opinion p-values are one of many systems for looking at data (Senn, 2001). In part their limitations stem from the fact that they are all too often used to summarize a complex situation with false simplicity’ (5).
These are general critiques and observations of experimental biological research. Some of these issues are amplified in microbiota metagenomics studies because
- Many gut microbiota studies do an extremely poor job of estimating error, a consequence of two common experimental design flaws: small sample sizes, i.e., too few biological replicates, compounded by non-existent or too few technical replicates, i.e., the number of times different aliquots of the same sample are run A-to-Z through the same technique. Biological and technical replicates are important because they yield measures of inter- and intra-individual variation, respectively.
- Metagenomics is inherently extremely sensitive, so discriminating signal from noise is already a challenge. Poorly powered studies simply exacerbate this difficulty, where statistical power is defined as the probability that a given test would be significant were the alternative hypothesis true, i.e., the probability of correctly rejecting the null hypothesis (the simulation after this list illustrates the point).
- Since reliably detecting true effects requires adequately large sample sizes, many gut microbiota studies, with their small sample sizes, artificially lend themselves to p-hacking and other data dredging efforts.
- The reality is that most microbiota studies are still observational in nature, simply comparing microbiota composition between two sets of people or animals. They are exploratory rather than confirmatory, and thus inherently incapable of supporting weighty inferences, which end up being drawn anyway. This tendency to treat exploratory findings as confirmatory is partly responsible for ludicrously lofty conclusions.
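As a rough illustration of the power problem flagged above, the simulation below uses an assumed true effect of 0.3 standard deviations and illustrative replicate counts. Small groups both miss real effects and, when they do cross p < 0.05, overestimate them.

```python
# Power and the "winner's curse" under small samples; the true effect
# size (0.3 SD) and replicate counts are assumptions for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.3  # assumed modest standardized group difference

for n in (10, 30, 100, 400):  # biological replicates per group
    sig_effects = []
    for _ in range(2000):
        a = rng.normal(0.0, 1.0, size=n)
        b = rng.normal(true_effect, 1.0, size=n)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            sig_effects.append(b.mean() - a.mean())
    power = len(sig_effects) / 2000
    mean_sig = np.mean(sig_effects) if sig_effects else float("nan")
    print(f"n={n:3d}  power~{power:.2f}  mean 'significant' effect~{mean_sig:.2f}")
```

At ten replicates per group, power hovers near 10% and the effects that do reach significance run roughly three times the true value, exactly the kind of noise that data dredging then mistakes for signal.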
Thus, p-hacking isn’t so much a bug or glitch to be eliminated through tweaks but rather an essential and inevitable feature of a flawed approach to experimental science. How then could the study of microbiota metagenomics, or of any other biological phenomenon for that matter, be improved? Should there be fewer but larger studies, or more but smaller ones? Since each study has issues of inherent bias and relative, not absolute, precision, what matters or should matter (7)
‘is not replication defined by the presence or absence of statistical significance, but the evaluation of the cumulative evidence and assessment of whether it is susceptible to major biases’
How could this goal be accomplished though? (3)
‘We should design, execute, and interpret our research as a “prospective meta-analysis” (Ioannidis, 2010), to allow combining knowledge from multiple independent studies, each producing results that are as unbiased as possible’
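As a minimal sketch of what combining evidence across studies can look like, here is a plain inverse-variance, fixed-effect pooling of three hypothetical effect estimates; the helper function and the numbers are illustrative assumptions, not a published method from the quoted authors.

```python
# A fixed-effect (inverse-variance) meta-analysis sketch with made-up
# study estimates, showing only the mechanics of pooling evidence.
import numpy as np

def fixed_effect_meta(estimates, std_errors):
    """Pool per-study effect estimates weighted by their precision."""
    w = 1.0 / np.asarray(std_errors) ** 2          # precision weights
    pooled = np.sum(w * np.asarray(estimates)) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))           # SE of pooled estimate
    return pooled, pooled_se

# Three hypothetical studies of the same microbiota-associated effect:
estimates = [0.40, 0.10, 0.25]
std_errors = [0.20, 0.15, 0.10]

pooled, se = fixed_effect_meta(estimates, std_errors)
print(f"pooled effect: {pooled:.2f} +/- {se:.2f}")
```

Each study contributes in proportion to its precision, which is why the ‘as unbiased as possible’ proviso matters: pooling cannot rescue estimates whose errors are systematic rather than random.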
However, ‘combining knowledge from multiple independent studies’ is relatively difficult for microbiota metagenomics studies, since most of the steps involved in analyzing gut microbiota haven’t yet been standardized (8, 9, 10, 11, 12, 13). These include
- Optimal methods to collect samples (many confounders to account for, including age, gender, diet).
- How to process and store samples (e.g., feces or biopsy, aerobic or anaerobic).
- Optimal choice of DNA extraction method.
- Consensus on fool-proof approaches to minimize contamination from reagents and disposables.
- Consensus on controls to assess contamination during DNA extraction and amplification.
- Method used to analyze microbiome DNA: shotgun metagenomics or 16S rRNA sequencing.
- If 16S rRNA, then which variable region(s), which primers, and how many PCR cycles to run.
- What sequencing technology to use.
- What bioinformatics tool(s) to use to analyze the data: type of taxonomic classification, clustering techniques, functional analyses.
And these are only the technical issues! There is also the scientific issue of study design (13, 14, 15): how many experimental groups, how many samples per group, and repeated versus one-time sampling (see the sketch below). Lack of standardization is a major reason meta-analysis is still in its infancy in microbiota studies (16). Too many variables and confounders differ between studies to allow for meaningful data comparison. Which techniques lend themselves to particular kinds of biases in the resulting data, and how best to minimize such biases, still remain to be determined.
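For the sample-size side of study design, here is a hedged sketch using statsmodels’ power calculator for a two-sample t-test; the effect sizes (Cohen’s d) and the 80% power target are conventional assumptions, not recommendations from the cited papers.

```python
# Prospective sample-size planning for a two-sample comparison, using
# statsmodels' TTestIndPower. Effect sizes and power target are
# conventional assumptions chosen for illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):  # small, medium, large standardized effects
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative="two-sided")
    print(f"d={d}: ~{n:.0f} samples per group for 80% power")
```

If the true effects are small, this kind of calculation demands hundreds of samples per group, far more than many published microbiota studies recruit.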
Such critiques aren’t merely academic but have real-world costs. Consider for example Crohn’s disease, where published studies vary widely in their results (17). Microbiota differences between obese and lean individuals have also been difficult to replicate (18). This is also the case with inflammatory bowel disease (IBD), where microbiota metagenomic data alone aren’t yet sufficient to discriminate between healthy and specific IBD states but rather serve to complement other types of diagnostics (18).
1. Farcomeni, A. “Contribution to the discussion of the paper by Stefan Wellek: ‘A critical evaluation of the current p-value controversy’.” Biometrical Journal (2017).
2. Smaldino, Paul E., and Richard McElreath. “The natural selection of bad science.” Royal Society Open Science 3.9 (2016): 160384.
3. Amrhein, Valentin, Fränzi Korner-Nievergelt, and Tobias Roth. “The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research.” PeerJ Preprints (2017): e2921v1.
4. Gelman, Andrew, and Eric Loken. “The statistical crisis in science.” American Scientist 102 (2014): 460-465.
5. Senn, Stephen. “Contribution to the discussion of ‘A critical evaluation of the current p-value controversy’.” Biometrical Journal (2017).
6. Wasserstein, Ronald L., and Nicole A. Lazar. “The ASA’s statement on p-values: context, process, and purpose.” The American Statistician 70.2 (2016): 129-133.
7. Goodman, Steven N., Daniele Fanelli, and John P. A. Ioannidis. “What does research reproducibility mean?” Science Translational Medicine 8.341 (2016): 341ps12.
8. Weiss, Sophie, et al. “Tracking down the sources of experimental contamination in microbiome studies.” Genome Biology 15.12 (2014): 564.
9. Bik, Elisabeth M. “The hoops, hopes, and hypes of human microbiome research.” The Yale Journal of Biology and Medicine 89.3 (2016): 363.
10. Weiss, Sophie, et al. “Correlation detection strategies in microbial data sets vary widely in sensitivity and precision.” The ISME Journal 10.7 (2016): 1669-1681.
11. Kim, Dorothy, et al. “Optimizing methods and dodging pitfalls in microbiome research.” Microbiome 5.1 (2017): 52.
12. Vandeputte, Doris, et al. “Practical considerations for large-scale gut microbiome studies.” FEMS Microbiology Reviews 41.Supp_1 (2017): S154-S167.
13. Claesson, Marcus J., Adam G. Clooney, and Paul W. O’Toole. “A clinician’s guide to microbiome analysis.” Nature Reviews Gastroenterology & Hepatology (2017).
14. Laukens, Debby, et al. “Heterogeneity of the gut microbiome in mice: guidelines for optimizing experimental design.” FEMS Microbiology Reviews 40.1 (2016): 117-132.
15. Debelius, Justine, et al. “Tiny microbes, enormous impacts: what matters in gut microbiome studies?” Genome Biology 17.1 (2016): 217.
16. Lozupone, Catherine A., et al. “Meta-analyses of studies of the human microbiota.” Genome Research 23.10 (2013): 1704-1714.
17. Gevers, Dirk, et al. “The treatment-naive microbiome in new-onset Crohn’s disease.” Cell Host & Microbe 15.3 (2014): 382-392.
18. Walters, William A., Zech Xu, and Rob Knight. “Meta-analyses of human gut microbes associated with obesity and IBD.” FEBS Letters 588.22 (2014): 4223-4233.