Statistical Significance: A Learning Guide

What You're About to Understand

After working through this guide, you'll be able to design an A/B test from scratch — choosing the right sample size, MDE, and significance level — and explain why each choice matters. You'll spot the most common misinterpretations of p-values (the ones even trained researchers get wrong), and you'll know the right question to ask when someone tells you a result is "statistically significant": significant compared to what, with how much data, and does the effect size actually matter?

The One Idea That Unlocks Everything

Think of statistical significance like a metal detector on a beach. The p-value doesn't tell you "there's gold here." It tells you "the detector beeped." Whether there's actually gold depends on how many gold coins are buried on the beach in the first place (the base rate), how sensitive you set the detector (your significance threshold α), and how deep the detector can reach (your statistical power). A beep on a beach full of bottle caps means something very different from a beep in a known treasure field — even if the detector works identically in both cases.

If you remember only this: a p-value measures how loudly the detector beeps, not the probability that what's buried is gold.

Learning Path

Step 1: The Foundation [Level 1]

Picture a friend who claims they can predict coin flips. You flip a coin 10 times; they get 8 right. Impressive? Or lucky?

Here's how a statistician thinks about it. You start by assuming your friend has no ability — they're guessing (this is the null hypothesis, H₀). Under pure guessing, getting 8/10 right happens about 5.5% of the time. That number — 5.5% — is the p-value. It answers: "If my friend were just guessing, how often would I see results this impressive or more?"

Now, the conventions:

P-value: The probability of seeing data this extreme assuming the null hypothesis is true. It is NOT the probability that the null is true.
Significance level (α): A pre-set threshold, conventionally 0.05. If p ≤ α, you "reject" the null hypothesis and call the result "statistically significant."
Confidence interval (CI): A range around your estimate. A 95% CI is constructed as: point estimate ± 1.96 × standard error. If the CI for a difference excludes zero, that corresponds exactly to p < 0.05 on a two-sided test. They're two views of the same math.
Type I error (false positive): Declaring an effect exists when it doesn't. Rate controlled by α.
Type II error (false negative): Missing a real effect. Rate is β.
Power (1 − β): Your probability of catching a real effect. Convention: aim for ≥ 80%.

Key Insight: There are four linked quantities — effect size, sample size, significance level (α), and power. Know any three, and the fourth is determined. This linkage is the engine behind all experiment design.

MDE (Minimum Detectable Effect) is the smallest effect your experiment can reliably detect given your sample size, α, and power. In CRO terms: if your site converts at 3% and you can only run 36,000 visitors per variant, your MDE is roughly a 10% relative lift. You physically cannot detect anything smaller with confidence. MDE forces honesty about what your experiment can actually tell you.

Check your understanding:
1. A colleague says "p = 0.03, so there's only a 3% chance the null hypothesis is true." What's wrong with this statement?
2. Why can't you just run an experiment "until it's significant"?

Step 2: The Mechanism [Level 2]

Here's where the gears click into place.

Why everything scales with the square root. The standard error of a sample mean is σ/√n. As you collect more data, your estimate gets more precise — but painfully slowly. Double your sample? Precision improves by only a factor of √2 ≈ 1.41. To halve the width of a confidence interval, you need four times the data. To detect an effect half the size, you need four times the data. This √n relationship comes straight from the Central Limit Theorem, and it governs all of frequentist statistics.

Worked example — Designing an A/B test:

Your e-commerce site converts at 3% (baseline p₁ = 0.03). You want to detect a 10% relative lift (to p₂ = 0.033, so absolute MDE = 0.003). You choose α = 0.05 (two-sided) and 80% power.

The sample size formula for two proportions:
n = (Z_{α/2} + Z_β)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₁ − p₂)²

Plugging in: Z_{0.025} = 1.96, Z_{0.20} = 0.842

n = (1.96 + 0.842)² × [0.03 × 0.97 + 0.033 × 0.967] / (0.003)²
n ≈ 7.85 × [0.0291 + 0.0319] / 0.000009
n ≈ 7.85 × 0.061 / 0.000009
n ≈ ~53,200 per variant

At 10,000 visitors/month, that's over 10 months of testing. Now suppose you want to detect a 1% relative lift instead. MDE drops by 10×, so n increases by 100×. You'd need ~5.3 million per variant. This is the √n scaling in action — it's why MDE is the most important practical decision in experiment design.

What the p-value actually computes — and why it's "backwards." Scientists want to know P(hypothesis is true | data). The p-value gives P(data this extreme | hypothesis is true). These are not the same thing. This is like the difference between "the probability a dog has four legs" vs. "the probability that a four-legged animal is a dog." To get from one to the other, you need Bayes' theorem — and crucially, the base rate of true hypotheses in your domain. The p-value framework deliberately ignores this base rate.

The base rate problem in action: Imagine testing 100 drug candidates. 10 actually work. With 80% power and α = 0.05:
- You detect 8 of the 10 real effects (true positives)
- You get ~4.5 false alarms from the 90 duds (90 × 0.05)
- Total "significant" results: ~13. False discovery rate: 4.5/13 ≈ 35%.

Now imagine only 1 in 100 candidates works. False discovery rate jumps to ~86%. Same test, same α, radically different reliability — because the base rate changed.

Check your understanding:
1. You want to halve your MDE but keep the same power and α. What happens to your required sample size, and why?
2. In a field where most hypotheses tested are speculative long shots, should you trust a single p < 0.05 result? Why or why not?

Step 3: The Hard Parts [Level 3]

The Frankenstein framework nobody asked for. The method most researchers use — Null Hypothesis Significance Testing (NHST) — is a historically accidental mashup of two incompatible philosophies.

Ronald Fisher (1925) saw p-values as a continuous measure of evidence. A small p-value meant "worth a closer look," not a binary verdict. Jerzy Neyman and Egon Pearson (1928–33) built a completely different system: pre-set error rates (α, β), define alternative hypotheses, make binary accept/reject decisions. The p-value itself was irrelevant in their framework — only whether it crossed α mattered.

Mid-20th century textbook authors — mostly in psychology and social sciences — merged these into NHST without understanding the philosophical incompatibility. The result is what statistician Gerd Gigerenzer called "an incoherent mishmash." Researchers use Fisher's p-value as if it were Neyman-Pearson's α, treating a continuous evidence measure as a binary decision rule. Neither Fisher nor Neyman would endorse what's taught today.

This historical accident is arguably the root cause of the modern replication crisis.

The peeking problem. If you check your A/B test results repeatedly and stop when p < 0.05, your actual false positive rate isn't 5% — simulations show it can reach 20–30%. Why? Because p-values computed at different sample sizes during the same test are correlated, and repeated looks give you more chances to catch a random fluctuation. Sequential testing methods (O'Brien-Fleming, alpha spending functions) solve this by "budgeting" your total α across planned interim looks — modern platforms like Statsig and GrowthBook implement these.

The winner's curse. Underpowered studies that happen to achieve significance systematically overestimate effect sizes. If the true effect is d = 0.2 but your study only has 20% power, the observed effects that manage to cross the significance threshold will average d ≈ 0.5–0.8. Then the next researcher uses your inflated estimate to power their study, which is now also underpowered, perpetuating the cycle. Median statistical power in neuroscience is just 21% (Button et al., 2013). The published literature in many fields is a funhouse mirror of reality.

The "everything is significant" problem. When your sample is enormous — as in big tech A/B testing or genomics — virtually every effect, no matter how trivial, becomes statistically significant. A drug that lowers blood pressure by 0.1 mmHg can be "highly significant" with n = 100,000, yet clinically meaningless. The entire NHST framework was designed for small-sample science. In the big data era, the concept of statistical significance begins to collapse, and the question shifts entirely to practical significance and effect size.

Check your understanding:
1. A neuroscience paper reports a surprising finding with p = 0.01 from a study of 15 participants. Given what you know about power and the winner's curse, what should you suspect about the reported effect size?
2. Why does pre-registration help with p-hacking, but not fully solve the problem?

The Mental Models Worth Keeping

1. The Metal Detector Model
A p-value is the beep, not the gold. Whether a "significant" finding is actually true depends on the base rate of true hypotheses in your domain — something the p-value doesn't incorporate. Use it when: evaluating any significant result. Ask: "How likely was this hypothesis to be true before the data?"

2. The Square Root Tax
Precision costs quadratically. Halving your uncertainty requires 4× the data. Halving your MDE requires 4× the data. Use it when: scoping experiments, deciding on MDE, or evaluating whether a study was adequately powered. Real example: detecting a 1% relative lift on a 3% conversion rate requires ~3.6 million visitors per variant. A 10% relative lift? ~36,000.

3. The Four-Way Tradeoff
Effect size, sample size, α, and power are locked together. You cannot improve one without paying in another (or increasing sample size). Use it when: designing any experiment. If someone demands higher power AND a smaller MDE without more data, they're asking for a physical impossibility.

4. The Funhouse Mirror
Published significant results from underpowered studies systematically exaggerate effect sizes. The weaker the study, the more inflated the surviving results. Use it when: reading published research. If the study was small, mentally shrink the reported effect toward zero.

5. Fisher vs. Neyman-Pearson — The Incoherent Hybrid
Modern NHST is a confused merger of two incompatible philosophies. Fisher's continuous evidence vs. Neyman-Pearson's binary decisions. Use it when: interpreting your own discomfort with "p = 0.049 is significant but p = 0.051 isn't." That discomfort is correct — it reflects a genuine philosophical incoherence in the method.

What Most People Get Wrong

1. "The p-value is the probability the null hypothesis is true"
Why people believe it: It's the natural reading of "p = 0.03." The correct definition (probability of data this extreme, given H₀ is true) is genuinely counterintuitive — it's backwards from what you want.
What's actually true: The p-value assumes H₀ is true and asks how surprising the data are. To get P(H₀ true | data), you need Bayes' theorem and a prior.
How to tell: If someone says "there's only a 3% chance this is due to chance," they've made the inversion error.

2. "Non-significant means no effect"
Why people believe it: The language of "failing to reject" sounds like "accepting the null." And p > 0.05 is often reported as "no difference found."
What's actually true: Non-significance could mean there's no effect, OR the study was simply too small to detect one. Absence of evidence ≠ evidence of absence.
How to tell: Check the confidence interval. If it's wide enough to include both "no effect" and "large effect," the study was uninformative, not negative.

3. "Smaller p means bigger effect"
Why people believe it: P-values feel like a strength-of-effect meter. And within a single study design, smaller p does correlate with larger observed effects.
What's actually true: P-values are equally influenced by sample size. A tiny, meaningless effect with n = 1,000,000 can produce p < 0.0001. A large, important effect with n = 10 can produce p = 0.15.
How to tell: Always look at the effect size and confidence interval alongside the p-value. If only p is reported, be suspicious.

4. "A 95% CI has a 95% probability of containing the true value"
Why people believe it: "95% confidence" sounds exactly like that. Even most researchers interpret it this way.
What's actually true (in frequentist terms): The true value is fixed — it's either inside or it isn't. The 95% refers to the procedure: if you repeated the study infinitely, 95% of the intervals would contain the truth. Any specific interval either does or doesn't. (Bayesian credible intervals do have the intuitive interpretation people want.)
How to tell: If someone treats a single CI as a probability statement about the parameter, they're using the Bayesian interpretation — which is fine for Bayesian credible intervals, but technically incorrect for frequentist CIs.

5. "If two confidence intervals overlap, the difference isn't significant"
Why people believe it: It seems visually obvious — overlap means the ranges agree.
What's actually true: Overlapping CIs can coexist with a statistically significant direct comparison (p < 0.05). The CI for a difference is not the same as comparing two individual CIs.
How to tell: Always run a direct comparison test rather than eyeballing CI overlap.

The 5 Whys — Root Causes Worth Knowing

Why is p < 0.05 the standard?
Fisher proposed it in 1925 → because 1.96 standard deviations ≈ 5% tail probability (a round number) → because he needed practical tables for researchers computing by hand → because textbooks and journals institutionalized it → because changing an entrenched norm requires coordinated action across thousands of institutions, and each actor individually benefits from the status quo.
- Level 2 deep: The costs of false positives are diffuse and delayed; the costs of changing are concentrated and immediate. Classic collective action problem.
- Level 3 deep: This is a tragedy of the commons — the statistical equivalent of overfishing.

Why do researchers misinterpret p-values?
NHST conflates two incompatible frameworks → textbooks teach the hybrid without explaining the inconsistency → textbook authors are themselves products of confused training → statistics is taught as recipes rather than reasoning → demand for training far exceeds supply of people who understand the foundations.
- Level 2 deep: Compression removes the nuances that prevent misinterpretation, leaving a clean but wrong mental model.
- Level 3 deep: The wrong model ("p = probability hypothesis is false") persists because it matches the question scientists actually want to answer. The correct interpretation is deeply unintuitive because human cognition is naturally Bayesian.

Why does the replication crisis exist?
Many published findings are false positives or have inflated effects → p-hacking + publication bias + underpowered studies → academic incentives reward novel significant findings → tenure/funding/prestige allocated by publication count → the system optimizes for individual productivity rather than collective accuracy.
- Level 2 deep: The harm of false positives is diffuse; the benefit of publication is concentrated.
- Level 3 deep: Those with power to change the system are its beneficiaries. Classic institutional inertia.

The Numbers That Matter

1. α = 0.05 — The conventional threshold. A 1-in-20 chance of a false positive per test. Chosen by Fisher for convenience, not because it's mathematically special. To put it in perspective: if you run 20 independent tests, you have a 64% chance of at least one false positive.

2. Median power in neuroscience: 21% (Button et al., 2013). That means most neuroscience studies have less than a 1-in-4 chance of detecting the effects they're looking for. It's like going bird-watching with binoculars so bad you'd miss 4 out of 5 birds.

3. Replication rate in psychology: 36% (Open Science Collaboration, 2015). 97% of the original studies reported p < 0.05. Only about a third held up. Mean effect sizes halved on replication.

4. The 4× rule. To halve your CI width: 4× the sample. To halve your MDE: 4× the sample. To go from α = 0.05 to α = 0.005 at equivalent power: 70% more data. This is the square root tax making everything expensive.

5. The base rate trap. Testing 100 drug candidates where 10 work (with 80% power, α = 0.05): ~35% of your "significant" results are false positives. Testing where only 1 in 100 works: ~86% false positives. Same statistical test, wildly different reliability.

6. MDE benchmarks in CRO. High-traffic mature organizations: 1–2% relative MDE. Typical organizations: 2–5%. Low-traffic sites: 5–10%+. If your site gets 10,000 visitors/month at 3% conversion, detecting a 1% relative lift would take ~60 years.

7. The FDA's effective α ≈ 0.0025. By requiring p < 0.05 from two independent confirmatory trials, the FDA's actual false positive rate is 0.05 × 0.05. That's 50× more stringent than a single test at 0.05.

8. P = 0.05 corresponds to a Bayes factor of only 2.5–3.4 in favor of the alternative hypothesis. In Bayesian terms, that's "weak" to "anecdotal" evidence. This mismatch is why Benjamin et al. (2018) proposed lowering α to 0.005.

Where Smart People Disagree

Should we ban the phrase "statistically significant"?
The 2019 American Statistician editorial (Wasserstein, Schirm, Lazar) argued yes — binary thinking causes more harm than good. The counter: binary decisions are sometimes genuinely needed (approve drug or not, ship feature or not), and removing thresholds without a clear replacement creates chaos. The ASA's own task force later clarified the editorial was not official ASA policy. Unresolved because the field hasn't agreed on what should replace the binary framework.

Should α be lowered to 0.005?
Benjamin et al. (2018), backed by 72 prominent researchers, argued this would reduce false positives by ~80% and better align with Bayesian evidence standards. Critics counter: it's still arbitrary, would require 70% larger samples (devastating for fields that can't afford them), and doesn't address the real problem — the fixation on thresholds rather than estimation and effect sizes. Unresolved because it pits reducing false positives against feasibility of research.

Bayesian vs. Frequentist — which framework should dominate?
Bayesians argue their framework answers the right question (P(hypothesis | data)), naturally incorporates prior knowledge, gives intuitive credible intervals, and has no peeking problem. Frequentists counter: no subjective priors needed, clear error rate guarantees, established regulatory frameworks, and computational simplicity. With flat priors, both often give similar results. Many modern statisticians are pragmatists who use both. Unresolved because the disagreement is ultimately philosophical — about the meaning of probability itself.

Is NHST salvageable?
Reformers say yes, with better education, pre-registration, and required reporting of effect sizes and CIs. Revolutionaries say no — the framework is fatally flawed and should be replaced entirely with Bayesian, likelihood-based, or estimation approaches. Pragmatists say it's one tool among many. Unresolved because replacing a deeply entrenched system requires coordinated action from thousands of journals, funding agencies, and training programs.

What You Don't Know Yet (And That's OK)

After this guide, you understand how statistical significance works mechanically, why it's so often misused, and how to design and interpret experiments thoughtfully. Here's where your knowledge runs out:

Always-valid inference: E-values and confidence sequences allow continuous monitoring without inflating false positive rates — a genuine breakthrough for online experimentation. This is emerging and may reshape how A/B testing works.
Optimal multiple comparison correction: Bonferroni is too conservative, FDR may be too liberal, and choosing the right method is itself a judgment call that depends on domain-specific error costs. No universal solution exists.
Combining dependent p-values: Fisher's method assumes independence. Real studies share data, methods, and researchers. How to properly combine correlated evidence is an active research area.
The base rate estimation problem: Knowing the false discovery rate in a field requires estimating the proportion of hypotheses that are actually true — and nobody has a reliable way to do that.
How to make statistics education stick: The core pedagogical problem — teaching a framework that is inherently counterintuitive without creating systematic misunderstanding — remains unsolved.
Causal inference: Understanding that an effect is statistically significant is very different from understanding whether it's causal. Methods like difference-in-differences, instrumental variables, and regression discontinuity represent a whole adjacent field.

Subtopics to Explore Next

1. Bayesian Statistics and Credible Intervals
Why it's worth it: Unlocks the interpretation that people want from confidence intervals, and gives you a framework for incorporating prior knowledge formally.
Start with: "Bayesian A/B testing" — search for how companies like VWO report "probability of being best" instead of p-values.
Estimated depth: Medium (half day)

2. Power Analysis and Sample Size Calculation (Hands-On)
Why it's worth it: Turns you from someone who understands the theory into someone who can actually design experiments with the right sample sizes.
Start with: Use G*Power software or an online calculator. Try computing sample sizes for your own conversion rates and desired MDEs.
Estimated depth: Medium (half day)

3. Sequential Testing and Alpha Spending
Why it's worth it: Solves the peeking problem — the gap between how experiments are supposed to run (fixed sample) and how they actually run (continuous monitoring).
Start with: "O'Brien-Fleming alpha spending function" and how modern A/B platforms implement it.
Estimated depth: Medium (half day)

4. The Replication Crisis — In Depth
Why it's worth it: Transforms abstract statistical concepts into a concrete narrative about how science went wrong and what's being done about it. Makes the stakes of misusing statistics visceral.
Start with: Ioannidis (2005), "Why Most Published Research Findings Are False" — the most-cited methodology paper ever.
Estimated depth: Medium (half day)

5. Multiple Comparisons (Bonferroni, FDR, Benjamini-Hochberg)
Why it's worth it: Essential for anyone running multiple metrics or variants in experiments. Without this, you'll generate false positives at alarming rates.
Start with: "Benjamini-Hochberg procedure tutorial" — the practical workhorse of modern multiple testing.
Estimated depth: Surface (1–2 hours)

6. Effect Sizes and Practical Significance
Why it's worth it: The missing half of any significance result. Knowing that an effect exists matters less than knowing how big it is.
Start with: Cohen's d, and the concept of "minimum clinically important difference" in medicine or "minimum worthwhile effect" in CRO.
Estimated depth: Surface (1–2 hours)

7. Causal Inference Methods
Why it's worth it: Moves you from "is there a significant association?" to "did X actually cause Y?" — the question that usually matters.
Start with: The concept of "identification strategy" and difference-in-differences.
Estimated depth: Deep (multi-day)

8. E-values and Always-Valid Inference
Why it's worth it: Possibly the future of hypothesis testing. E-values can be accumulated across studies, allow optional stopping, and bridge Bayesian and frequentist viewpoints.
Start with: Grünwald et al., "Safe Testing" — the foundational paper on e-values.
Estimated depth: Deep (multi-day)

Key Takeaways

A p-value tells you how surprising the data are under the null hypothesis — it does NOT tell you the probability the hypothesis is true. Confusing these is the single most consequential error in applied statistics.
The 0.05 threshold is a historical convenience, not a law of nature. Fisher called it "convenient" in 1925, and the field has been stuck with it ever since.
Modern NHST is a Frankenstein merger of Fisher's and Neyman-Pearson's incompatible frameworks. Neither creator would endorse what's taught in textbooks today.
Precision scales with the square root of sample size. This "square root tax" makes detecting small effects extraordinarily expensive and is the fundamental constraint of experiment design.
A significant result from an underpowered study will almost certainly overestimate the true effect size. The weaker the study, the bigger the exaggeration of surviving results.
The reliability of a significant finding depends critically on the base rate of true hypotheses in your domain — something the p-value doesn't incorporate and most researchers don't consider.
Statistical significance and practical significance are completely different things. With enough data, trivial effects become "highly significant."
"Non-significant" does not mean "no effect." It means you didn't collect enough evidence — possibly because you didn't collect enough data.
Confidence intervals contain strictly more information than p-values. They show both the estimate and the uncertainty, making it harder to confuse a tiny effect with an important one.
In CRO and A/B testing, MDE is the most important practical decision. It determines how long your test runs and what effects you can actually learn about.
Pre-registration constrains p-hacking but can't eliminate it. With 10 binary analytical choices, there are 1,024 possible analyses of the same dataset.
If you peek at A/B test results repeatedly without sequential testing corrections, your real false positive rate can reach 20–30%, not the 5% you think you have.
Bayesian credible intervals give the interpretation people want from confidence intervals — an actual probability statement about the parameter. When someone says "95% chance the true value is in this range," they're being a Bayesian, whether they know it or not.
The replication crisis is not a failure of statistics — it's a failure of incentives. The tools work fine when used as designed; the problem is that the academic system rewards misuse.

Sources Used in This Research

Primary Research:
- Open Science Collaboration (2015) — large-scale replication of 100 psychology studies (36% replication rate)
- Benjamin et al. (2018) — proposal to redefine statistical significance at α = 0.005 (72 co-authors)
- Ioannidis (2005) — "Why Most Published Research Findings Are False"
- Button et al. (2013) — power analysis of neuroscience studies (median power 21%)
- Cohen (1962) — power analysis of psychology studies (median power 48%)
- Greenland et al. (2016) — 25 documented misinterpretations of p-values and CIs
- Jager & Leek — empirical false positive rate estimate (~14%) in biomedical literature

Expert Commentary:
- ASA Statement on P-values (2016) — six principles on proper use
- Wasserstein, Schirm & Lazar (2019) — American Statistician editorial proposing to retire "statistical significance"
- Geoff Cumming — "The New Statistics" movement (effect sizes + CIs over NHST)
- Gigerenzer — critique of NHST as "an incoherent mishmash"
- Meehl — paradox of significance testing across hard vs. soft sciences

Historical / Reference:
- Fisher (1925) — Statistical Methods for Research Workers
- Fisher (1935) — The Design of Experiments (Lady Tasting Tea)
- Neyman & Pearson (1928–1933) — alternative hypothesis framework, power, Type I/II errors
- Karl Pearson (1900) — chi-squared test, early p-value formulation

Emerging / Frontier:
- E-values and always-valid inference (confidence sequences)
- Second-generation p-values (SGPVs) — interval null hypothesis approach
- Multi-armed bandits as alternatives to traditional hypothesis testing