
Frequentist vs. Bayesian Testing: A Learning Guide

What You're About to Understand

After working through this guide, you'll be able to choose the right statistical framework for an A/B test and explain why to a skeptical colleague. You'll spot the moment someone misinterprets a p-value (it happens constantly, even among PhDs), diagnose whether a "significant" result actually constitutes strong evidence, and know exactly when sequential testing saves you from wasting traffic — or from fooling yourself. You'll also have the vocabulary to engage with the frontier: e-values, the FDA's new Bayesian guidance, and why the supposedly bitter frequentist-Bayesian war is increasingly a false dichotomy.

The One Idea That Unlocks Everything

The courtroom analogy. Think of frequentist and Bayesian testing as two different legal systems for the same crime.

The frequentist system works like a strict procedural court: "Assuming the defendant is innocent, how surprising is the evidence?" If the evidence would be very unusual under innocence (p < 0.05), we reject innocence. But notice — this system never tells you the probability the defendant is guilty. It only tells you how weird the evidence looks if they're not.

The Bayesian system works like a court that also considers the defendant's background: "Given everything we knew before and the new evidence, what's the probability the defendant is guilty?" It gives you a direct answer to the question you actually want — but it requires you to state your prior beliefs on the record, and those beliefs affect the verdict.

Sequential testing is like a judge who checks in on the trial periodically and can call it early if the evidence is overwhelming — but must follow strict rules to avoid making premature judgments look legitimate.

If you remember only this: frequentists tell you P(evidence | innocence); Bayesians tell you P(guilt | evidence). These are inverse conditionals. Confusing them is the single most common statistical mistake in the world.


Learning Path

Step 1: The Foundation [Level 1]

Imagine you run an e-commerce site and test a new checkout button. After two weeks, your tool reports "p = 0.03." What does that actually mean?

The frequentist answer: "If the new button had zero real effect, there's only a 3% chance you'd see data this extreme." You set a threshold (usually 5%), and since 3% < 5%, you "reject the null hypothesis" and declare the button a winner. You also get a confidence interval — say, the conversion lift is between +0.5% and +3.2%. But that interval does not mean there's a 95% probability the true lift is in that range. It means: if you repeated this experiment forever, 95% of intervals built this way would contain the truth. This specific interval either does or doesn't — you just don't know which.
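That long-run reading of a confidence interval can be checked directly by simulation. Below is a minimal sketch (the true rate and sample sizes are illustrative, not figures from this guide) that builds many 95% Wald intervals for a known conversion rate and counts how often they contain it:

```python
import math
import random

def wald_ci(successes, n, z=1.96):
    """95% Wald confidence interval for a conversion rate."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

random.seed(0)
TRUE_RATE = 0.10   # the "truth" the procedure never gets to see
N = 1000           # visitors per experiment
RUNS = 4000        # independent repetitions of the experiment

covered = 0
for _ in range(RUNS):
    conversions = sum(random.random() < TRUE_RATE for _ in range(N))
    lo, hi = wald_ci(conversions, N)
    covered += (lo <= TRUE_RATE <= hi)

coverage = covered / RUNS
print(f"fraction of intervals containing the truth: {coverage:.3f}")
```

Any single interval either contains 0.10 or it doesn't; the "95%" describes the procedure across the 4,000 repetitions, which is exactly the distinction in the paragraph above.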

The Bayesian answer: You start with a prior belief about checkout button effects (maybe from dozens of previous tests: most changes do nothing, a few lift 1-5%). You combine that prior with your data via Bayes' theorem: posterior ∝ likelihood × prior. The result is a probability distribution over the true effect. Your tool might say: "92% probability that the new button is better, with a credible interval of +0.3% to +2.8%." That 92% means exactly what you think it means — given the data and your priors, there's a 92% chance this button actually works.
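The Bayesian calculation is short enough to sketch in full. This assumes a Beta-Binomial model with a flat Beta(1, 1) prior and made-up counts (the guide's 92% figure comes from different data and an informative prior):

```python
import random

random.seed(1)

# Hypothetical observed data: conversions / visitors per arm
a_conv, a_n = 200, 1000
b_conv, b_n = 260, 1000

# Conjugate update: Beta(1, 1) prior + binomial data -> Beta posterior
a_post = (1 + a_conv, 1 + a_n - a_conv)
b_post = (1 + b_conv, 1 + b_n - b_conv)

# Monte Carlo estimate of P(B's true rate > A's true rate | data)
DRAWS = 50_000
wins = sum(
    random.betavariate(*b_post) > random.betavariate(*a_post)
    for _ in range(DRAWS)
)
p_b_better = wins / DRAWS
print(f"P(B > A | data) = {p_b_better:.3f}")
```

The output is a direct probability statement about the parameter, which is precisely what a p-value cannot give you.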

Sequential testing enters when you can't resist checking results before the test ends (and be honest — you can't). In a standard frequentist test, every peek inflates your false positive rate. Peek 10 times and your "5% significance level" is actually ~19%. Sequential methods — SPRT, group sequential tests, mSPRT — let you look as often as you want while controlling the overall error rate.
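The ~19% figure is easy to reproduce. A sketch, assuming ten equally spaced looks at data generated under the null:

```python
import math
import random

random.seed(2)
LOOKS = 10       # interim analyses, equally spaced
PER_LOOK = 100   # observations added between looks
Z_CRIT = 1.96    # fixed two-sided threshold, nominal alpha = 0.05
RUNS = 4000

false_positives = 0
for _ in range(RUNS):
    total, n = 0.0, 0
    for _ in range(LOOKS):
        # New data under the null: standard normal, true mean zero.
        total += sum(random.gauss(0, 1) for _ in range(PER_LOOK))
        n += PER_LOOK
        # Declare a "winner" the first time the z-statistic crosses 1.96.
        if abs(total / math.sqrt(n)) > Z_CRIT:
            false_positives += 1
            break

fpr = false_positives / RUNS
print(f"false positive rate with {LOOKS} peeks: {fpr:.3f}")
```

Each individual look has a 5% false positive rate; it's the *first-crossing* event across ten looks that lands near 19%.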

The core mechanics:

|                        | Frequentist           | Bayesian               |
|------------------------|-----------------------|------------------------|
| Probability means      | Long-run frequency    | Degree of belief       |
| Parameters are         | Fixed but unknown     | Random variables       |
| Data is                | Random                | Fixed (once observed)  |
| You get                | P(data \| hypothesis) | P(hypothesis \| data)  |
| Prior needed?          | No                    | Yes                    |
| Stopping rule matters? | Yes, critically       | Theoretically no       |

Check your understanding:
1. A colleague says "We got p = 0.04, so there's a 96% chance the variant is better." What exactly is wrong with this statement, and what can you legitimately conclude?
2. Why does a confidence interval not give you a probability statement about the parameter, while a credible interval does?


Step 2: The Mechanism [Level 2]

The surface-level differences above aren't arbitrary quirks. They flow from a single philosophical fork: what is probability?

For frequentists, probability is the long-run frequency of an event in repeated trials. You can talk about the probability of getting heads because you can flip a coin a million times. But you cannot talk about the "probability" that a parameter equals 0.7 — the parameter is a fixed number, not something that varies across trials. This is why frequentists can never say "the probability the hypothesis is true." Hypotheses don't have frequencies.

For Bayesians, probability is a degree of belief. You can assign probability to anything uncertain — hypotheses, parameters, whether it will rain tomorrow. This lets Bayesians answer the question you actually want ("how likely is this hypothesis?"), but at a cost: you must specify what you believed before seeing data.

The Likelihood Principle — why stopping rules cause trouble

Here's a worked example that makes the problem visceral. Two researchers study the same coin and both observe 60 heads in 100 flips, but under different designs:

- Design A: flip exactly 100 times, then count heads (binomial sampling). One-sided p ≈ 0.028.
- Design B: flip until the 40th tail appears; it happens to land on flip 100 (negative binomial sampling). One-sided p ≈ 0.022.

The data is identical: 60 heads in 100 flips. But the p-values are different because the sample space — the set of outcomes you could have observed — differs between designs. Design A could have produced any number of heads from 0 to 100. Design B could have required any number of flips from 40 to infinity. Frequentist inference depends on data you never observed.

Bayesian inference obeys the likelihood principle: the likelihood function is the same in both designs, so the posterior is identical. The reason you stopped doesn't matter — only the data you actually saw.
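The two tail probabilities can be computed exactly with binomial sums. A sketch, assuming one common textbook version of the contrast: Design A fixes n = 100 in advance, while Design B flips until the 40th tail appears (which happens to land on flip 100):

```python
from math import comb

HEADS, FLIPS = 60, 100

# Design A: n = 100 fixed in advance (binomial sampling).
# One-sided p-value: P(>= 60 heads in 100 fair flips).
p_fixed_n = sum(comb(FLIPS, k) for k in range(HEADS, FLIPS + 1)) / 2**FLIPS

# Design B: flip until the 40th tail (negative binomial sampling);
# the 40th tail landed on flip 100. "As or more extreme" now means
# needing >= 100 flips, i.e. at most 39 tails in the first 99 flips,
# i.e. at least 60 heads in the first 99 flips.
p_stop_at_tails = sum(comb(FLIPS - 1, k) for k in range(HEADS, FLIPS)) / 2**(FLIPS - 1)

print(f"fixed-n p-value:       {p_fixed_n:.4f}")        # ~0.028
print(f"stop-at-tails p-value: {p_stop_at_tails:.4f}")  # ~0.022
```

Same 60 heads, same 100 flips, different p-values — while the likelihood function, and hence any Bayesian posterior, is identical in both designs.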

Key Insight: This is exactly why peeking is a problem for frequentists. When you peek and stop early, you've changed the stopping rule, which changes the sample space, which changes the p-value. The data hasn't changed, but the frequentist's answer has.

The peeking problem — a quantitative look:

Under the null hypothesis, the test statistic follows a random walk. A random walk will eventually cross any fixed boundary given enough time. So if you keep checking at α = 0.05, the false positive rate compounds: roughly 8% after 2 looks, ~14% after 5, ~19% after 10, climbing toward 100% as the number of looks grows.

This is mathematically inevitable — it follows from the optional stopping theorem for martingales.

How sequential testing fixes it:

The fix is elegantly simple: make the boundary move. Instead of a fixed threshold (p < 0.05 every time you look), use a threshold that becomes harder to cross at early looks and relaxes toward the end. This is called an alpha spending function.

Two major approaches:
- O'Brien-Fleming: Very strict early (practically impossible to stop at first look), relaxes to nearly 0.05 at the final look. Preserves almost all the power of a fixed-sample test.
- Pocock: Same threshold at every look (~p < 0.016 with 5 analyses). Easier to stop early, but less power at the end.
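Both boundary shapes can be calibrated by simulation rather than looked up in a table. The sketch below (an illustration, not a production group-sequential library) finds, for five equally spaced looks, the constant c that makes the overall false positive rate 5%, for a flat Pocock-style boundary and an O'Brien-Fleming-style boundary proportional to 1/√t:

```python
import math
import random

random.seed(3)
K = 5            # equally spaced looks
RUNS = 20_000

# Pre-simulate the z-statistic at each look under the null:
# S_k is a cumulative sum of k iid N(0,1) increments, z_k = S_k / sqrt(k).
paths = []
for _ in range(RUNS):
    s, zs = 0.0, []
    for k in range(1, K + 1):
        s += random.gauss(0, 1)
        zs.append(s / math.sqrt(k))
    paths.append(zs)

def crossing_prob(boundary):
    """Estimate P(|z_k| exceeds boundary[k] at any look)."""
    hits = sum(any(abs(z) > b for z, b in zip(zs, boundary)) for zs in paths)
    return hits / RUNS

def calibrate(shape, target=0.05):
    """Bisect on c so the boundary c * shape has overall size target."""
    lo, hi = 1.0, 6.0
    for _ in range(25):
        c = (lo + hi) / 2
        if crossing_prob([c * s for s in shape]) > target:
            lo = c   # boundary too easy to cross: raise it
        else:
            hi = c
    return (lo + hi) / 2

pocock_c = calibrate([1.0] * K)                                 # flat boundary
obf_c = calibrate([math.sqrt(K / k) for k in range(1, K + 1)])  # strict early

print(f"Pocock:          z = {pocock_c:.2f} at every look")
print(f"O'Brien-Fleming: z = {obf_c:.2f} at the final look")
```

The Pocock constant of roughly z ≈ 2.41 matches the "~p < 0.016 per look" quoted above, while the O'Brien-Fleming boundary starts near z ≈ 4.6 at the first look (practically uncrossable) and relaxes to roughly 2.04 by the last.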

Abraham Wald's Sequential Probability Ratio Test (SPRT), developed during WWII, proved something remarkable: it is provably optimal — no other sequential test achieves the same error rates with a smaller expected sample size. It reduces the average sample needed by 40-50%.
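Wald's test itself fits in a few lines. A sketch for a Bernoulli metric (the rates are hypothetical; the thresholds use Wald's classic approximations A ≈ (1 − β)/α and B ≈ β/(1 − α)):

```python
import math
import random

def sprt(stream, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: rate = p0 vs H1: rate = p1 on a Bernoulli stream.
    Returns (decision, number of observations used)."""
    upper = math.log((1 - beta) / alpha)   # accept H1 above this
    lower = math.log(beta / (1 - alpha))   # accept H0 below this
    llr, n = 0.0, 0
    for x in stream:
        n += 1
        # Log-likelihood ratio increment for a single observation.
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n

def bernoulli(p):
    """Endless stream of Bernoulli(p) observations."""
    while True:
        yield random.random() < p

random.seed(4)
# Data generated with a true rate of 0.12, testing 0.10 vs 0.12.
decision, n_used = sprt(bernoulli(0.12), p0=0.10, p1=0.12)
print(decision, "after", n_used, "observations")
```

The sample size is a random variable here: easy cases stop quickly, borderline cases run long, and on average the test needs far fewer observations than a fixed-n design with the same error rates.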

Check your understanding:
1. Two researchers analyze the same dataset. Researcher A planned to collect exactly 200 observations. Researcher B planned to stop when they got a significant result. Under frequentist analysis, can they get different p-values from identical data? Under Bayesian analysis? Why?
2. Why does the O'Brien-Fleming spending function preserve more power than the Pocock boundary?


Step 3: The Hard Parts [Level 3]

Here's where the clean narrative breaks down.

"Bayesian testing is immune to peeking" — the myth that won't die

Many A/B testing platforms market Bayesian methods as "peek-proof." The logic sounds airtight: the likelihood principle says stopping rules don't affect inference, so Bayesian posteriors are valid no matter when you stop.

But there's a critical distinction between inference and decisions. David Robinson's simulations showed that a Bayesian procedure with daily peeking — stopping when P(variant B is better | data) > 0.95 — inflated the false positive rate from 2.5% to 11.8%. A 4x increase.

Why? Because the posterior probability is valid at every moment. But a decision rule like "stop when posterior > 0.95" creates multiple chances to cross the threshold by luck, just like frequentist peeking. The posterior being correct doesn't prevent the decision from being wrong more often than expected.

Key Insight: With uninformative priors, Bayesian decision procedures are just as vulnerable to peeking as frequentist ones. Informative priors help (they're skeptical of large effects, providing a natural brake) but don't eliminate the problem. The platform marketing is technically accurate and practically misleading.
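A Robinson-style simulation is easy to sketch. This is not his exact setup — the traffic numbers are illustrative, the prior is flat, and the posterior is normal-approximated — but it shows the mechanism: under the null, the rule "stop when P(B > A | data) > 0.95" fires far more often than 5%.

```python
import math
import random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

random.seed(5)
DAYS = 20        # one peek per day
PER_DAY = 200    # visitors per arm per day
RATE = 0.10      # both arms identical: the null is true
RUNS = 1000

stopped_for_b = 0
for _ in range(RUNS):
    a_conv = b_conv = n = 0
    for _ in range(DAYS):
        a_conv += sum(random.random() < RATE for _ in range(PER_DAY))
        b_conv += sum(random.random() < RATE for _ in range(PER_DAY))
        n += PER_DAY
        pa, pb = a_conv / n, b_conv / n
        se = math.sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)
        # With a flat prior, the posterior for the lift is roughly
        # N(pb - pa, se^2), so P(B > A | data) ~ phi((pb - pa) / se).
        if se > 0 and phi((pb - pa) / se) > 0.95:
            stopped_for_b += 1
            break

fpr = stopped_for_b / RUNS
print(f"fraction of null tests that 'shipped' B: {fpr:.3f}")
```

The posterior is perfectly valid at each peek; it is the repeated thresholding that inflates the error rate.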

Lindley's Paradox — when the frameworks give opposite answers

With a large enough sample, you can get a statistically significant p-value (p < 0.05) while the Bayes Factor strongly favors the null. At n = 10,000, a p-value of 0.049 might correspond to a Bayes Factor of 5:1 in favor of H₀.

Why? The p-value keeps α fixed regardless of sample size. But the Bayes Factor penalizes the alternative hypothesis more as n grows — if you've collected 10,000 observations and the effect is barely significant, that's actually evidence that the effect, if it exists, is too small to matter. The Bayesian approach naturally embeds this logic; the frequentist approach does not.
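The divergence can be computed directly. Assuming a point null against a normal unit-information prior on the effect (one common convention; the exact numbers depend on the prior scale, which is why the guide's 5:1 figure differs), the Bayes Factor for H₀ at a fixed z = 1.96 grows with n:

```python
import math

def bf01(z, n, tau=1.0):
    """Bayes Factor in favor of H0 for a z-statistic at sample size n:
    point null vs. effect ~ N(0, tau^2) under H1 (normal-normal model)."""
    shrink = n * tau**2 / (1 + n * tau**2)
    return math.sqrt(1 + n * tau**2) * math.exp(-0.5 * z**2 * shrink)

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}:  BF01 = {bf01(1.96, n):6.1f}")
```

Same "significant" z-score at every row, yet mounting evidence *for* the null as n grows — Lindley's paradox in numbers.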

The "Frankenstein" problem in practice

The textbook version of frequentist testing that most people use was never endorsed by any of its founders. Fisher developed p-values as continuous measures of evidence — he never intended a fixed 0.05 cutoff. Neyman and Pearson developed the reject/accept framework — but they didn't use p-values. The hybrid (use p-values as evidence, within a reject/accept framework, at 0.05) was cobbled together by textbook authors, and both Fisher and Neyman would have rejected it.

This means most practitioners are using a framework with no coherent philosophical foundation. They're treating p-values as measures of evidence (Fisher's idea) while using binary thresholds (Neyman-Pearson's idea) — combining the limitations of both approaches with the strengths of neither.

The e-value revolution

A new framework called "testing by betting" (Shafer, Vovk, Ramdas, 2019-2025) reframes hypothesis testing as a gambling game. You bet against the null hypothesis; if you get rich, the null is probably false. E-values (evidence values) replace p-values and provide anytime-valid inference — like Bayesian methods — with error rate control — like frequentist methods. This could be the framework that genuinely unifies the approaches, but it's still early: it works well for simple models but hasn't yet scaled to complex, high-dimensional problems.

Check your understanding:
1. A platform claims its Bayesian testing is "immune to peeking." You know the platform uses uninformative priors and stops when P(B > A) > 0.95. Should you believe the claim? What specific question would you ask?
2. Why does Lindley's paradox become more severe as sample size increases?


The Mental Models Worth Keeping

1. Inverse Conditionals
P(data | hypothesis) and P(hypothesis | data) are completely different quantities. Confusing them is called the "prosecutor's fallacy" and is the root cause of most statistical misinterpretation. Example: When someone says "p = 0.05 means 95% chance it's real," they've flipped the conditional.

2. The Prior as a Brake
A prior isn't just a "belief" — it's a regularizer. Strong priors from previous experiments act as a brake against fluky data; in fact, L2 regularization in machine learning is mathematically equivalent to placing a Gaussian prior on the weights. Example: If 50 previous checkout tests showed effects between -1% and +2%, a prior encoding this range protects you from declaring a 15% lift based on one noisy test.
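In the normal-normal model the brake has a closed form: the posterior mean is a precision-weighted average of the prior mean and the observed lift. A sketch with made-up numbers in the spirit of the example:

```python
# Normal-normal conjugate update. All figures are illustrative:
# a prior distilled from past tests, and one small, noisy test.
prior_mean, prior_sd = 0.5, 1.0    # percent lift, from historical tests
obs_lift, obs_sd = 15.0, 6.0       # one fluky observation

prior_prec = 1 / prior_sd**2
obs_prec = 1 / obs_sd**2

# Precision-weighted average: the noisier source gets less weight.
post_mean = (prior_prec * prior_mean + obs_prec * obs_lift) / (prior_prec + obs_prec)
post_sd = (prior_prec + obs_prec) ** -0.5

print(f"posterior lift: {post_mean:.2f}% +/- {post_sd:.2f}%")
# prints "posterior lift: 0.89% +/- 0.99%"
```

The "15% lift" collapses to under 1% once the prior gets its say — the brake in action.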

3. Alpha Spending as a Moving Goalpost
Sequential testing works by making significance harder to achieve early and easier late. The total "alpha budget" is 0.05, and you "spend" portions at each look. Thinking of significance as a finite resource you allocate over time makes the whole framework intuitive. Example: O'Brien-Fleming spends almost nothing early (saving power for the final analysis); Pocock spends equally at each look (maximizing early-stopping probability).

4. The Convergence Principle
With large samples and weak priors, frequentist and Bayesian answers converge (Bernstein-von Mises theorem). The practical implication: the choice between paradigms matters most when data is scarce and prior knowledge is available. With n > 100-200 and flat priors, you'll get nearly identical conclusions either way. Example: For a high-traffic site running tests with millions of users, the framework choice is about communication preference, not statistical accuracy.

5. Procedure vs. Inference
Frequentist results describe properties of the testing procedure (long-run error rates). Bayesian results describe the parameter (posterior probability). Knowing which type of answer you need dictates which approach to use. Example: A regulator who needs to guarantee "no more than 5% of approved drugs are ineffective" needs a procedure property (frequentist). A product manager who needs "what's the probability this variant is better?" needs a parameter statement (Bayesian).


What Most People Get Wrong

1. "P < 0.05 means 95% sure the effect is real"
Why people believe it: Because it's the natural, intuitive reading. What's actually true: The p-value is P(data | H₀), not P(H₀ | data). A p-value of 0.05 corresponds to a minimum Bayes Factor of only ~2.5 — the data is just 2.5x more likely under the alternative than the null. That's barely worth mentioning on the Jeffreys scale. How to spot it: Whenever someone converts a p-value directly into a confidence percentage, they've committed the inverse probability fallacy.

2. "Bayesian A/B testing lets you peek freely"
Why people believe it: Platform marketing and a technically correct appeal to the likelihood principle. What's actually true: Bayesian inference is valid under optional stopping, but Bayesian decision rules (stop when posterior > threshold) inflate false positive rates — Robinson showed a 4x inflation with daily peeking and uninformative priors. How to spot it: Ask whether the tool uses informative priors and what false positive rate simulations show under their specific stopping rule.

3. "Bayesian = subjective, Frequentist = objective"
Why people believe it: Priors look like opinions; test statistics look like math. What's actually true: Frequentist analysis requires subjective choices at every step — which test, what significance level, how to handle outliers, when to stop collecting. These "researcher degrees of freedom" affect results as much as any prior. Gelman calls frequentist methods "a desperate attempt to perform an inherently subjective task in an apparently objective way." How to spot it: Count the researcher choices in an "objective" frequentist analysis. There are always more than you'd expect.

4. "A non-significant result means no effect"
Why people believe it: The binary reject/don't-reject language suggests it. What's actually true: Absence of evidence is not evidence of absence. A non-significant result could mean the effect doesn't exist, OR that your study was underpowered. Bayesian methods with ROPE (Region of Practical Equivalence) or Bayes Factors can explicitly provide evidence for the null — frequentist testing cannot. How to spot it: When someone says "we found no effect," ask about statistical power and what the confidence interval looks like.

5. "The 0.05 threshold has scientific justification"
Why people believe it: It's universal, it's in every textbook, it must be principled. What's actually true: Fisher chose 0.05 for computational convenience in 1925 when statistical tables were calculated by hand. He later said it should be flexible, not a rigid threshold. The current universality of 0.05 is a Schelling point — everyone uses it because everyone else uses it, not because of any deep statistical reasoning.


The 5 Whys — Root Causes Worth Knowing

Chain 1: "P-values are widely misinterpreted"
Claim → People confuse P(data|H₀) with P(H₀|data) → Because human cognition naturally thinks about hypothesis probability, not data probability → Because evolutionary decision-making concerns states of the world, not sampling distributions → Because the frequentist framework was designed for repeated-sampling contexts (agriculture) where long-run procedure properties matter → Because Fisher was an agricultural scientist and the framework fit his domain → Root insight: A context-specific tool was codified as a universal method, and institutional inertia prevented adaptation.
Level 2 deep: It became self-perpetuating through textbooks.
Level 3 deep: The tools became methodology-defining rather than problem-driven.

Chain 2: "Sequential testing requires alpha spending"
Claim → Multiple looks inflate Type I error → Each look is an opportunity to cross the threshold by chance → Under the null, the test statistic follows a random walk → A random walk will eventually cross any fixed boundary (optional stopping theorem) → Noise accumulates as √n while a fixed boundary stays constant → Root insight: Alpha spending works by making the boundary a function of time, rising faster than noise, so the probability of ever crossing under H₀ is exactly α.
Level 2 deep: O'Brien-Fleming vs. Pocock represent different tradeoffs between early-stopping probability and final-analysis power.
Level 3 deep: The optimal choice depends on the cost of continuing vs. the value of certainty.

Chain 3: "The replication crisis is caused by p-value misuse"
Claim → P-hacking and selective reporting inflate false positives → Researchers have many "degrees of freedom" in analysis → Publication incentives reward significant findings → Journals and tenure committees use significance as a filter → Root insight: The incentive structure conflates discovery (finding something new) with certification (proving it's real). Switching to Bayesian methods doesn't fix this — "b-hacking" is just as possible. The real fix is procedural: pre-registration and transparency.
Level 2 deep: Pre-registration hasn't fully solved it because exploratory research shouldn't be pre-registered, and the confirmatory/exploratory boundary is fuzzy.
Level 3 deep: Analysis decisions can remain undisclosed even in pre-registered studies.


The Numbers That Matter

p = 0.05 corresponds to a Bayes Factor of only ~2.5. Most scientists interpret this as "strong evidence." On the Jeffreys scale, it's "not worth more than a bare mention." This single calibration fact explains much of the replication crisis — we've been using a threshold that counts weak evidence as convincing.

p = 0.005 corresponds to a Bayes Factor of ~14-26. This is why Benjamin et al. (2017) proposed redefining significance at 0.005. The evidence goes from "anecdotal" to "strong." The cost: dramatically reduced power unless you increase sample sizes.
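Both calibration figures come from the same bound (Sellke et al., 2001): the Bayes Factor against the null implied by a p-value is at most 1/(−e · p · ln p), valid for p < 1/e:

```python
import math

def min_bayes_factor_bound(p):
    """Sellke-Bayarri-Berger upper bound on the Bayes Factor against H0
    implied by a p-value (valid for p < 1/e)."""
    return 1 / (-math.e * p * math.log(p))

print(f"p = 0.05  -> BF against H0 at most {min_bayes_factor_bound(0.05):.1f}")
print(f"p = 0.005 -> BF against H0 at most {min_bayes_factor_bound(0.005):.1f}")
```

The bound gives ~2.5 at p = 0.05 and ~14 at p = 0.005 — exactly the calibration figures above. Real Bayes Factors are usually weaker than this best-case bound.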

Peeking at 10 interim analyses inflates your false positive rate from 5% to ~19%. That's nearly 1 in 5 tests declaring a winner when there is none. This is why every major experimentation platform has invested in sequential testing methods.

SPRT reduces average sample size by 40-50% compared to fixed-sample tests at the same error rates. To put that in perspective: if your test would normally need 100,000 visitors, SPRT gets the same answer with 50,000-60,000 on average. That's weeks of testing time saved.

With n > 100-200 and flat priors, Bayesian and frequentist results nearly converge. The Bernstein-von Mises theorem guarantees this. If you're running tests with millions of users, the paradigm choice is about stakeholder communication, not statistical accuracy.

Robinson's Bayesian peeking simulation: false positives went from 2.5% to 11.8%. A 4x inflation — devastating for anyone who believed "Bayesian testing is immune to peeking." With diffuse priors, the inflation approaches frequentist levels.

GST achieves ~90% power vs. mSPRT's ~71-77% (Spotify's 100,000-simulation study). But GST requires knowing the expected sample size in advance. The core tradeoff: power vs. flexibility.

If 90% of tested hypotheses are null (common in exploratory research) and you test 1,000 at α = 0.05 with 80% power, 36% of your "significant" findings are false positives. Not 5% — 36%. The base rate of true effects matters enormously, and p-values don't account for it.


Where Smart People Disagree

Should we abandon p-values entirely?
Wasserstein, Schirm & Lazar (2019) argue yes: p-values encourage binary thinking, are routinely misinterpreted, and don't answer the question researchers want. Deborah Mayo (error statistics) argues the problem is misuse, not the tool. Gelman proposes moving past both p-values and Bayes factors toward estimation and model checking. Unresolved because there's no agreement on what should replace p-values — Bayes Factors have their own problems, and "just report effect sizes" leaves decision-makers without a decision framework.

Is Bayesian A/B testing immune to peeking?
VWO and AB Tasty say yes (in marketing materials). Robinson and Analytics Toolkit say no (with simulation evidence). The technical answer is nuanced: Bayesian inference is valid regardless of stopping, but Bayesian decision rules are not. This hasn't been resolved because the two sides define "immune" differently, and the practical impact depends heavily on prior strength — a variable that differs across implementations.

What should the significance threshold be?
Benjamin et al. (2017): α = 0.005, based on p-to-Bayes-Factor calibration. Lakens et al. (2018): "Justify Your Alpha" — no single threshold fits all contexts. Bayesians: the question is misguided; report continuous evidence. Unresolved because each position optimizes for a different value — standardization, context-sensitivity, or philosophical coherence — and there's no consensus on which value matters most.

Are Bayesian methods truly subjective?
The classical view: yes, priors encode personal beliefs. Gelman's pragmatic view: priors are structural modeling choices, not beliefs, and should be validated with frequentist properties (calibration, coverage). De Finetti's school: subjectivity is a feature, not a bug — the only coherent framework for uncertainty. This remains unresolved because it's partly a philosophical question about the meaning of probability, which isn't empirically testable.


What You Don't Know Yet (And That's OK)

E-values and testing by betting. This emerging framework (Shafer, Vovk, Ramdas) could unify sequential, Bayesian, and frequentist ideas — but it's still maturing and limited to simple statistical models. You know it exists; you don't yet know how to use it.

Optimal prior specification at scale. No one has solved how to automatically choose good priors across thousands of A/B tests. Empirical Bayes (learning priors from historical data) is promising but requires sufficient experimental history. This is an active research frontier.

How sequential methods interact with variance reduction. Techniques like CUPED and stratification are standard in modern experimentation, but their interaction with sequential stopping rules is still being worked out (Spotify published on this in 2023, but the area is evolving).

The full implications of the FDA's 2026 Bayesian guidance. This could reshape pharmaceutical statistics, but the guidance is still in draft. Whether it actually shifts practice from frequentist dominance remains to be seen.

Bias correction at sequential stopping times. When you stop a sequential test early because the effect looks large, your point estimate is biased upward (you stopped partly because the noise was favorable). Methods to correct this exist but add complexity, and their practical importance is debated.


Subtopics to Explore Next

1. Sequential Testing Methods (SPRT, mSPRT, Group Sequential Tests)
Why it's worth it: This is the practical skill that immediately improves how you run experiments — understanding which sequential method to use and why.
Start with: Spotify's engineering blog post "Choosing a Sequential Testing Framework" for an excellent comparison with real simulation data.
Estimated depth: Medium (half day)

2. Bayes Factors and Evidence Calibration
Why it's worth it: Understanding the Jeffreys scale and p-to-BF calibration gives you a rigorous way to evaluate how strong evidence really is, regardless of which paradigm you use.
Start with: The Kass & Raftery (1995) framework, then explore Sellke et al. (2001) for the p-value calibration.
Estimated depth: Medium (half day)

3. Prior Specification for A/B Testing
Why it's worth it: The prior is the single biggest practical challenge in Bayesian testing — learning to set good priors is what separates useful Bayesian analysis from theater.
Start with: "What makes a good prior for A/B testing?" — explore informative vs. weakly informative vs. reference priors with concrete examples.
Estimated depth: Medium (half day)

4. The Replication Crisis and Pre-Registration
Why it's worth it: Understanding why so many published findings are false sharpens your ability to design trustworthy experiments in any paradigm.
Start with: Ioannidis (2005) "Why Most Published Research Findings Are False" and the ASA's 2016 p-value statement.
Estimated depth: Surface (1-2 hours)

5. ROPE (Region of Practical Equivalence)
Why it's worth it: ROPE lets you test for the null — proving an effect is negligibly small, not just "not significant." This is critical for optimization programs where you need to know when to stop iterating.
Start with: The bayestestR ROPE vignette for practical implementation.
Estimated depth: Surface (1-2 hours)

6. Alpha Spending Functions and Boundary Design
Why it's worth it: Mastering O'Brien-Fleming vs. Pocock vs. custom spending functions lets you design sequential tests that match your business tradeoffs (early decisions vs. final-analysis power).
Start with: Penn State STAT 509, Lesson 9.6 on alpha spending, then the Lan-DeMets approach.
Estimated depth: Medium (half day)

7. E-Values and Anytime-Valid Inference
Why it's worth it: This is the frontier that may resolve the frequentist-Bayesian divide for sequential testing — understanding it positions you ahead of most practitioners.
Start with: The Ramdas et al. (2023) tutorial on e-values and game-theoretic statistics.
Estimated depth: Deep (multi-day)

8. Bayesian Methods in Clinical Trials (FDA 2026 Guidance)
Why it's worth it: The FDA guidance represents a seismic shift in how the most regulated industry in the world evaluates evidence — understanding it reveals where statistical practice is heading.
Start with: The FDA draft guidance itself (Jan 2026) and the arXiv commentary (2601.14701v1).
Estimated depth: Deep (multi-day)


Sources Used in This Research

Primary Research:
- ASA Statement on P-Values (Wasserstein & Lazar, 2016) — foundational critique of p-value practice
- Always Valid Inference (Johari et al., 2015) — bringing sequential analysis to A/B testing
- History and Nature of the Jeffreys-Lindley Paradox (PMC, 2024)
- Optimal E-value Testing (arXiv, 2024)
- FDA Draft Guidance on Bayesian Methodology in Clinical Trials (Jan 2026)
- Regulatory Expectations for Bayesian Methods: FDA's 2026 Guidance (arXiv, 2026)
- Why Optional Stopping Can Be a Problem for Bayesians (PMC, 2021)
- Efficiency in Sequential Testing: SPRT vs Bayes Factor Test (Springer, 2021)
- Closed-Form Power and Sample Size Calculations for Bayes Factors (American Statistician, 2025)
- Testing Fisher, Neyman, Pearson, and Bayes (Christensen)

Expert Commentary:
- "Bayesians are Frequentists" — Andrew Gelman (2024)
- Philosophy and the Practice of Bayesian Statistics — Gelman & Shalizi
- "Is Bayesian A/B Testing Immune to Peeking? Not Exactly" — David Robinson
- "Bayesian AB Testing is Not Immune to Optional Stopping Issues" — Analytics Toolkit (2017)
- "Could Fisher, Jeffreys and Neyman Have Agreed?" — Berger
- Choosing a Sequential Testing Framework — Spotify Engineering (2023)
- Sequential A/B Testing Keeps the World Streaming — Netflix (Parts 1 & 2)
- "To Be a Frequentist or Bayesian? Five Positions in a Spectrum" — HDSR (2024)
- The Bet Test: Spotting Problems in Bayesian Analysis — Eppo

Good Journalism:
- Peeking at A/B Tests: Continuous Monitoring Without Pain — The Morning Paper (2017)
- Bayesian vs Frequentist A/B Testing — CXL
- Comparing Frequentist vs. Bayesian vs. Sequential — Eppo
- Bayesian vs Frequentist — Statsig

Reference:
- Foundations of Statistics, Bayes Factor, Frequentist Inference, Likelihood Principle, Credible Interval — Wikipedia
- Jeffreys' Scale — Statlect
- Alpha Spending Function — Penn State STAT 509
- ROPE — bayestestR documentation
- Sequential Testing — GrowthBook documentation

Textbooks Referenced (not fetched):
- Jeffreys, H. (1939). Theory of Probability
- Wald, A. (1947). Sequential Analysis
- Fisher, R.A. (1925). Statistical Methods for Research Workers
- Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis
- Gelman, A. et al. (2013). Bayesian Data Analysis (3rd ed.)