How Long Should You Run an A/B Test? (The Honest Answer)

The short version: run for at least one full week and until you have enough samples per variant to detect a realistic effect, and never stop the moment the numbers look good. The long version, covering why each of those rules exists and the rare case where you can stop early, is what actually protects you from shipping a "winner" that quietly loses money.

New to experimentation? Start with what is A/B testing. If you want AI to handle the timing and statistics for you, see how to use AI in A/B testing.

The three things that actually determine test duration

"How long" is the wrong question on its own. Duration is downstream of three real constraints. A test is ready to read only when all three are satisfied — not when any one of them feels done.

Calendar time: at least one full business cycle (7 days), to average out day-of-week effects and let novelty effects decay.
Sample size: enough visitors per variant to detect the smallest lift you care about. See our guide on A/B testing sample size.
Statistical confidence: a significance threshold chosen before launch and reached under a stopping rule fixed in advance, not by peeking until it crosses.
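
To make the sample-size constraint concrete, here is a minimal sketch of the standard normal-approximation formula for comparing two conversion rates. The function name, the 5% significance level, and the 80% power default are assumptions chosen for illustration, not abTestBot's settings.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate, e.g. 0.05 for 5%
    relative_lift: smallest relative lift worth detecting, e.g. 0.10 for +10%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# A 5% baseline and a +10% relative lift needs roughly 31,000 visitors per variant.
print(sample_size_per_variant(0.05, 0.10))
```

Note how quickly the requirement grows as the target lift shrinks: halving the lift you want to detect roughly quadruples the sample you need.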

Why at least one full week

Your traffic is not uniform across a week. Tuesday-morning visitors differ from Saturday-evening visitors in intent, traffic source, and conversion baseline. A winner declared on day 3 of a test launched on a Monday is built entirely from weekday visitors and may genuinely lose on the weekend. Running for a full weekly cycle averages across the whole audience instead of a biased slice of it. This is one of the most commonly cited rules in CRO methodology: cover at least one complete business cycle, and if your buying cycle is longer (B2B often is), cover two.
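
One way to combine the calendar rule with a sample-size target is to round the planned duration up to whole weekly cycles. The sketch below assumes traffic is split evenly across variants; the function name and parameters are hypothetical, and rounding up to complete cycles is our reading of "cover a full business cycle", not a formal requirement.

```python
import math


def planned_duration_days(required_per_variant, daily_visitors, variants=2, cycle_days=7):
    """Days to run so the sample target is met AND only whole cycles are covered."""
    total_needed = required_per_variant * variants
    days_for_sample = math.ceil(total_needed / daily_visitors)
    # Never fewer than one cycle; otherwise round up to the next complete cycle.
    cycles = max(1, math.ceil(days_for_sample / cycle_days))
    return cycles * cycle_days


print(planned_duration_days(31231, daily_visitors=5000))   # 14
print(planned_duration_days(500, daily_visitors=10000))    # 7
```

For a longer buying cycle, pass a larger cycle_days (for example 14) so the floor and the rounding both follow that cycle.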

Why not to stop the instant a variant "wins"

Two traps sink early calls. The first is the novelty effect: returning visitors notice "something is different" and engage at an elevated rate for the first few exposures. That honeymoon decays over roughly 3–10 days. Stop at day 2 and you bake the honeymoon into your estimate, overstating the true steady-state lift — worst on highly visible changes like hero images, layouts, and new CTAs.

The second is peeking. Every time you check a running test and let yourself act on what you see, you give random noise another chance to cross your threshold. Auto-stopping at the first moment confidence hits 95% has been shown empirically to inflate the real false-positive rate well above the 5% you think you are accepting. The fix is to decide your stopping rule in advance, or to use a method built for sequential checking.
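
You can see the peeking problem for yourself with a small A/A simulation, where both arms share the same true conversion rate and any "winner" is a false positive by construction. This is an illustrative sketch: the number of peeks, the batch size, and the 10% rate are arbitrary assumptions, and the test used is a plain pooled two-proportion z-test.

```python
import math
import random


def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (0.0 if it cannot be computed)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se


def false_positive_rates(sims=1000, peeks=10, batch=200, rate=0.10, z_crit=1.96, seed=1):
    """Return (rate when stopping at the first crossing, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _sim in range(sims):
        conv_a = conv_b = n = 0
        crossed = False
        z = 0.0
        for _peek in range(peeks):
            # Both arms draw from the SAME true rate: there is no real winner.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            z = z_stat(conv_a, n, conv_b, n)
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and shipped here
        peeking_hits += crossed
        final_hits += abs(z) > z_crit
    return peeking_hits / sims, final_hits / sims


peeking, final = false_positive_rates()
print(f"stop at first crossing: {peeking:.1%}, single look at the end: {final:.1%}")
```

With these settings the single final look should land near the nominal 5%, while the stop-at-first-crossing rule should come out well above it. The exact figures depend on the seed and on how often you peek.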

The one time you can stop early

There is a legitimate escape hatch: when the win is overwhelming, extra days add almost no statistical value and cost you real conversions, because half your traffic is still seeing the loser. abTestBot's loops apply this override only when all of these hold at once: 99%+ confidence, ≥20% relative lift, the pessimistic (lower credible-interval) estimate still shows a 10%+ win, ≥500 visitors per variant, and at least 2 calendar days elapsed. If any one of those conditions fails, let the test run its full course.
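
The five conditions above translate directly into an all-or-nothing check. The sketch below is a hypothetical illustration of that rule, not abTestBot's actual code: the class, field names, and function name are invented here, and only the thresholds come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class TestSnapshot:
    confidence: float            # e.g. 0.993 for 99.3%
    relative_lift: float         # point estimate, e.g. 0.25 for +25%
    lift_lower_bound: float      # lower credible-interval bound on relative lift
    min_visitors_per_variant: int
    days_elapsed: int


def overwhelming_win(s: TestSnapshot) -> bool:
    """True only if every early-stop condition holds at the same time."""
    return (
        s.confidence >= 0.99
        and s.relative_lift >= 0.20
        and s.lift_lower_bound >= 0.10
        and s.min_visitors_per_variant >= 500
        and s.days_elapsed >= 2
    )


strong = TestSnapshot(0.995, 0.32, 0.14, 820, 3)
too_soon = TestSnapshot(0.995, 0.32, 0.14, 820, 1)
print(overwhelming_win(strong), overwhelming_win(too_soon))  # True False
```

The use of `and` throughout is the point: a huge lift on day 1, or 99% confidence on 200 visitors, does not qualify.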

How abTestBot handles duration for you

Rather than asking you to track all three constraints by hand, abTestBot's Continuous Loops enforce them automatically: a 7-day round floor, a 500-samples-per-arm minimum, a 95% probability-to-win threshold, and a positive lower credible bound before any winner is promoted — plus the overwhelming-win override above. For the statistics behind those defaults, read Bayesian vs frequentist testing and the Continuous Loops documentation.
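
For readers who want to see what a gate like this could look like, here is a rough sketch using a Beta-Binomial model and Monte Carlo sampling. It is an assumption-laden illustration, not abTestBot's implementation: the uniform Beta(1, 1) prior, the 95% credible interval used for the lower bound, and all function names are our own choices.

```python
import random


def posterior_summary(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Return (probability B beats A, 2.5% quantile of B's relative lift over A)."""
    rng = random.Random(seed)
    wins = 0
    lifts = []
    for _ in range(draws):
        # Posterior for each arm under a uniform Beta(1, 1) prior (an assumption).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
        lifts.append(rate_b / rate_a - 1)
    lifts.sort()
    return wins / draws, lifts[int(0.025 * draws)]


def ready_to_promote(conv_a, n_a, conv_b, n_b, days_elapsed):
    """Apply all four gates: time floor, sample floor, probability, lower bound."""
    prob_to_win, lift_lower = posterior_summary(conv_a, n_a, conv_b, n_b)
    return (
        days_elapsed >= 7
        and min(n_a, n_b) >= 500
        and prob_to_win >= 0.95
        and lift_lower > 0
    )


print(ready_to_promote(500, 10000, 650, 10000, days_elapsed=7))   # clear win after a full week
print(ready_to_promote(500, 10000, 650, 10000, days_elapsed=3))   # same data, too early
```

The second call returns False even though the data is identical, because the time floor is checked independently of the statistics.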

Frequently asked questions

What's the minimum time to run an A/B test?

Seven days, to cover a full weekly cycle — and only if you have also reached your required sample size per variant. Time alone is never sufficient.

Can I stop a test as soon as it hits 95% significance?

No — not if you've been peeking. Auto-stopping at the first crossing inflates your false positive rate. Decide the stopping rule up front, or use a sequential method designed for continuous monitoring.

How long is too long?

If a test runs for several weeks with no clear result, the realistic effect is probably smaller than your traffic can detect. That's a signal to test a bolder change or a higher-traffic page rather than waiting indefinitely.
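
A quick way to check whether waiting longer is worth it is to estimate the smallest lift your sample can realistically detect. The sketch below uses a common normal approximation that applies the baseline variance to both arms; the function name and the 5% significance and 80% power defaults are assumptions for illustration.

```python
import math
from statistics import NormalDist


def minimum_detectable_relative_lift(baseline_rate, visitors_per_variant,
                                     alpha=0.05, power=0.80):
    """Approximate smallest relative lift detectable with the given sample."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Approximation: both arms are assumed to share the baseline variance.
    abs_delta = z * math.sqrt(2 * baseline_rate * (1 - baseline_rate) / visitors_per_variant)
    return abs_delta / baseline_rate


# At a 5% baseline with 10,000 visitors per variant, the detectable lift is roughly 17%.
print(f"{minimum_detectable_relative_lift(0.05, 10000):.1%}")
```

If the change you are testing could plausibly move conversions by only a few percent, a number like that tells you the test is underpowered, and more weeks at the same traffic will help only slowly because the detectable lift shrinks with the square root of the sample.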

Let the math run itself

abTestBot enforces run-time, sample-size, and confidence thresholds automatically — so you ship winners that actually hold, not noise. Paste your URL to get started.
