How to Measure GEO Performance: The Weekly Operator's Playbook for 2026

Hugo Debrabandere

Co-founder · Clairon

Apr 28, 2026

Most teams open their citation-share dashboard on Monday morning, spot a 4-point swing, panic, and ship a content sprint. Half of those swings are statistical phantoms. They would have reverted by Friday whether the team intervened or not. The other half are real signals. Telling them apart is what separates the measurement practices that hold up in a CMO review from the ones that get cut at the next budget cycle.

This is the operational deep-dive that the pillar article on GEO tools and analytics gestures at without unpacking. The pillar covers the four metrics. This piece covers the practice: the prompt set, the sampling rule, the noise floor, and the 90-minute Monday loop you can defend to your CFO.

  • 30: minimum runs per prompt per engine for a 95% CI to detect a 10-point SOV change (MaximusLabs, 2026)
  • 31%: median answer churn between two ChatGPT runs of the same prompt within 48 hours (Clairon, Q1 2026)
  • 73%: share of brand-presence cases that involve a ghost citation, i.e. your domain cited without your brand named (Superlines, 2026)

Why your weekly number is probably statistical noise

The single biggest mistake we see in 2026 GEO measurement: treating a single weekly citation-share number as a signal. It almost never is. MaximusLabs ran the math: at 30 runs of a single prompt, a 30% observed citation share has a 95% confidence interval of roughly [13.6%, 46.4%]. That is a thirty-three-point window. A 4-point swing inside that bracket means literally nothing.
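
You can reproduce that bracket with the standard normal-approximation interval for a binomial proportion. The sketch below is a sanity check of the arithmetic, not necessarily MaximusLabs' exact method; the function name is illustrative.

python
# 95% confidence interval for an observed citation share (normal approximation).
from math import sqrt

def citation_share_ci(share, runs, z=1.96):
    se = sqrt(share * (1 - share) / runs)   # standard error of a proportion
    return share - z * se, share + z * se

low, high = citation_share_ci(share=0.30, runs=30)
print(f"{low:.1%} to {high:.1%}")           # ≈ 13.6% to 46.4%, a ~33-point window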

The polling industry solved this problem in 1936 (Gallup, after correctly predicting Roosevelt's win). The solution is the same: take a wide sample, declare a margin of error, and gate every action behind it. Most GEO teams skip this step entirely, which is why roughly 70% of their weekly Slack panics turn out to be false positives.

Three failures the SERP under-answers

  • Sample size theatre. Most articles tell you to run "20 to 30 prompts" without specifying how many runs per prompt, which is the only number that controls the CI. Twenty prompts at one run each is statistically meaningless.
  • Branded blind spot. Almost every measurement tutorial on the SERP defaults to 100% unbranded prompts. Without a 20% branded slice, you cannot detect hallucinations, sentiment drift, or ghost citations, which are the most damaging failure modes.
  • No decision rule. Articles ship dashboards without telling you when to act. Without a noise floor, every Monday becomes a debate. Sample sizes get bigger and bigger to silence the noise instead of capping the action threshold.

The 200-prompt set, exactly

Below is the prescriptive recipe. Paste it, fill it, run it. The 200 number is not arbitrary: it is the smallest set that survives the 30-sample CI requirement, covers the funnel, and keeps API and scraping costs under roughly $500 a month.

Why 200 (not 50, not 1,000)

At 30 runs per prompt per engine: 200 prompts × 6 engines × 30 runs = 36,000 monthly samples. Below 100 prompts, coverage fails: your prompt library cannot reflect the 11 unbranded query categories most B2B SaaS verticals span. Above 300, the marginal prompt teaches you nothing new, the variance floor flattens, and you waste budget on confirming what you already measured.
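
The arithmetic behind those numbers, plus the per-sample cost ceiling that the roughly $500 monthly budget implies, fits in a few lines; the variable names are just for illustration.

python
# Monthly sample volume from the recipe above and the cost ceiling it implies.
prompts, engines, runs = 200, 6, 30
samples_per_month = prompts * engines * runs       # 36,000 samples
cost_ceiling_per_sample = 500 / samples_per_month  # ≈ $0.014 per sampled answer
print(samples_per_month, round(cost_ceiling_per_sample, 4))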

Category quotas

The percentages are the part to defend. Adjust the absolute counts if you have less budget, never the ratios.

Category quotas, 200 prompts
Category | Count | Share | Example shape
Category-defining ("what is X") | 20 | 10% | What is generative engine optimization?
Comparison ("X vs Y", "best X for Y") | 50 | 25% | Best GEO tool for B2B SaaS
Problem-aware ("how do I…") | 50 | 25% | How do I measure my brand's ChatGPT citation rate?
Solution-aware ("tools / platforms for…") | 40 | 20% | Tools to track Perplexity citations
Branded / defensive | 20 | 10% | Is Clairon worth $49 a month?
Long-tail jobs-to-be-done | 20 | 10% | How to set up a weekly AI visibility report for a 10-person startup
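
If the prompt library lives in a script or a sheet export, a small guard like the sketch below keeps the ratios honest when someone edits the set. The category keys are illustrative labels, not any tool's schema.

python
# Category quotas as data, with a guard that the counts sum to 200.
QUOTAS = {
    "category_defining": 20,    # "what is X"
    "comparison": 50,           # "X vs Y", "best X for Y"
    "problem_aware": 50,        # "how do I…"
    "solution_aware": 40,       # "tools / platforms for…"
    "branded_defensive": 20,
    "long_tail_jtbd": 20,
}
assert sum(QUOTAS.values()) == 200
shares = {k: round(v / 200, 2) for k, v in QUOTAS.items()}  # 0.10, 0.25, 0.25, 0.20, 0.10, 0.10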

Intent split (orthogonal to category)

  • Informational: 50%. Wide funnel, where citation share is most volatile. The slice that tells you whether your content is shaped right.
  • Commercial investigation: 35%. Where conversion happens. The slice you defend against competitors and the one that maps to pipeline.
  • Transactional: 15%. Pricing, sign-up, "is X worth it" queries. Small but high-intent. Most teams skip this slice and miss their highest-conversion engine surfaces.

Branded versus unbranded ratio

80/20, unbranded to branded. The 20% branded slice catches the three failure modes the SERP misses:

  1. Hallucinations. ChatGPT invents pricing, integrations, or features for your product. Detected only by running branded prompts and grading factual accuracy.
  2. Sentiment drift. A model starts caveating your brand or comparing you unfavorably to a competitor. Visible only on branded prompts where the engine is asked to opine.
  3. Ghost citations. Your domain cited but your brand not named (Superlines puts this at 73% of brand-presence cases). The branded prompt set is the only way to surface them.

Geographic distribution

  • 70% locale-neutral English. The default, your baseline citation share.
  • 20% explicit-geo. "Best GEO tool in the UK," "French alternative to Profound." Surfaces the engine’s geo-bias and your eligibility in non-US AI Overviews.
  • 10% non-English. French, German, Spanish if you sell there. Required because Perplexity’s citation pool shifts roughly 30% across languages on the same query.

Engine coverage

All 200 prompts run on all 6 engines weekly: ChatGPT, Claude, Perplexity, Gemini, Copilot, Google AI Overviews. Do not drop Claude because "it does not cite." Its zero-citation footprint is itself a metric. Superlines documents a 615× gap in citation volume across platforms for the same brand in the same 30-day window. Skipping engines is how you build a confident-looking dashboard with a hole the size of your technical buyer.

Sampling discipline

The two rules that separate a defensible measurement system from dashboard theatre, neither of which most articles ship.

Thirty runs per prompt per engine

Distribute the 30 runs across 7 to 14 days, not all on Monday. Diurnal variance is real, especially on engines with retrieval caching. A prompt run at 09:00 UTC Monday and the same prompt at 21:00 UTC Friday are two different samples of the same model state.
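
One minimal way to spread the runs, assuming nothing about your stack beyond a scheduler that accepts UTC timestamps (the function name and the 14-day window are illustrative):

python
# Spread 30 runs for one prompt-engine pair across 14 days at randomized hours,
# so the sample covers diurnal and day-of-week variance.
import random
from datetime import datetime, timedelta, timezone

def schedule_runs(start, runs=30, days=14):
    slots = []
    for i in range(runs):
        day = i * days // runs              # roughly even spread over the window
        hour = random.randint(0, 23)        # randomized time of day
        slots.append(start + timedelta(days=day, hours=hour))
    return sorted(slots)

slots = schedule_runs(datetime(2026, 4, 27, tzinfo=timezone.utc))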

Eight-week rolling baseline

Every week-over-week comparison is computed against the rolling 8-week mean and standard deviation, not against last week. Last week is one sample; the rolling mean is fifty-six. Reporting on last week alone is how you ship a 4-point Slack panic that the model corrects on its own by Wednesday.

The ±2 SD noise floor (the rule that ends weekly panic)

The single decision rule the SERP under-answers. Adopt this and your week-over-week conversation changes shape entirely.

python
# Compute weekly. citation_share is a list of weekly values, oldest first,
# with this week's number appended last.
from statistics import mean, stdev

last_8_weeks    = citation_share[-9:-1]        # the 8-week rolling baseline
sigma_8w        = stdev(last_8_weeks)
mean_8w         = mean(last_8_weeks)
delta_this_week = citation_share[-1] - mean_8w
z               = delta_this_week / sigma_8w

# Decision. weeks_outside counts consecutive weeks with abs(z) > 2.
if abs(z) <= 2:                   # noise floor
    action = "log only, no Slack panic"
elif weeks_outside == 1:          # one-off outlier
    action = "investigate this week, do not ship a sprint yet"
else:                             # confirmed signal, two or more weeks
    action = "escalate: rewrite first 80 words of H2 hitting the prompt cluster"

Movements inside ±2 SD: log and ignore. Movements outside ±2 SD: investigate the same week. Movements outside ±2 SD for two consecutive weeks: escalate to a content or PR action. This single rule kills roughly 70% of false-positive Monday panics in our customer set.

The 90-minute Monday loop

The operational unit. Block it on the calendar one Monday morning, run it once a week, and ship the report by 10:30 AM. Averi ships a similar 6-section template that runs to about 90 minutes; we have iterated on theirs for two quarters.

Pull the 200-prompt × 6-engine × 30-run sample (45 minutes)

Either via your tool of choice (Profound, Peec, Clairon) or a custom script against the engine APIs. Most of this should run automated; you only intervene if a run fails. Output: one citation-share number per prompt per engine for the week.

Compute the z-score against the 8-week baseline (10 minutes)

Either inside the tool or in your Google Sheet. Output: a list of prompts with abs(z) > 2 this week. This is your weekly action queue.
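
If the weekly numbers land in a flat table of prompt, engine, week, and citation share, the flagged queue is a short pandas pass; the column names below are assumptions, not any tool's export schema.

python
# Flag prompt-engine pairs whose latest week sits outside ±2 SD of the prior 8 weeks.
import pandas as pd

def action_queue(df):
    rows = []
    for (prompt, engine), g in df.sort_values("week").groupby(["prompt", "engine"]):
        baseline = g["share"].iloc[-9:-1]          # 8-week rolling baseline
        current = g["share"].iloc[-1]              # this week's citation share
        sigma = baseline.std()
        if sigma and abs((current - baseline.mean()) / sigma) > 2:
            rows.append({"prompt": prompt, "engine": engine,
                         "z": (current - baseline.mean()) / sigma})
    return pd.DataFrame(rows)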

Triage the queue with the five-row table (20 minutes)

Apply the triage table in the next section to every flagged prompt. Output: list of actions, list of escalations, list of items to monitor for a second week.

Write the 6-section weekly report (15 minutes)

Sections: SOV per engine, week-over-week delta with z-scores, top 5 winning prompts, top 5 losing prompts, ghost citation count, action queue. Total length: under 600 words. Slack-ready.

The five-row weekly triage table

What to do, by signal. Keep this in your weekly report header so everyone reads from the same playbook.

Weekly triage decisions
Signal | Action this week | Escalate when
SOV down 1 to 2 SD on 1 engine | Log, check ghost-citation column | Two consecutive weeks below baseline
SOV down >2 SD on 1 engine | Pull the last 30 responses, diff cited sources versus baseline | Same week, brief content owner
New competitor enters top-3 cited brands | Add competitor to tracked set, log their cited URLs | Competitor holds top-3 for 2 weeks
Owned page drops out of citation pool | Re-check freshness date and rewrite first 80 words of H2 | Page > 60 days unchanged with no recovery
Hallucination on branded prompt | Immediate paragraph rewrite plus outreach to incorrectly-cited source | Any occurrence

The job of the weekly loop is not to act on every number. It is to ignore the numbers that do not deserve action.
Internal Clairon playbook · GEO measurement principle #2

Quarterly: re-seed the prompt set

The prompt set itself decays. Without rotation, your library slowly becomes a museum of last quarter’s search intent. Once a quarter, run this:

  1. Retire the 20 lowest-variance prompts: the ones where the citation share has not moved more than 1 SD in a quarter (see the sketch after this list). The signal is saturated; the prompt teaches you nothing.
  2. Replace with 20 new prompts. Pull from your customer-support transcripts (high-intent jobs-to-be-done) and SERP People-Also-Ask data (rising informational queries).
  3. Re-baseline. Drop the 8-week rolling window for the new prompts and rebuild it over the next eight weeks. Do not mix replaced prompts into a baseline they were never in.
  4. Document the rotation. One paragraph in the quarterly report explaining what rotated and why. This is what your CMO will ask you about in the QBR.
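
A sketch of the retirement step from item 1, assuming a quarter of weekly citation-share rows per prompt; the column names and function are illustrative, not part of any tool.

python
# Pick the 20 prompts whose citation share moved least over the quarter.
import pandas as pd

def prompts_to_retire(df, n=20):
    variance = df.groupby("prompt")["share"].std()   # per-prompt SD over the quarter
    return variance.nsmallest(n).index.tolist()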

Cost and tool selection

The 200-prompt × 6-engine × 30-run system is feasible across three cost tiers. Pick the one that matches your week-1 to week-12 phase.

Three feasible measurement stacks
Tier | Stack | Monthly cost | Labor / week
Manual / free | Google Sheet, Make.com or Zapier, browser scripts | $0 | 8 to 12 hours
Mid-market specialist | Profound Growth, Peec AI Pro, or Clairon | $49 to $499 | 1 to 2 hours
Enterprise | Profound Enterprise, Scrunch, or AthenaHQ Enterprise | $2,000 to $5,000+ | Under 1 hour, dedicated analyst

Most teams under $50M ARR overspend in week 1 on enterprise tools and underspend in week 12 on labor. The mid-market specialist tier is the right answer for 70% of B2B SaaS in this revenue band. The 9-tool teardown gives an honest verdict for each ARR stage.

Where to go deeper

This article sits inside a six-piece GEO measurement cluster. The companion playbooks below cover the metrics vocabulary, the tooling teardown, the competitor baseline, and the dashboard you show your CMO every Friday.

The teams that win citation share are not the ones that measure the most. They are the ones that ignore the right 70% of weekly movement and act on the 30% that survives the noise floor.

Frequently asked questions

How many prompts is enough, and how do I know I'm not just sampling noise?
Two hundred prompts at thirty runs per engine per week is the smallest set that gives you a 95% confidence interval narrow enough to detect a 10-point share-of-voice change. MaximusLabs derives this from a polling-style margin of error (1.96 standard deviations). At 30 runs, a 30% citation share has a CI of roughly [13.6%, 46.4%], so any single weekly number under that bracket is statistical noise.
What is a defensible noise floor before I act on a weekly drop?
A ±2 standard deviation rolling rule against an 8-week baseline. Any movement inside ±2 SD is noise, log and ignore. Any movement outside ±2 SD is investigated the same week. Movements outside ±2 SD for two consecutive weeks escalate to content rewrite or PR action. This single rule ends roughly 70% of false-positive Monday panics.
How do I split branded versus unbranded prompts, and why does it matter?
Eighty percent unbranded, twenty percent branded. The 20% branded slice is the only way to detect three failure modes most teams miss: hallucinations (ChatGPT inventing pricing or features), sentiment drift (a model starts caveating your brand), and ghost citations (your domain cited but your brand not named, which Superlines puts at 73% of brand-presence cases).
What do I do when ChatGPT cites my domain but not my brand name?
Ghost citations are a passage problem, not a domain problem. The cited URL has a quotable answer paragraph, but the brand identifier is missing or buried. Open the cited page, find the answer paragraph, ensure the brand name appears in the first 40 words alongside the answer. We see ghost citations convert to named citations within two refresh cycles in roughly 60% of cases when this fix is applied.
How do I measure on Claude when Claude rarely cites?
The zero-citation footprint is itself a metric. Track Claude's mention rate (brand named in answer text without a link) separately from citation rate. Claude cites less but quotes longer passages, so a Claude mention with a verbatim quote is worth more on conversion than a generic ChatGPT citation. Don't drop the engine, change what you measure on it.
How often should I refresh the prompt set itself?
Quarterly. Retire the twenty prompts with lowest variance over the last quarter (the ones where the answer no longer changes, signal saturated) and replace them with twenty new prompts pulled from your customer-support transcripts and SERP People-Also-Ask data. Without this rotation, your prompt library decays faster than your citation share.
What is the realistic monthly cost of a 200-prompt × 6-engine × 30-run measurement system?
Manual on a free tier: free, but 8 to 12 hours of labor per week. API-based with a tool like Profound or Peec: $400 to $800 per month for the volume, with most of the labor automated. Clairon entry at $49 per month covers the 200-prompt × 6-engine × 30-run sample plus paragraph-level rewrite suggestions on cited URLs. The break-even versus manual is around week three.
When do I escalate from measurement to content rewrite?
When a prompt cluster sits outside ±2 SD for two consecutive weeks, or when a single prompt drops a brand from the cited set entirely. The rewrite unit is the answer paragraph, not the page. Rewrite the first 80 words of the H2 the prompt is hitting, ensure the brand and the answer appear in the same sentence, ship a dateModified bump, re-measure in 7 days.