How Do AI Search Engines Work? The 5-Stage Pipeline Behind ChatGPT, Claude, Perplexity and Google AI Overviews

Hugo Debrabandere

Co-founder · Clairon

Apr 28, 2026

If you have spent a decade thinking about Google’s index, the inside of an AI search engine is going to feel inverted. Pages are no longer the unit of analysis. Sentences are. Backlinks are no longer the dominant authority signal. Corroboration is. A page that ranks #1 on Google can be invisible inside ChatGPT, and a page nobody has heard of can be cited 7 times in Claude’s first answer. This article explains why, by walking through the actual pipeline these engines run.

We will go stage by stage, name the engines and their indexes, and at every step pull out what the writer should change.

Stage 1: Understanding the query

Before retrieval, the engine has to figure out what you actually meant. This stage looks deceptively simple, yet it is the most important one for entity-aware brands.

What happens

The model rewrites the raw user prompt into a retrieval-friendly internal representation. Three things happen; a rough sketch of all three follows the list.

  • Entity disambiguation. “Apex” gets matched against a knowledge graph to determine which Apex you mean (the climbing brand, the legal software, the rocket company). The engine pulls in candidate entities from sources like Wikipedia, Google’s Knowledge Graph, Crunchbase, and LinkedIn.
  • Query rewriting. A casual query like “is apex any good” gets internally rewritten into something closer to “reviews and assessments of Apex Legal Software for mid-market law firms.” The rewrite is informed by recent context and the engine’s prior beliefs about the user.
  • Intent classification. The query gets tagged: informational, commercial, navigational, comparison, voice-style. Different intents route to different downstream behaviors.
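
To make those three steps concrete, here is a minimal Python sketch of a query-understanding pass. The entity lookup, intent rules, and rewrite template are illustrative placeholders, not any engine's actual implementation.

```python
from dataclasses import dataclass

# Illustrative stand-in for a knowledge-graph lookup; real engines query
# sources like Wikipedia, the Google Knowledge Graph, and Crunchbase.
KNOWN_ENTITIES = {
    "apex": ["Apex (climbing brand)", "Apex Legal Software", "Apex (rocket company)"],
}

@dataclass
class UnderstoodQuery:
    raw: str
    candidate_entities: list[str]
    intent: str
    rewritten: str

def understand(raw_query: str, user_context: str = "") -> UnderstoodQuery:
    lowered = raw_query.lower()

    # 1. Entity disambiguation: match query tokens against known entities.
    entities = [e for token in lowered.split() for e in KNOWN_ENTITIES.get(token, [])]

    # 2. Intent classification: a toy keyword router stands in for a trained classifier.
    if any(w in lowered for w in ("best", "vs", "compare", "any good")):
        intent = "commercial-comparison"
    elif any(w in lowered for w in ("how", "what", "why")):
        intent = "informational"
    else:
        intent = "navigational"

    # 3. Query rewriting: expand the casual prompt into a retrieval-friendly form.
    #    A real engine would score candidates against context; here we naively take the first.
    subject = entities[0] if entities else raw_query
    rewritten = f"reviews and assessments of {subject} {user_context}".strip()

    return UnderstoodQuery(raw_query, entities, intent, rewritten)

print(understand("is apex any good", "for mid-market law firms"))
```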

Writer’s takeaway

The engine cannot cite you if it cannot identify you. Identity is the first signal in our Citation Trinity framework for a reason. Practical implications: ship Organization schema, link your site to Wikipedia and Crunchbase entries, ensure consistent naming across G2, LinkedIn and your own site. Brands with messy disambiguation get filtered out before the retrieval step even runs.
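
Here is a minimal sketch of the identity signal that takeaway implies: a Python dict that serializes to the Organization JSON-LD you would embed in the page head. The brand name and URLs are placeholders; swap in your own and keep them identical everywhere they appear.

```python
import json

# Placeholder brand and URLs, for illustration only.
organization_schema = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Apex Legal Software",  # use the exact name you use on G2, LinkedIn, etc.
    "url": "https://www.example.com",
    "sameAs": [  # corroborating profiles the engine can cross-check for disambiguation
        "https://en.wikipedia.org/wiki/Example",
        "https://www.crunchbase.com/organization/example",
        "https://www.linkedin.com/company/example",
        "https://www.g2.com/products/example",
    ],
}

# Embed the output in a <script type="application/ld+json"> tag on every page.
print(json.dumps(organization_schema, indent=2))
```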

Stage 2: Query fan-out

Most users do not realize their single question becomes 4 to 12 questions inside the engine. Understanding this stage explains why some pages get cited on queries you would not expect.

What happens

A single query like “best CRM for plumbers” gets decomposed into parallel sub-queries:

  • “best CRM for small business”
  • “CRM features for field service teams”
  • “industry-specific CRM for plumbing”
  • “CRM with mobile app for technicians”
  • “CRM pricing for under 50 users”

Each sub-query hits the index independently. The engine then merges the candidate passages from all sub-queries into one ranking pool.
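
A rough sketch of the fan-out-and-merge loop follows. The sub-query templates and the stubbed index lookup are invented for illustration; the point is the shape: one query in, several sub-queries out, one merged candidate pool back.

```python
def fan_out(query: str) -> list[str]:
    # Illustrative templates; production engines generate sub-queries with an LLM.
    return [
        f"{query} for small business",
        f"{query} features",
        f"{query} pricing",
        f"{query} alternatives",
    ]

def retrieve(sub_query: str) -> list[dict]:
    # Stub for an index lookup that returns scored candidate passages.
    return [{"url": "https://example.com", "passage": f"stub passage for {sub_query!r}", "score": 0.5}]

def search(query: str) -> list[dict]:
    pool: dict[str, dict] = {}
    for sub_query in fan_out(query):
        for candidate in retrieve(sub_query):
            key = candidate["url"] + candidate["passage"]
            # Merge: a passage keeps the best score it earned across any sub-query.
            if key not in pool or candidate["score"] > pool[key]["score"]:
                pool[key] = candidate
    return sorted(pool.values(), key=lambda c: c["score"], reverse=True)

print(search("best CRM for plumbers"))
```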

Writer’s takeaway

This is why niche pages get cited. The engine is not looking for a single page that answers the whole query. It is looking for the best passages on each sub-query, then assembling them. A page that nails one specific sub-query (“CRM with mobile app for HVAC technicians”) can be cited as part of an answer to a broader query (“best CRM for plumbers”).

The implication: write to specific sub-queries with answer-first H2s, even if you think the parent query is over-served. The page that answers the sub-question well wins a citation slot in answers to all the parent queries that decompose into it.

Stage 3: Retrieval (where the engines diverge most)

This is where the four major engines actually look different from each other. The retrieval step is what determines whether your page even enters the candidate pool.

What happens

Each engine maintains an index. The index gets queried with the sub-queries from Stage 2. Matching passages get returned. The shape of the index varies.

Engine | Index source | Crawler | Index granularity
ChatGPT (Search mode) | Bing's index | BingBot | Page-level + passage
Claude | Direct web fetch (no own index for citations) | ClaudeBot | Page fetched at query time, chunked dynamically
Perplexity | Proprietary Sonar index + third-party | PerplexityBot | Sub-document fragments (5 to 7 token snippets)
Google AI Overviews / Gemini | Google's main index | Googlebot | Page-level + passage + entity graph

The Perplexity sub-document detail

Perplexity is the most architecturally distinct. Per a 2024 interview with Perplexity executives (Search Engine Journal), the engine indexes 5 to 7 token snippets, roughly 2 to 4 words each, and at retrieval time pulls about 130,000 tokens of the most relevant snippets into the model’s context window. The synthesis happens against that snippet pool, not against full pages.

This is fundamentally different from Bing or Google’s traditional indexing. The page does not have to “rank” at the document level for Perplexity. A two-sentence passage on an obscure page can win a citation if those two sentences are precisely the snippets the model needs.
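
A simplified sketch of what sub-document retrieval implies mechanically: chunk pages into tiny snippets, score them per query (scoring omitted here), and greedily pack the best ones into a fixed token budget. The 6-token snippet size and 130,000-token budget mirror the figures above; everything else is an assumption.

```python
def snippet_chunks(page_text: str, tokens_per_snippet: int = 6) -> list[str]:
    # Crude whitespace "tokens"; production systems use real tokenizers.
    tokens = page_text.split()
    return [
        " ".join(tokens[i : i + tokens_per_snippet])
        for i in range(0, len(tokens), tokens_per_snippet)
    ]

def pack_context(scored_snippets: list[tuple[float, str]], budget: int = 130_000) -> list[str]:
    # Greedily fill the context window with the highest-scoring snippets first.
    context, used = [], 0
    for score, snippet in sorted(scored_snippets, reverse=True):
        cost = len(snippet.split())
        if used + cost > budget:
            break
        context.append(snippet)
        used += cost
    return context
```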

Writer’s takeaway

The 130,000 token detail explains why Perplexity is more generous with citations than ChatGPT or Claude. It also explains why short, fact-dense passages outperform long, narrative paragraphs across all four engines. Practical implication: write paragraphs that are self-contained at the 60 to 100 word level. Each paragraph should make one claim, name one source, and read sensibly even if the surrounding paragraphs were removed.
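
If you want to check drafts against that rule, a crude heuristic linter like the one below catches most violations. The word band and the "names a source" patterns are assumptions drawn from this article, not engine behavior.

```python
import re

def passage_report(paragraph: str) -> dict:
    words = len(paragraph.split())
    # Rough proxies for "names one source": a link, "per X", or "according to X".
    names_source = bool(re.search(r"https?://|per [A-Z]|according to [A-Z]", paragraph))
    # A paragraph that opens with a pronoun usually depends on the previous one.
    dangling_opener = bool(re.match(r"\s*(this|that|it|they|these)\b", paragraph, re.IGNORECASE))
    return {
        "word_count": words,
        "in_60_to_100_word_band": 60 <= words <= 100,
        "names_a_source": names_source,
        "starts_with_dangling_reference": dangling_opener,
    }
```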

Stage 4: Re-ranking

Retrieval returns 50 to 500 candidate passages, depending on the engine. Re-ranking trims that to 5 to 20. This stage is where most pages die.

What happens

The re-ranker is a smaller, faster model that scores each candidate passage on multiple axes. The exact axes are proprietary to each engine, but our measurements and the academic literature agree on a core set; a weighted-scoring sketch follows the list.

  • Extractability. Does the passage answer the question in its first 1 to 2 sentences? Self-contained passages score higher.
  • Authority signals. Domain trust, entity disambiguation strength, sameAs schema, and presence in third-party sources.
  • Corroboration. Does the passage’s claim match other independent sources? Outlier claims get downweighted.
  • Freshness. Recency signals matter more here than in classic search. Fifty percent of cited content across the four engines was published or updated in the last 13 weeks, per Amsive and Seer Interactive analyses.
  • Format fit. Tables and structured lists rank higher than prose for comparison-style sub-queries. Question-form H2s rank higher than noun-phrase H2s.
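
Putting those axes together, re-ranking can be thought of as a weighted sum over per-axis scores. The weights below are invented for illustration; each engine's real re-ranker and weighting are proprietary.

```python
# Illustrative weights only; the real per-axis weighting is proprietary to each engine.
WEIGHTS = {
    "extractability": 0.30,
    "authority": 0.25,
    "corroboration": 0.20,
    "freshness": 0.15,
    "format_fit": 0.10,
}

def rerank(candidates: list[dict], keep: int = 20) -> list[dict]:
    """Each candidate carries per-axis scores in [0, 1]; keep only the top passages."""
    def score(candidate: dict) -> float:
        return sum(weight * candidate.get(axis, 0.0) for axis, weight in WEIGHTS.items())
    return sorted(candidates, key=score, reverse=True)[:keep]
```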

What gets killed at this stage

Passages that fail re-ranking even though they were retrieved tend to share a few patterns:

  • They start with story or scene-setting and bury the answer.
  • They make claims that no other source repeats (orphan claims).
  • They use vague quantifiers (“studies show”) instead of named sources.
  • They have no visible last-updated date or have a date the engine can detect as backdated.

Writer’s takeaway

Re-ranking is where the Citation Trinity (Identity, Extractability, Corroboration) does most of its work. Pages that ship answer-first H2s, name sources at the rate of one per 150 words, and link out to authoritative third parties pass re-ranking at several times the rate of pages that do not. This is the highest-impact zone of the entire pipeline.

Stage 5: Generation

The final stage takes the 5 to 20 surviving passages and writes the answer. Two patterns exist.

Pattern 1: Synthesis

The model reads all the surviving passages, decides which claims are well-supported, and writes a new answer in its own voice that mixes information from multiple sources. ChatGPT, Claude and Gemini default to synthesis. The brand mention shows up as a name-drop or a citation, often paraphrased rather than quoted.

Pattern 2: Extraction

The model picks one or two of the surviving passages and presents them with minimal rewriting. Voice assistants, featured snippets, and Google AI Overviews lean toward extraction. The brand shows up as a verbatim or near-verbatim quote.

Perplexity does both, depending on the query: it extracts when the answer is short and synthesizes when the answer is long. ChatGPT and Claude do mostly synthesis but occasionally extract.
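
A rough sketch of how that routing might look is below. The 50-word threshold and the `llm` callable are assumptions for illustration; no engine documents its exact decision rule.

```python
def generate_answer(query: str, passages: list[str], llm) -> dict:
    best = passages[0]

    # Extraction: present the best passage nearly verbatim when it is short and self-contained.
    if len(best.split()) <= 50:  # illustrative threshold, not a documented engine value
        return {"mode": "extraction", "answer": best, "citations": [best]}

    # Synthesis: rewrite across all surviving passages in the model's own voice.
    prompt = (
        "Answer the question using only these passages.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {query}"
    )
    return {"mode": "synthesis", "answer": llm(prompt), "citations": passages}
```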

Writer’s takeaway

This is why writing for both extraction and synthesis matters. A 50-word answer block under each H2 wins extraction. The 800-word body around it wins synthesis (because the synthesized answer reaches into the supporting paragraphs for additional facts). Writing for only one shape means losing the other half of the citation surface.

We unpack the two patterns in detail in GEO vs AEO.

Engine-by-engine architecture summary

The pipeline above is shared in shape. The implementation details differ. This is the cleanest summary we have for the four engines as of April 2026.

Engine | Index | Retrieval style | Citation behavior | Crawler
ChatGPT (Search) | Bing | Page + passage | Names brands, sometimes links | OAI-SearchBot, ChatGPT-User
Claude | Direct fetch (no own index) | Live fetch + chunk | Names brands and quotes passages | ClaudeBot
Perplexity | Proprietary Sonar (Llama 3.3) + third-party | Sub-document fragments | Heavy citations with clickable sources | PerplexityBot
Gemini / AI Overviews | Google main index + Knowledge Graph | Page + passage + entity | Mixed extraction and synthesis | Googlebot

A note on crawler access

If your site blocks any of these crawlers in robots.txt, you become invisible to that engine. Per a Press Gazette analysis, 80% of news publishers block at least one AI training crawler, often inadvertently locking themselves out of the citation pool. Audit your robots.txt quarterly. Treat ClaudeBot, PerplexityBot, OAI-SearchBot, ChatGPT-User and Google-Extended as critical access (allow them) unless you have a deliberate reason to block.
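
A quarterly audit can be a few lines of Python using the standard library's robots.txt parser. The site URL is a placeholder; point it at your own domain.

```python
from urllib.robotparser import RobotFileParser

# Crawlers this article treats as critical access for citations.
AI_CRAWLERS = ["ClaudeBot", "PerplexityBot", "OAI-SearchBot", "ChatGPT-User", "Google-Extended"]

def audit_robots(site: str = "https://www.example.com") -> dict[str, bool]:
    parser = RobotFileParser()
    parser.set_url(f"{site}/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    # True means the crawler is allowed to fetch the homepage under current rules.
    return {bot: parser.can_fetch(bot, f"{site}/") for bot in AI_CRAWLERS}

print(audit_robots())
```

If any value comes back False for a bot you meant to allow, that engine cannot include you in its citation pool until the rule is fixed.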

The single most important architectural shift for writers

If you read no other section, read this one.

In classic Google-era search, the unit of competition was the page. You wrote a long, comprehensive page on a topic, you accumulated backlinks to that page, the page ranked.

In AI search, the unit of competition is the passage. The page still exists, but the engine retrieves and re-ranks at the 40 to 200 word level. A page is a collection of independently citable passages. Pages with 12 great passages outperform pages with one great passage and 11 mediocre ones, because each passage is a separate citation opportunity.

  • 40 to 200 words per retrievable passage
  • ~130,000 tokens of snippets pulled per Perplexity query
  • 13 weeks median age of cited content across engines

This is the structural reason the Princeton GEO research (Aggarwal et al., arXiv 2311.09735) found that the highest-impact edits are at the passage level (Statistics Addition, Quotation Addition, Cite Sources). Each of those edits lifts the citability of the passage it is applied to, and the citation share of the page compounds across all its passages.
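
A quick way to apply that finding to a draft is a heuristic pass that flags which of the three edits a passage still lacks. The regex patterns below are crude assumptions, not the paper's methodology.

```python
import re

def missing_geo_edits(passage: str) -> list[str]:
    """Flag which of the three high-impact GEO edits a passage appears to lack."""
    edits = []
    if not re.search(r"\d+(\.\d+)?\s*(%|percent|weeks?|tokens?|x\b)", passage):
        edits.append("Statistics Addition")
    if not re.search(r"[\"\u201c].+?[\"\u201d]", passage):
        edits.append("Quotation Addition")
    if not re.search(r"https?://|according to|per [A-Z]", passage):
        edits.append("Cite Sources")
    return edits
```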

Where the pipeline is going

Three architectural shifts are visible in 2026 that will reshape how AI search works over the next 18 months.

  • Sub-document indexing becomes the default. Perplexity is the leader here. Google has begun migrating AI Overviews toward more granular passage retrieval. ChatGPT and Claude will follow. The implication: paragraph-level writing matters more, not less, over time.
  • Corroboration networks get formalized. Engines are starting to weight sources higher when they are cross-referenced across Wikipedia, G2, Crunchbase, and Reddit. Brands that build presence on those platforms will compound visibility faster.
  • Real-time freshness wins. Perplexity already weights freshness aggressively. ChatGPT and Claude are following. The 13-week median citation freshness window is shrinking. Static evergreen content without refresh cycles will lose citation share to recently updated competitors, even at lower base authority.

The engines that will win this period are the ones that retrieve aggressively, corroborate carefully, and update fast. The brands that will win are the ones that match those behaviors in their content writing.

What’s next

You now have the technical model. Two next moves.

For the optimization framework that turns this architectural understanding into concrete writing changes, read the complete guide to GEO. It walks through the Citation Trinity in depth and includes a 30-day quick-start plan.

For the strategic positioning question (where does this fit alongside SEO), read GEO vs SEO.

When you are ready to measure how each of these engines is treating your domain, run a free AI visibility audit. We baseline your citation rate engine by engine, so you can see where Bing-based ChatGPT is treating you differently from Sonar-based Perplexity.

The architecture is finally readable. The teams that internalize it will write differently for the next decade. The teams that do not will spend the decade wondering why their best pages are silently absent from the answers their buyers see.

Frequently asked questions

What is RAG and why does it matter for AI search?
RAG stands for Retrieval-Augmented Generation. It is the architecture where a model does not answer from its training memory alone but retrieves live passages from a web index and synthesizes the answer using those passages. ChatGPT Search, Claude with web access, Perplexity, Google AI Overviews and Gemini all use RAG variants.
Does ChatGPT use Google or its own index?
Neither, for live search. ChatGPT Search uses Bing's index, a consequence of OpenAI's partnership with Microsoft, which began in 2019 and deepened in 2023. If your site is unindexed in Bing, ChatGPT Search cannot cite you, regardless of your Google ranking.
How does Perplexity differ from ChatGPT architecturally?
Perplexity runs its own proprietary index called Sonar, built on Llama 3.3, supplemented by third-party sources. It retrieves at the sub-document level (5 to 7 token snippets, roughly 2 to 4 words each) and pulls about 130,000 tokens of relevant snippets per query into the model context. ChatGPT Search retrieves at the page level via Bing.
What is sub-document indexing?
Indexing the web at the granularity of fragments smaller than a page (often paragraphs, sentences, or sub-sentence snippets), rather than indexing whole pages. Perplexity is the most aggressive sub-document indexer in production today.
How often do AI search engines update their indexes?
Varies. Bing (used by ChatGPT) refreshes major sites multiple times per day, smaller sites less often. Perplexity's Sonar refreshes aggressively. Claude does direct live fetches at query time, so freshness depends on whether your page can be fetched.
Does my site need llms.txt?
A small llms.txt file is helpful for Claude and a few other engines that explicitly look for it. It is not strictly required, but it is a low-cost signal that your site is intentional about AI access. Recommended.
Should I block AI crawlers if I do not want to be in training data?
Be careful. Some crawlers (Google-Extended, GPTBot's training-only mode) only collect for training. Others (OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot) are needed for real-time citations. Blocking the latter eliminates you from the citation pool. Block selectively and audit quarterly.