How to Track AI Search Visibility: Technical Wiring 2026

Most AI visibility tracking lives in dashboards. The dashboards lie about half the time, because they show you a derived signal (citations) without showing you the upstream one (crawls). If you cannot see GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot and PerplexityBot in your access logs, you are reasoning about ChatGPT’s behavior with one eye closed.

The other articles in this cluster cover what to measure, which tool to buy, and how to run a weekly cadence. This one is the plumbing. Server logs, robots.txt, IndexNow, and the crawl-to-citation join. If you ship the four blocks below, you have a real-time view of every AI engine that decides to fetch a page on your site, plus a 24 to 48 hour speedup on ChatGPT retrieval that costs you nothing.

87% of ChatGPT citations match Bing's top 10 organic results (Seer Interactive, 2025)

24-48h: typical IndexNow speedup vs organic discovery on the Bing → ChatGPT chain

17 AI crawlers and opt-out tokens worth handling explicitly in robots.txt

3 classes of AI bots that matter

The user-agent zoo is large, but every bot you handle in 2026 falls into one of three classes. Get this taxonomy right and the rest of the wiring decisions become obvious.

Training crawlers. Build the corpus the model is trained on (GPTBot, ClaudeBot, CCBot, Amazonbot, meta-externalagent). Effects show up 3 to 12 months later in default knowledge. Robots.txt compliance is universal among reputable vendors.
Live retrieval crawlers. Fire when a user asks the AI a question (ChatGPT-User, OAI-SearchBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, DuckAssistBot, Googlebot for AI Mode). Block these and you disappear from the answer instantly. Always allow the retrieval class.
Opt-out tokens. Send zero HTTP requests (Google-Extended, Applebot-Extended). They exist only as robots.txt directives. Allow to participate in training, disallow to opt out. They never appear in your logs because they never crawl.

The 17-bot reference table

Every AI crawler and opt-out token a B2B SaaS or content publisher should handle explicitly in 2026. The user-agent strings are stable substrings, the owners and functions are public, and the robots.txt column is our recommended default with the reasoning baked in.

17 AI bots, 2026

Bot	User-agent	Owner	Function	robots.txt default
GPTBot	GPTBot/1.2	OpenAI	Training	Allow (train, opt out for IP)
ChatGPT-User	ChatGPT-User/1.0	OpenAI	Live retrieval	Always Allow
OAI-SearchBot	OAI-SearchBot/1.0	OpenAI	ChatGPT Search index	Always Allow
ClaudeBot	ClaudeBot/1.0	Anthropic	Training	Allow (train, opt out for IP)
Claude-User	Claude-User/1.0	Anthropic	Live retrieval	Always Allow
Claude-SearchBot	Claude-SearchBot	Anthropic	Retrieval index	Always Allow
PerplexityBot	PerplexityBot/1.0	Perplexity	Retrieval index	Always Allow
Perplexity-User	Perplexity-User/1.0	Perplexity	Live retrieval	Always Allow
Google-Extended	(no requests, token only)	Google	Gemini training opt-out	Allow (or Disallow to opt out)
GoogleOther	GoogleOther	Google	R&D / experimental fetch	Allow
Googlebot	Googlebot/2.1	Google	Search + AI Mode + AI Overviews	Always Allow
CCBot	CCBot/2.0	Common Crawl	Open web corpus (feeds many LLMs)	Disallow recommended
meta-externalagent	meta-externalagent/1.1	Meta	Meta AI training	Allow (or Disallow to opt out)
Bytespider	Bytespider	ByteDance	Doubao LLM training (undocumented)	Disallow + WAF block
Amazonbot	Amazonbot/0.1	Amazon	Alexa + Nova training	Allow + meta noarchive
Applebot-Extended	(no requests, token only)	Apple	Apple Intelligence training opt-out	Allow (or Disallow to opt out)
DuckAssistBot	DuckAssistBot/1.0	DuckDuckGo	DuckDuckGo AI Assist	Always Allow

Sample robots.txt (recommended default)

The maximum-AI-visibility template. This is what we ship to customers who want to be cited everywhere. The publisher template (block training, allow retrieval) is one toggle away: flip the GPTBot, ClaudeBot, meta-externalagent and Amazonbot lines from Allow to Disallow if your legal team requires training opt-out.

text

# === Live retrieval bots: NEVER block ===
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: Googlebot
Allow: /

# === Training crawlers ===
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: meta-externalagent
Allow: /

User-agent: Amazonbot
Allow: /

# === Opt-out tokens ===
User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

# === Disallow non-compliant or upstream-leaky ===
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# === Universal protections ===
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
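The template above can be sanity-checked before it ships. The following is a minimal sketch, not a full robots.txt parser: it reads only User-agent, Allow and Disallow lines and flags bots whose group blocks the site root. The names parseRobots and findBlockedBots are hypothetical, and wildcard paths and longest-match precedence are deliberately left out.

```typescript
// Simplified robots.txt check: which of the given bots are blocked from "/"?
// Assumption: only User-agent / Allow / Disallow lines matter; no wildcard handling.

type RobotsRule = { type: 'allow' | 'disallow'; path: string };
type RobotsGroup = { agents: string[]; rules: RobotsRule[] };

function parseRobots(text: string): RobotsGroup[] {
  const groups: RobotsGroup[] = [];
  let current: RobotsGroup | null = null;
  let lastWasAgent = false;

  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group; otherwise start a new one.
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (current && (field === 'allow' || field === 'disallow')) {
      current.rules.push({ type: field as 'allow' | 'disallow', path: value });
      lastWasAgent = false;
    } else {
      lastWasAgent = false; // Sitemap and other fields end the agent run
    }
  }
  return groups;
}

function findBlockedBots(robotsTxt: string, bots: string[]): string[] {
  const groups = parseRobots(robotsTxt);
  return bots.filter((bot) => {
    const name = bot.toLowerCase();
    // A bot-specific group wins; otherwise fall back to the "*" group.
    const group =
      groups.find((g) => g.agents.includes(name)) ??
      groups.find((g) => g.agents.includes('*'));
    if (!group) return false; // no applicable group means nothing is disallowed
    const blocksRoot = group.rules.some((r) => r.type === 'disallow' && r.path === '/');
    const allowsRoot = group.rules.some((r) => r.type === 'allow' && r.path === '/');
    return blocksRoot && !allowsRoot;
  });
}
```

Running findBlockedBots against the deployed file with the retrieval-class names (ChatGPT-User, OAI-SearchBot, Claude-User, PerplexityBot and so on) should return an empty array; anything else is a stale Disallow worth fixing before the rest of the wiring.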

Bot logging at the origin

The single most useful 30 minutes of work in this entire playbook. Drop the middleware below, log every AI bot hit to a structured store, and you have a real-time view of which engines are reading which pages. Within 24 hours you have your first GPTBot data point.

Vercel Edge Middleware (Next.js)

Drop this in middleware.ts at the root of your Next.js app. It matches the user agent against known AI bots and fires a fire-and-forget POST to your analytics endpoint without blocking the response.

typescript

import { NextResponse } from 'next/server';
import type { NextFetchEvent, NextRequest } from 'next/server';

// Stable user-agent substrings for the bots we log, tagged by class.
const AI_BOT_PATTERNS = [
  { name: 'GPTBot',             re: /GPTBot\//i,           class: 'training'  },
  { name: 'ChatGPT-User',       re: /ChatGPT-User\//i,     class: 'retrieval' },
  { name: 'OAI-SearchBot',      re: /OAI-SearchBot\//i,    class: 'retrieval' },
  { name: 'ClaudeBot',          re: /ClaudeBot\//i,        class: 'training'  },
  { name: 'Claude-User',        re: /Claude-User\//i,      class: 'retrieval' },
  { name: 'Claude-SearchBot',   re: /Claude-SearchBot/i,   class: 'retrieval' },
  { name: 'PerplexityBot',      re: /PerplexityBot\//i,    class: 'retrieval' },
  { name: 'Perplexity-User',    re: /Perplexity-User\//i,  class: 'retrieval' },
  { name: 'DuckAssistBot',      re: /DuckAssistBot/i,      class: 'retrieval' },
  { name: 'Googlebot',          re: /Googlebot\//i,        class: 'retrieval' },
  { name: 'Bingbot',            re: /bingbot\//i,          class: 'retrieval' },
  { name: 'CCBot',              re: /CCBot\//i,            class: 'training'  },
  { name: 'Amazonbot',          re: /Amazonbot\//i,        class: 'training'  },
  { name: 'meta-externalagent', re: /meta-externalagent/i, class: 'training'  },
  { name: 'Bytespider',         re: /Bytespider/i,         class: 'training'  },
];

export function middleware(req: NextRequest, event: NextFetchEvent) {
  const ua = req.headers.get('user-agent') ?? '';
  const hit = AI_BOT_PATTERNS.find((b) => b.re.test(ua));

  if (hit) {
    const payload = {
      bot: hit.name,
      class: hit.class,
      url: req.nextUrl.pathname,
      // Prefer x-real-ip; fall back to the first hop in x-forwarded-for.
      ip:
        req.headers.get('x-real-ip') ??
        req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ??
        null,
      sigAgent: req.headers.get('signature-agent') ?? null,
      ts: Date.now(),
    };
    // waitUntil keeps the POST alive after the response is returned, so the
    // log call neither delays the page nor gets cut off when the handler exits.
    event.waitUntil(
      fetch('https://collect.yourdomain.com/ai-bot-hit', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(payload),
      }).catch(() => {})
    );
  }

  return NextResponse.next();
}

export const config = { matcher: '/:path*' };

Cloudflare Workers (any stack)

Same logic in a Workers fetch handler. The bonus here: Cloudflare exposes a verified-bot category (cf.verified_bot_category in rules, request.cf.verifiedBotCategory in Workers) which validates IP and behavior, defeating UA spoofing automatically. Categories: AI Crawler, AI Search, AI Assistant.

javascript

export default {
  async fetch(request, env, ctx) {
    const cf = request.cf || {};
    const ua = request.headers.get('user-agent') || '';
    // Populated by Cloudflare only for verified bots, e.g. "AI Crawler",
    // "AI Search" or "AI Assistant". Check the field name against current docs.
    const aiCategory = cf.verifiedBotCategory;

    if (aiCategory && /AI/.test(aiCategory)) {
      const payload = {
        category: aiCategory,
        ua,
        url: new URL(request.url).pathname,
        sigAgent: request.headers.get('signature-agent') ?? null,
        // botManagement is only present on plans with Bot Management enabled.
        verifiedBot: cf.botManagement?.verifiedBot ?? false,
        ts: Date.now(),
      };
      // waitUntil lets the log POST finish after the response is sent.
      ctx.waitUntil(
        fetch(env.COLLECT_ENDPOINT, {
          method: 'POST',
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify(payload),
        }).catch(() => {})
      );
    }

    // Pass the request through to the origin unchanged.
    return fetch(request);
  },
};

The IndexNow shortcut

IndexNow is the fastest, cheapest tactic in 2026 GEO. Submit new or updated URLs to a single endpoint, Bing revisits within 5 to 30 minutes, and ChatGPT-User picks them up on its next retrieval window. Typical observed speedup over organic discovery on the Bing-to-ChatGPT chain: 24 to 48 hours.

Endpoint: POST https://api.indexnow.org/indexnow
Key file: a key of 8 to 128 characters (a-z, A-Z, 0-9 and dashes), hosted as a text file containing the key at site root: https://yourdomain.com/<key>.txt
Bulk limit: up to 10,000 URLs per request
Free fan-out: a single submission propagates to Bing, Yandex, Seznam and Naver

json

POST https://api.indexnow.org/indexnow
Content-Type: application/json

{
  "host": "yourdomain.com",
  "key": "a1b2c3d4e5f6",
  "keyLocation": "https://yourdomain.com/a1b2c3d4e5f6.txt",
  "urlList": [
    "https://yourdomain.com/blog/post-1",
    "https://yourdomain.com/pricing",
    "https://yourdomain.com/vs-competitor"
  ]
}

Wire IndexNow into your CMS publish hook. Every time a page is created, updated, or its dateModified changes, fire the POST. For time-sensitive content (pricing pages, product announcements, comparison pages), this is the highest-impact technical lever in your stack.
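A publish hook along these lines could look like the sketch below. It is an illustration under assumptions, not a reference client: buildIndexNowPayloads and submitToIndexNow are hypothetical names, the key file is assumed to live at the site root as described above, and the batching simply follows the 10,000-URL limit.

```typescript
// Sketch of an IndexNow submitter for a CMS publish hook.
const INDEXNOW_ENDPOINT = 'https://api.indexnow.org/indexnow';
const MAX_URLS_PER_REQUEST = 10000;

interface IndexNowPayload {
  host: string;
  key: string;
  keyLocation: string;
  urlList: string[];
}

// Build one payload per batch of up to 10,000 URLs.
// Duplicates and URLs on other hosts are dropped before batching.
function buildIndexNowPayloads(host: string, key: string, urls: string[]): IndexNowPayload[] {
  const unique = Array.from(new Set(urls)).filter((u) => new URL(u).hostname === host);
  const payloads: IndexNowPayload[] = [];
  for (let i = 0; i < unique.length; i += MAX_URLS_PER_REQUEST) {
    payloads.push({
      host,
      key,
      keyLocation: `https://${host}/${key}.txt`, // assumes the key file sits at site root
      urlList: unique.slice(i, i + MAX_URLS_PER_REQUEST),
    });
  }
  return payloads;
}

// POST each batch in sequence and return the HTTP status codes for logging.
async function submitToIndexNow(host: string, key: string, urls: string[]): Promise<number[]> {
  const statuses: number[] = [];
  for (const payload of buildIndexNowPayloads(host, key, urls)) {
    const res = await fetch(INDEXNOW_ENDPOINT, {
      method: 'POST',
      headers: { 'content-type': 'application/json; charset=utf-8' },
      body: JSON.stringify(payload),
    });
    statuses.push(res.status);
  }
  return statuses;
}
```

Calling submitToIndexNow('yourdomain.com', key, [changedUrl]) from the publish hook is enough for single-page saves; the batching only matters for bulk imports or sitemap-wide resubmissions. Storing the returned status codes alongside the submission time gives you the "IndexNow last submitted" column for the dashboard.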

5-step technical wiring playbook

Run these in order, allow about half a day end to end. After step 5 you have the only thing in this entire stack that no dashboard tool reconstructs for you: a defensible link from crawl event to citation outcome.

Install bot logging

Drop the middleware/Worker above. Log every AI bot hit to a structured store (Postgres, BigQuery, ClickHouse) with bot, class, url, ip, signature_agent, ts. Within 24 hours you have your first GPTBot and ChatGPT-User hits.

Configure robots.txt

Deploy the maximum-visibility template. Verify that each retrieval bot can fetch it with curl: curl -A "ChatGPT-User/1.0" https://yourdomain.com/robots.txt. A 200 response confirms that your server and WAF let the user agent through. Then read the returned file for the most common bug, a stale Disallow: / on a retrieval bot.

Add IndexNow

Generate the key file, host at site root, and POST every new/updated URL to api.indexnow.org/indexnow. Wire into the CMS publish hook so every save fires a submission. ChatGPT speedup begins immediately.

Set up the dashboard

One view: rows = URLs, columns = (last GPTBot hit, last ChatGPT-User hit, last OAI-SearchBot hit, last Perplexity-User hit, IndexNow last submitted). This becomes the operator's daily standup.

Tie crawl to citation

When your citation tracker reports a citation for /blog/foo, join it back to the bot-hit log. Pattern: OAI-SearchBot crawl → citation in ChatGPT within 24 to 72 hours. Citations without a prior crawl mean the model is drawing on stale training data, so switch the page to retrieval-friendly patterns. Crawls without citations mean passage shape is off, not wiring.
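The dashboard view from the "Set up the dashboard" step is essentially a pivot over the bot-hit log. A minimal in-memory sketch follows, assuming hits carry the bot, url and ts fields logged by the middleware; the names BotHit and buildLastHitView are hypothetical, and in practice the same pivot would more likely be a GROUP BY in your store of choice.

```typescript
// Pivot the raw hit log into one row per URL with the latest hit per bot.
interface BotHit {
  bot: string;
  url: string;
  ts: number; // epoch milliseconds, as produced by Date.now()
}

interface LastHitRow {
  url: string;
  lastHit: Record<string, number>; // bot name -> most recent timestamp
}

function buildLastHitView(hits: BotHit[], bots: string[]): LastHitRow[] {
  const byUrl = new Map<string, Record<string, number>>();
  for (const hit of hits) {
    if (!bots.includes(hit.bot)) continue; // keep only the dashboard columns
    const row = byUrl.get(hit.url) ?? {};
    row[hit.bot] = Math.max(row[hit.bot] ?? 0, hit.ts);
    byUrl.set(hit.url, row);
  }
  // Sort by URL so the view is stable from one day to the next.
  return Array.from(byUrl, ([url, lastHit]) => ({ url, lastHit })).sort((a, b) =>
    a.url.localeCompare(b.url)
  );
}
```

A bot with no recorded hit for a URL is simply absent from that row's lastHit map, which renders naturally as an empty cell.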

Tying crawl to citation

The join everyone misses. Your bot log says OAI-SearchBot fetched /pricing at 14:32 on Tuesday. Your citation tracker says ChatGPT cited /pricing at 09:14 on Thursday. The roughly 43-hour gap is the retrieval signal. Four patterns matter, each with a different action.

Crawl-to-citation patterns and actions

Pattern	What it means	Action
Crawl in last 72h, then citation	Active retrieval. Your content is being fetched and chosen.	Double down on the page shape. Replicate on adjacent pages.
No crawl, but citation appears	The model is using stale training data, your passage was indexed long ago.	Switch the page to retrieval-friendly patterns (FAQ schema, dated changelog block, named sources). Push via IndexNow.
Crawl in last 72h, no citation	The model fetched but did not pick. Your passage shape is the bottleneck, not your wiring.	Rewrite the first 80 words of the H2 most likely targeted by the prompt. Re-measure in 7 days.
Neither crawl nor citation	The page is invisible. Either the URL is not indexable or the prompt panel does not include this category.	Check robots.txt, sitemap, and Bing Webmaster. Verify the category is in your prompt panel.
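The four rows of the table can be expressed as a small classifier over the two logs. This is a sketch of one reasonable reading of the table, not the only one: the record shapes, the classifyUrl name and the pattern labels are assumptions, and a citation counts as crawl-backed only when some crawl of the same URL landed in the 72 hours before it.

```typescript
interface CrawlHit { url: string; bot: string; ts: number }       // ts in epoch ms
interface Citation { url: string; engine: string; ts: number }    // ts in epoch ms

type Pattern =
  | 'crawl-then-citation'
  | 'citation-without-crawl'
  | 'crawl-without-citation'
  | 'invisible';

const WINDOW_MS = 72 * 60 * 60 * 1000; // the 72-hour window from the table

function classifyUrl(url: string, hits: CrawlHit[], citations: Citation[], now: number): Pattern {
  const urlHits = hits.filter((h) => h.url === url);
  const urlCites = citations.filter((c) => c.url === url);

  // Row 1: at least one citation preceded by a crawl within the window.
  const crawlBacked = urlCites.some((c) =>
    urlHits.some((h) => h.ts <= c.ts && c.ts - h.ts <= WINDOW_MS)
  );
  if (crawlBacked) return 'crawl-then-citation';

  // Row 2: cited, but no crawl in the window before any citation.
  if (urlCites.length > 0) return 'citation-without-crawl';

  // Row 3 vs row 4: no citation at all; was the page crawled recently?
  const recentCrawl = urlHits.some((h) => h.ts <= now && now - h.ts <= WINDOW_MS);
  return recentCrawl ? 'crawl-without-citation' : 'invisible';
}
```

Running this once per tracked URL and grouping by the returned label turns the table into a work queue: each label maps directly to the action in the right-hand column.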

Citation without crawl is luck. Crawl without citation is a rewrite. Both signals beat the dashboard average, because they tell you what to do, not just what is happening.

Internal Clairon playbook · Technical wiring principle #4

Where to go deeper

Wiring is one block of the broader GEO stack. The companion articles cover what to measure, the tool comparison, and the weekly cadence the wiring feeds into.

GEO Tools and Analytics: The Complete Measurement Guide sits one level above this article. Pick your metrics first, then come back here to wire the data layer.
Best GEO Tools 2026: Honest Teardown of 9 Platforms covers the dashboard layer that consumes this wiring. Profound and Scrunch have the strongest bot-crawl visibility, other tools rely on you wiring it yourself.
How to Measure GEO Performance: The Weekly Operator’s Playbook runs the weekly cadence that benchmarks this data against citation outcomes. Wire first, measure second.
How to Optimize for ChatGPT Search: The Bing-First Playbook goes deeper on the ChatGPT-Bing chain that IndexNow exploits, with the 7-tactic playbook for ChatGPT specifically.

Tools tell you what happened. Wiring tells you why. Both are useful. Only one is reproducible without a vendor invoice.

Frequently asked questions

Do AI bots respect robots.txt in 2026?

Most reputable bots do, including OpenAI (GPTBot, ChatGPT-User, OAI-SearchBot), Anthropic (ClaudeBot, Claude-User), Perplexity, and Google's training crawlers. Google's user-triggered fetchers explicitly do not. Bytespider and xAI's Grok crawler have documented non-compliance. The pragmatic answer: state your policy in robots.txt, then enforce non-compliance at the WAF for the bots that ignore it.

What is the difference between GPTBot and ChatGPT-User?

GPTBot is the offline training crawler. Its effects show up 3 to 12 months later in the model's default knowledge. ChatGPT-User is the live, per-query retrieval crawler that fires when a user asks ChatGPT a question and the model decides to fetch a source. Blocking ChatGPT-User makes you instantly invisible in ChatGPT answers, whereas blocking GPTBot only affects future training cycles. The two are not interchangeable in robots.txt.

Should I block Google-Extended?

Only if you want to opt out of Gemini and Vertex training. Google-Extended is an opt-out token, not a crawler. It sends zero HTTP requests and exists only as a robots.txt directive. Blocking it does not affect Google Search, Googlebot, or AI Overviews ranking. Allow it by default in 2026 because Gemini citation share is non-trivial. Disallow only if legal or licensing demands it.

How do I detect spoofed AI bots?

User-agent strings are trivially spoofable. Three layers of defense, in order of effort: cross-check the source IP against vendor-published JSON files (openai.com/gptbot.json, perplexity.com/perplexitybot.json), use Cloudflare's verified-bot category field which validates IP and behavior, or accept signed Signature-Agent headers (Web Bot Auth, RFC 9421) where supported. UA-only filtering is no longer enough as of 2026.
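The first layer, the IP cross-check, can be sketched as below. Two assumptions are worth stating loudly: the vendor JSON is assumed to have the shape { "prefixes": [{ "ipv4Prefix": "a.b.c.d/nn" }, ...] }, which should be verified against the actual file before relying on it, and the sketch covers IPv4 only. The function names are hypothetical.

```typescript
// Assumed shape of a vendor-published IP list; verify against the real file.
interface VendorIpList {
  prefixes: { ipv4Prefix?: string; ipv6Prefix?: string }[];
}

// Convert dotted IPv4 to an unsigned integer; null when the input is malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let n = 0;
  for (const p of parts) {
    if (!/^\d{1,3}$/.test(p)) return null;
    const octet = Number(p);
    if (octet > 255) return null;
    n = n * 256 + octet; // plain arithmetic avoids signed 32-bit overflow
  }
  return n;
}

// True when ip falls inside the CIDR block, e.g. inCidr('192.0.2.55', '192.0.2.0/24').
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsStr] = cidr.split('/');
  const bits = Number(bitsStr);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  const blockSize = Math.pow(2, 32 - bits); // number of addresses in the block
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// Check a logged source IP against every IPv4 prefix in the vendor list.
function isVendorIp(ip: string, list: VendorIpList): boolean {
  return list.prefixes.some((p) => p.ipv4Prefix !== undefined && inCidr(ip, p.ipv4Prefix));
}

// Fetch a vendor list; cache the result rather than calling this per request.
async function fetchVendorIpList(url: string): Promise<VendorIpList> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`IP list fetch failed: ${res.status}`);
  return (await res.json()) as VendorIpList;
}
```

A hit whose user agent says GPTBot but whose source IP fails isVendorIp is a candidate spoof; flagging it in the log rather than dropping it keeps the data available for later review.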

Does IndexNow work for ChatGPT directly?

No. ChatGPT does not run IndexNow. But ChatGPT browse pulls from Bing's index, and roughly 87% of ChatGPT citations match Bing's top 10 organic results (Seer Interactive, 2025). An IndexNow ping to Bing typically surfaces in ChatGPT 24 to 48 hours sooner than waiting for organic discovery. The chain is IndexNow to Bing to ChatGPT, not IndexNow to ChatGPT directly.

What is the minimum log-parsing setup if I do not have Cloudflare?

A single awk one-liner on Nginx or Apache access logs is enough to count, assuming the default combined log format, where the user agent is the sixth quote-delimited field: awk -F'"' '{print $6}' access.log | grep -iE "gptbot|chatgpt-user|oai-searchbot|claudebot|perplexitybot" | sort | uniq -c | sort -rn. Run nightly via cron, ship deltas to Slack. You will have your first GPTBot hits within 24 hours and your first ChatGPT-User hits within a week on most B2B SaaS sites.
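If the nightly job lives in the same codebase as the rest of the wiring, the same count can be sketched in TypeScript. This assumes combined-format log lines, where the user agent is the third double-quoted field; extractUserAgent and countBotHits are hypothetical names. Unlike the awk pipeline, which counts distinct user-agent strings, this version groups by bot name.

```typescript
const BOT_NAMES = ['GPTBot', 'ChatGPT-User', 'OAI-SearchBot', 'ClaudeBot', 'PerplexityBot'];

// In combined log format the quoted fields are: request, referer, user agent.
function extractUserAgent(line: string): string | null {
  const quoted = line.match(/"[^"]*"/g);
  if (!quoted || quoted.length < 3) return null;
  return quoted[2].slice(1, -1); // strip the surrounding quotes
}

// Count hits per bot name across a batch of log lines.
function countBotHits(lines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of lines) {
    const ua = extractUserAgent(line);
    if (ua === null) continue; // skip malformed lines
    const lower = ua.toLowerCase();
    const bot = BOT_NAMES.find((name) => lower.includes(name.toLowerCase()));
    if (bot) counts[bot] = (counts[bot] ?? 0) + 1;
  }
  return counts;
}
```

Feeding it the lines of the previous day's log and diffing against the prior run gives the same deltas the cron job would post to Slack.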

Do I need llms.txt as well as robots.txt?

robots.txt first, llms.txt as a bonus. robots.txt is the only mechanism with broad bot compliance today: every major AI crawler reads it. llms.txt is a markdown summary file pointed at LLMs as a hint about which content to prioritize. Useful, not yet load-bearing. Ship robots.txt correctly first, then add llms.txt at /llms.txt with your most citable pages and definitions linked in priority order.

How do I correlate a bot hit with a citation outcome?

Join your bot-hit log (URL, bot, timestamp) to your citation tracker output (URL, AI engine, prompt, timestamp). The signal pattern: a retrieval-class crawl (OAI-SearchBot, PerplexityBot, ChatGPT-User) within the 72-hour window before a citation means your content is being actively retrieved. Citations without prior crawl mean the model is using stale training data, switch to retrieval-friendly content patterns. Crawls without citations mean your passage shape is off, not your wiring.
