Clairon

How to Track AI Search Visibility: The Technical Wiring Guide for 2026

Hugo Debrabandere

Co-founder · Clairon

Apr 29, 2026

Most AI visibility tracking lives in dashboards. The dashboards lie about half the time, because they show you a derived signal (citations) without showing you the upstream one (crawls). If you cannot see GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot and PerplexityBot in your access logs, you are reasoning about ChatGPT’s behavior with one eye closed.

The other articles in this cluster cover what to measure, which tool to buy, and how to run a weekly cadence. This one is the plumbing. Server logs, robots.txt, IndexNow, and the crawl-to-citation join. If you ship the four blocks below, you have a real-time view of every AI engine that decides to fetch a page on your site, plus a 24 to 48 hour speedup on ChatGPT retrieval that costs you nothing.

  • 87% of ChatGPT citations match Bing's top 10 organic results (Seer Interactive, 2025)
  • 24-48h: typical IndexNow speedup vs organic discovery on the Bing → ChatGPT chain
  • 17: AI crawlers and opt-out tokens worth handling explicitly in robots.txt

3 classes of AI bots that matter

The user-agent zoo is large, but every bot you handle in 2026 falls into one of three classes. Get this taxonomy right and the rest of the wiring decisions become obvious.

  1. Training crawlers. Build the corpus the model is trained on (GPTBot, ClaudeBot, CCBot, Amazonbot, meta-externalagent). Effects show up 3 to 12 months later in default knowledge. Robots.txt compliance is universal among reputable vendors.
  2. Live retrieval crawlers. Fire when a user asks the AI a question (ChatGPT-User, OAI-SearchBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, DuckAssistBot, Googlebot for AI Mode). Block these and you disappear from the answer instantly. Always allow the retrieval class.
  3. Opt-out tokens. Send zero HTTP requests (Google-Extended, Applebot-Extended). They exist only as robots.txt directives. Allow to participate in training, disallow to opt out. They never appear in your logs because they never crawl.
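The three classes above can be written down as a small lookup, which is handy when the same taxonomy has to drive both robots.txt generation and log tagging. This is a minimal sketch: the bot names and class assignments come from the list above, while the type and function names (`BotClass`, `classifyBot`, `canAppearInLogs`, `mustAlwaysAllow`) are hypothetical.

```typescript
type BotClass = 'training' | 'retrieval' | 'opt-out-token';

// Class assignments copied from the three-class list above.
const BOTS_BY_CLASS: Record<BotClass, string[]> = {
  training: ['GPTBot', 'ClaudeBot', 'CCBot', 'Amazonbot', 'meta-externalagent'],
  retrieval: [
    'ChatGPT-User', 'OAI-SearchBot', 'Claude-User', 'Claude-SearchBot',
    'PerplexityBot', 'Perplexity-User', 'DuckAssistBot', 'Googlebot',
  ],
  'opt-out-token': ['Google-Extended', 'Applebot-Extended'],
};

// Case-insensitive lookup; returns 'unknown' for anything outside the list.
function classifyBot(botName: string): BotClass | 'unknown' {
  const wanted = botName.toLowerCase();
  const classes: BotClass[] = ['training', 'retrieval', 'opt-out-token'];
  for (const cls of classes) {
    if (BOTS_BY_CLASS[cls].some((b) => b.toLowerCase() === wanted)) return cls;
  }
  return 'unknown';
}

// Opt-out tokens send no HTTP requests, so only the other two classes
// can ever show up in access logs.
function canAppearInLogs(cls: BotClass): boolean {
  return cls !== 'opt-out-token';
}

// Per the taxonomy, the retrieval class is the one that should never be blocked.
function mustAlwaysAllow(cls: BotClass): boolean {
  return cls === 'retrieval';
}
```

A quick sanity check for any robots.txt change: every bot where `mustAlwaysAllow` is true should still resolve to an Allow rule afterwards.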

The 17-bot reference table

Every AI crawler and opt-out token a B2B SaaS or content publisher should handle explicitly in 2026. The user-agent strings are stable substrings, and the owners and functions are public. The robots.txt column is our recommended default, with the reasoning baked in.

17 AI bots, 2026

| Bot | User-agent | Owner | Function | robots.txt default |
| --- | --- | --- | --- | --- |
| GPTBot | GPTBot/1.2 | OpenAI | Training | Allow (train, opt out for IP) |
| ChatGPT-User | ChatGPT-User/1.0 | OpenAI | Live retrieval | Always Allow |
| OAI-SearchBot | OAI-SearchBot/1.0 | OpenAI | ChatGPT Search index | Always Allow |
| ClaudeBot | ClaudeBot/1.0 | Anthropic | Training | Allow (train, opt out for IP) |
| Claude-User | Claude-User/1.0 | Anthropic | Live retrieval | Always Allow |
| Claude-SearchBot | Claude-SearchBot | Anthropic | Retrieval index | Always Allow |
| PerplexityBot | PerplexityBot/1.0 | Perplexity | Retrieval index | Always Allow |
| Perplexity-User | Perplexity-User/1.0 | Perplexity | Live retrieval | Always Allow |
| Google-Extended | (no requests, token only) | Google | Gemini training opt-out | Allow (or Disallow to opt out) |
| GoogleOther | GoogleOther | Google | R&D / experimental fetch | Allow |
| Googlebot | Googlebot/2.1 | Google | Search + AI Mode + AI Overviews | Always Allow |
| CCBot | CCBot/2.0 | Common Crawl | Open web corpus (feeds many LLMs) | Disallow recommended |
| meta-externalagent | meta-externalagent/1.1 | Meta | Meta AI training | Allow (or Disallow to opt out) |
| Bytespider | Bytespider | ByteDance | Doubao LLM training (undocumented) | Disallow + WAF block |
| Amazonbot | Amazonbot/0.1 | Amazon | Alexa + Nova training | Allow + meta noarchive |
| Applebot-Extended | (no requests, token only) | Apple | Apple Intelligence training opt-out | Allow (or Disallow to opt out) |
| DuckAssistBot | DuckAssistBot/1.0 | DuckDuckGo | DuckDuckGo AI Assist | Always Allow |

Sample robots.txt (recommended default)

The maximum-AI-visibility template. This is what we ship to customers who want to be cited everywhere. The publisher template (block training, allow retrieval) is one toggle away: if your legal team requires training opt-out, flip the GPTBot, ClaudeBot, meta-externalagent and Amazonbot lines from Allow to Disallow.

```text
# === Live retrieval bots: NEVER block ===
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: Googlebot
Allow: /

# === Training crawlers ===
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: meta-externalagent
Allow: /

User-agent: Amazonbot
Allow: /

# === Opt-out tokens ===
User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

# === Disallow non-compliant or upstream-leaky ===
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# === Universal protections ===
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
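If you maintain both the maximum-visibility and the publisher variant, generating the file from one list keeps the two from drifting apart. The sketch below is one possible way to do that: the bot groups mirror the template above, and the names `TrainingPolicy`, `robotsGroup` and `buildRobotsTxt` are hypothetical. Passing `'disallow'` flips only the four training crawlers named above and leaves everything else untouched.

```typescript
type TrainingPolicy = 'allow' | 'disallow';

// Groups mirror the sections of the template above.
const RETRIEVAL_AGENTS = [
  'ChatGPT-User', 'OAI-SearchBot', 'Claude-User', 'Claude-SearchBot',
  'PerplexityBot', 'Perplexity-User', 'DuckAssistBot', 'Googlebot',
];
const TRAINING_AGENTS = ['GPTBot', 'ClaudeBot', 'meta-externalagent', 'Amazonbot'];
const OPT_OUT_TOKENS = ['Google-Extended', 'Applebot-Extended'];
const BLOCKED_AGENTS = ['Bytespider', 'CCBot'];

// One robots.txt group: a User-agent line followed by a single rule.
function robotsGroup(agent: string, rule: string): string {
  return `User-agent: ${agent}\n${rule}\n`;
}

function buildRobotsTxt(training: TrainingPolicy, sitemapUrl: string): string {
  // The publisher toggle: only the training crawlers change.
  const trainingRule = training === 'allow' ? 'Allow: /' : 'Disallow: /';
  const groups = [
    ...RETRIEVAL_AGENTS.map((a) => robotsGroup(a, 'Allow: /')),
    ...TRAINING_AGENTS.map((a) => robotsGroup(a, trainingRule)),
    ...OPT_OUT_TOKENS.map((a) => robotsGroup(a, 'Allow: /')),
    ...BLOCKED_AGENTS.map((a) => robotsGroup(a, 'Disallow: /')),
    'User-agent: *\nDisallow: /admin/\nDisallow: /api/\nDisallow: /private/\nAllow: /\n',
    `Sitemap: ${sitemapUrl}\n`,
  ];
  // Blank line between groups, as in the hand-written template.
  return groups.join('\n');
}
```

Calling `buildRobotsTxt('disallow', 'https://yourdomain.com/sitemap.xml')` yields the publisher variant; `'allow'` yields the maximum-visibility one.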

Bot logging at the origin

The single most useful 30 minutes of work in this entire playbook. Drop the middleware below, log every AI bot hit to a structured store, and you have a real-time view of which engines are reading which pages. Within 24 hours you have your first GPTBot data point.

Vercel Edge Middleware (Next.js)

Drop this in middleware.ts at the root of your Next.js app. It detects AI bot hits and sends a fire-and-forget POST to your analytics endpoint. The POST is handed to event.waitUntil, so it keeps running after the response is returned and adds no meaningful latency to the page.

```typescript
import { NextResponse } from 'next/server';
import type { NextFetchEvent, NextRequest } from 'next/server';

// Stable user-agent substrings, tagged with the class used in this guide.
const AI_BOT_PATTERNS = [
  { name: 'GPTBot',             re: /GPTBot\//i,             class: 'training'  },
  { name: 'ChatGPT-User',       re: /ChatGPT-User\//i,       class: 'retrieval' },
  { name: 'OAI-SearchBot',      re: /OAI-SearchBot\//i,      class: 'retrieval' },
  { name: 'ClaudeBot',          re: /ClaudeBot\//i,          class: 'training'  },
  { name: 'Claude-User',        re: /Claude-User\//i,        class: 'retrieval' },
  { name: 'Claude-SearchBot',   re: /Claude-SearchBot/i,     class: 'retrieval' },
  { name: 'PerplexityBot',      re: /PerplexityBot\//i,      class: 'retrieval' },
  { name: 'Perplexity-User',    re: /Perplexity-User\//i,    class: 'retrieval' },
  { name: 'DuckAssistBot',      re: /DuckAssistBot\//i,      class: 'retrieval' },
  { name: 'Googlebot',          re: /Googlebot\//i,          class: 'retrieval' },
  { name: 'Bingbot',            re: /bingbot\//i,            class: 'retrieval' },
  { name: 'CCBot',              re: /CCBot\//i,              class: 'training'  },
  { name: 'Amazonbot',          re: /Amazonbot\//i,          class: 'training'  },
  { name: 'meta-externalagent', re: /meta-externalagent\//i, class: 'training'  },
  { name: 'Bytespider',         re: /Bytespider/i,           class: 'training'  },
];

export function middleware(req: NextRequest, event: NextFetchEvent) {
  const ua = req.headers.get('user-agent') ?? '';
  const hit = AI_BOT_PATTERNS.find((b) => b.re.test(ua));

  if (hit) {
    const payload = {
      bot: hit.name,
      class: hit.class,
      url: req.nextUrl.pathname,
      ip: req.headers.get('x-real-ip') ?? null,
      sigAgent: req.headers.get('signature-agent') ?? null,
      ts: Date.now(),
    };
    // waitUntil keeps the POST alive after the response is sent. A bare,
    // un-awaited fetch can be cancelled when the middleware returns.
    event.waitUntil(
      fetch('https://collect.yourdomain.com/ai-bot-hit', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(payload),
      }).catch(() => {}) // never let a logging failure affect the page
    );
  }

  return NextResponse.next();
}

// Runs on every path, including static assets; narrow the matcher if needed.
export const config = { matcher: '/:path*' };
```

Cloudflare Workers (any stack)

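Same logic in a Workers fetch handler. The bonus here: Cloudflare exposes a verified-bot category (cf.verified_bot_category in the Rules language, read as verifiedBotCategory on request.cf inside a Worker) which validates IP and behavior, defeating UA spoofing automatically. Categories: AI Crawler, AI Search, AI Assistant.

```javascript
export default {
  async fetch(request, env, ctx) {
    const cf = request.cf || {};
    const ua = request.headers.get('user-agent') || '';
    // Workers expose the category in camelCase; the snake_case name is the
    // Rules-language field, kept here only as a defensive fallback.
    const aiCategory = cf.verifiedBotCategory ?? cf.verified_bot_category;

    if (aiCategory && /AI/.test(aiCategory)) {
      const payload = {
        category: aiCategory,
        ua,
        url: new URL(request.url).pathname,
        sigAgent: request.headers.get('signature-agent') ?? null,
        // botManagement is only populated on plans with Bot Management.
        verifiedBot: cf.botManagement?.verifiedBot ?? cf.verifiedBot ?? false,
        ts: Date.now(),
      };
      // waitUntil lets the log POST finish after the response is returned.
      ctx.waitUntil(
        fetch(env.COLLECT_ENDPOINT, {
          method: 'POST',
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify(payload),
        }).catch(() => {})
      );
    }

    // Pass the request through to the origin unchanged.
    return fetch(request);
  },
};
```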
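Outside Cloudflare, the same spoofing defense can be approximated by checking the source IP against the ranges each vendor publishes. The sketch below covers only the matching step for IPv4, under stated assumptions: the caller has already loaded the vendor's ranges into an array of CIDR strings (how those files are structured is not covered here), and the function names `ipv4ToInt`, `isInCidr` and `isVerifiedBotIp` are hypothetical.

```typescript
// Convert dotted IPv4 text to an unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside the IPv4 block written as "a.b.c.d/nn".
function isInCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split('/');
  if (!/^\d{1,2}$/.test(bitsText ?? '')) return false;
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base ?? '');
  if (ipInt === null || baseInt === null || bits > 32) return false;
  // Two addresses share a /nn block when they agree after dropping the
  // low (32 - nn) bits; division avoids 32-bit signed shift pitfalls.
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// A hit counts as verified only if its IP sits in one of the vendor's ranges.
function isVerifiedBotIp(ip: string, vendorCidrs: string[]): boolean {
  return vendorCidrs.some((cidr) => isInCidr(ip, cidr));
}
```

In the logging payload this would become one extra boolean field, so spoofed hits can be filtered out at query time rather than dropped at the edge.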

The IndexNow shortcut

IndexNow is the fastest, cheapest tactic in 2026 GEO. Submit new or updated URLs to a single endpoint, Bing revisits within 5 to 30 minutes, and ChatGPT-User picks them up on its next retrieval window. Typical observed speedup over organic discovery on the Bing-to-ChatGPT chain: 24 to 48 hours.

  • Endpoint: POST https://api.indexnow.org/indexnow
  • Key file: a key of 8 to 128 characters (letters, digits and dashes; a hex string works), hosted as a text file containing the key at the site root: https://yourdomain.com/<key>.txt
  • Bulk limit: up to 10,000 URLs per request
  • Free fan-out: a single submission propagates to Bing, Yandex, Seznam and Naver

```http
POST https://api.indexnow.org/indexnow
Content-Type: application/json

{
  "host": "yourdomain.com",
  "key": "a1b2c3d4e5f6",
  "keyLocation": "https://yourdomain.com/a1b2c3d4e5f6.txt",
  "urlList": [
    "https://yourdomain.com/blog/post-1",
    "https://yourdomain.com/pricing",
    "https://yourdomain.com/vs-competitor"
  ]
}
```

Wire IndexNow into your CMS publish hook. Every time a page is created, updated, or its dateModified changes, fire the POST. For time-sensitive content (pricing pages, product announcements, comparison pages), this is the highest-impact technical lever in your stack.
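A publish hook along these lines might look like the sketch below. It builds request bodies in the shape shown above, splits them at the 10,000-URL limit, and posts each one. The endpoint, body fields and limit come from this section; the function names (`buildIndexNowBodies`, `submitToIndexNow`), the de-duplication, and the choice to skip URLs on other hosts are assumptions for illustration.

```typescript
interface IndexNowBody {
  host: string;
  key: string;
  keyLocation: string;
  urlList: string[];
}

const INDEXNOW_ENDPOINT = 'https://api.indexnow.org/indexnow';
const MAX_URLS_PER_REQUEST = 10000;

// True when the URL parses and belongs to the given host.
function isOnHost(url: string, host: string): boolean {
  try {
    return new URL(url).host === host;
  } catch {
    return false; // unparseable URLs are skipped rather than submitted
  }
}

// Build one request body per chunk of at most 10,000 unique same-host URLs.
function buildIndexNowBodies(host: string, key: string, urls: string[]): IndexNowBody[] {
  const unique = Array.from(new Set(urls.filter((u) => isOnHost(u, host))));
  const bodies: IndexNowBody[] = [];
  for (let i = 0; i < unique.length; i += MAX_URLS_PER_REQUEST) {
    bodies.push({
      host,
      key,
      keyLocation: `https://${host}/${key}.txt`,
      urlList: unique.slice(i, i + MAX_URLS_PER_REQUEST),
    });
  }
  return bodies;
}

// POST each body in turn and return the HTTP status codes for logging.
async function submitToIndexNow(bodies: IndexNowBody[]): Promise<number[]> {
  const statuses: number[] = [];
  for (const body of bodies) {
    const res = await fetch(INDEXNOW_ENDPOINT, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(body),
    });
    statuses.push(res.status);
  }
  return statuses;
}
```

From a CMS hook, the call would be roughly `submitToIndexNow(buildIndexNowBodies('yourdomain.com', key, changedUrls))`, with the returned statuses written to the same store as the bot hits so the "IndexNow last submitted" column has a source.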

5-step technical wiring playbook

Run these in order, allow about half a day end to end. After step 5 you have the only thing in this entire stack that no dashboard tool reconstructs for you: a defensible link from crawl event to citation outcome.

1. Install bot logging

Drop the middleware/Worker above. Log every AI bot hit to a structured store (Postgres, BigQuery, ClickHouse) with bot, class, url, ip, signature_agent, ts. Within 24 hours you have your first GPTBot and ChatGPT-User hits.

2. Configure robots.txt

Deploy the maximum-visibility template. For each retrieval bot, fetch the live file with that bot's user-agent: curl -A "ChatGPT-User/1.0" https://yourdomain.com/robots.txt. A 200 response confirms your CDN or WAF is not blocking the user-agent, and reading the returned file is where the most common bug, a stale Disallow: / on a retrieval bot, surfaces.

3. Add IndexNow

Generate the key file, host it at the site root, and POST every new or updated URL to api.indexnow.org/indexnow. Wire it into the CMS publish hook so every save fires a submission. The ChatGPT speedup begins immediately.

4. Set up the dashboard

One view: rows = URLs, columns = (last GPTBot hit, last ChatGPT-User hit, last OAI-SearchBot hit, last Perplexity-User hit, IndexNow last submitted). This becomes the operator's daily standup.

5. Tie crawl to citation

When your citation tracker reports a citation for /blog/foo, join it back to the bot-hit log. Pattern: OAI-SearchBot crawl → citation in ChatGPT within 24 to 72 hours. Citations without a prior crawl mean the model is leaning on stale training data, so move the page to retrieval-friendly patterns. Crawls without citations mean the passage shape is off, not the wiring.
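The dashboard view in the playbook is a pivot over the bot-hit log: one row per URL, one "last hit" column per bot. A minimal in-memory sketch follows, assuming hits have already been read from the store as `{ bot, url, ts }` records and IndexNow submissions as a URL-to-timestamp map. The names `BotHit`, `DashboardRow` and `buildDashboardRows` are hypothetical, and in production the same pivot would more likely be a SQL query.

```typescript
interface BotHit {
  bot: string;
  url: string;
  ts: number; // epoch milliseconds, as logged by the middleware
}

interface DashboardRow {
  url: string;
  lastHit: Record<string, number>; // bot name -> latest hit timestamp
  indexNowLastSubmitted: number | null;
}

// The four bot columns named in the dashboard step.
const DASHBOARD_BOTS = ['GPTBot', 'ChatGPT-User', 'OAI-SearchBot', 'Perplexity-User'];

function buildDashboardRows(
  hits: BotHit[],
  indexNowSubmissions: Record<string, number>,
): DashboardRow[] {
  const rowsByUrl = new Map<string, DashboardRow>();
  for (const hit of hits) {
    if (DASHBOARD_BOTS.indexOf(hit.bot) === -1) continue; // not a dashboard column
    let row = rowsByUrl.get(hit.url);
    if (row === undefined) {
      row = {
        url: hit.url,
        lastHit: {},
        indexNowLastSubmitted: indexNowSubmissions[hit.url] ?? null,
      };
      rowsByUrl.set(hit.url, row);
    }
    // Keep only the most recent timestamp per bot.
    const previous = row.lastHit[hit.bot];
    if (previous === undefined || hit.ts > previous) row.lastHit[hit.bot] = hit.ts;
  }
  // Stable, readable ordering for a daily review.
  return Array.from(rowsByUrl.values()).sort((a, b) => a.url.localeCompare(b.url));
}
```

Note that this sketch only emits rows for URLs with at least one tracked bot hit. URLs that were submitted through IndexNow but never crawled would need a second pass if you want them listed.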

Tying crawl to citation

The join everyone misses. Your bot log says OAI-SearchBot fetched /pricing at 14:32 on Tuesday. Your citation tracker says ChatGPT cited /pricing at 09:14 on Thursday. The roughly 43-hour gap is the retrieval signal. Four patterns matter, each with a different action.

Crawl-to-citation patterns and actions

| Pattern | What it means | Action |
| --- | --- | --- |
| Crawl in last 72h, then citation | Active retrieval. Your content is being fetched and chosen. | Double down on the page shape. Replicate on adjacent pages. |
| No crawl, but citation appears | The model is using stale training data; your passage was indexed long ago. | Switch the page to retrieval-friendly patterns (FAQ schema, dated changelog block, named sources). Push via IndexNow. |
| Crawl in last 72h, no citation | The model fetched but did not pick. Your passage shape is the bottleneck, not your wiring. | Rewrite the first 80 words of the H2 most likely targeted by the prompt. Re-measure in 7 days. |
| Neither crawl nor citation | The page is invisible. Either the URL is not indexable or the prompt panel does not include this category. | Check robots.txt, sitemap, and Bing Webmaster. Verify the category is in your prompt panel. |

"Citation without crawl is luck. Crawl without citation is a rewrite. Both signals beat the dashboard average, because they tell you what to do, not just what is happening."
Internal Clairon playbook · Technical wiring principle #4
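The four patterns reduce to two yes/no questions: was there a citation, and was there a crawl in the 72 hours before it (or before now, if there is no citation yet). The sketch below encodes that decision for a single URL. The 72-hour window and the four outcomes come from the table; the pattern labels and the function name `classifyCrawlToCitation` are hypothetical, and the crawl timestamps are assumed to be retrieval-class hits for that URL pulled from the bot log.

```typescript
type CrawlCitationPattern =
  | 'active-retrieval' // crawl in window, then citation
  | 'stale-training'   // citation with no crawl in window
  | 'passage-shape'    // crawl in window, no citation
  | 'invisible';       // neither

const WINDOW_MS = 72 * 60 * 60 * 1000;

// Recommended next step per pattern, condensed from the table.
const NEXT_ACTION: Record<CrawlCitationPattern, string> = {
  'active-retrieval': 'Double down on the page shape and replicate it.',
  'stale-training': 'Move to retrieval-friendly patterns and push via IndexNow.',
  'passage-shape': 'Rewrite the opening of the targeted H2, re-measure in 7 days.',
  'invisible': 'Check robots.txt, sitemap, Bing Webmaster and the prompt panel.',
};

function classifyCrawlToCitation(
  crawlTimestamps: number[], // epoch ms of retrieval-class hits on the URL
  citationTs: number | null, // epoch ms of the citation, or null if none
  now: number,               // epoch ms used as reference when no citation exists
): CrawlCitationPattern {
  // Measure the window backwards from the citation, or from now without one.
  const reference = citationTs ?? now;
  const crawledInWindow = crawlTimestamps.some(
    (ts) => ts <= reference && reference - ts <= WINDOW_MS,
  );
  if (citationTs !== null) {
    return crawledInWindow ? 'active-retrieval' : 'stale-training';
  }
  return crawledInWindow ? 'passage-shape' : 'invisible';
}
```

With the /pricing example, a crawl on Tuesday at 14:32 and a citation on Thursday at 09:14 sit well inside the 72-hour window, so the URL classifies as active retrieval and `NEXT_ACTION` points at replicating the page shape.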

Where to go deeper

Wiring is one block of the broader GEO stack. The companion articles cover what to measure, the tool comparison, and the weekly cadence the wiring feeds into.

Tools tell you what happened. Wiring tells you why. Both are useful. Only one is reproducible without a vendor invoice.

Frequently asked questions

Do AI bots respect robots.txt in 2026?
Most reputable bots do, including OpenAI (GPTBot, ChatGPT-User, OAI-SearchBot), Anthropic (ClaudeBot, Claude-User), Perplexity, and Google's training crawlers. Google's user-triggered fetchers explicitly do not. Bytespider and xAI's Grok crawler have documented non-compliance. The pragmatic answer: state your policy in robots.txt, then enforce non-compliance at the WAF for the bots that ignore it.
What is the difference between GPTBot and ChatGPT-User?
GPTBot is the offline training crawler. Its effects show up 3 to 12 months later in the model's default knowledge. ChatGPT-User is the live, per-query retrieval crawler that fires when a user asks ChatGPT a question and the model decides to fetch a source. Blocking ChatGPT-User makes you instantly invisible in ChatGPT answers, whereas blocking GPTBot only affects future training cycles. The two are not interchangeable in robots.txt.
Should I block Google-Extended?
Only if you want to opt out of Gemini and Vertex training. Google-Extended is an opt-out token, not a crawler. It sends zero HTTP requests and exists only as a robots.txt directive. Blocking it does not affect Google Search, Googlebot, or AI Overviews ranking. Allow it by default in 2026 because Gemini citation share is non-trivial. Disallow only if legal or licensing demands it.
How do I detect spoofed AI bots?
User-agent strings are trivially spoofable. Three layers of defense, in order of effort: cross-check the source IP against vendor-published JSON files (openai.com/gptbot.json, perplexity.com/perplexitybot.json), use Cloudflare's verified-bot category field which validates IP and behavior, or accept signed Signature-Agent headers (Web Bot Auth, which builds on RFC 9421 HTTP Message Signatures) where supported. UA-only filtering is no longer enough as of 2026.
Does IndexNow work for ChatGPT directly?
No. ChatGPT does not run IndexNow. But ChatGPT browse pulls from Bing's index, and roughly 87% of ChatGPT citations match Bing's top 10 organic results (Seer Interactive, 2025). An IndexNow ping to Bing typically surfaces in ChatGPT 24 to 48 hours sooner than waiting for organic discovery. The chain is IndexNow to Bing to ChatGPT, not IndexNow to ChatGPT directly.
What is the minimum log-parsing setup if I do not have Cloudflare?
A single awk one-liner on Nginx or Apache access logs is enough to count: awk -F'"' '{print $6}' access.log | grep -iE "gptbot|chatgpt-user|oai-searchbot|claudebot|perplexitybot" | sort | uniq -c | sort -rn. Run nightly via cron, ship deltas to Slack. You will have your first GPTBot hits within 24 hours and your first ChatGPT-User hits within a week on most B2B SaaS sites.
Do I need llms.txt as well as robots.txt?
robots.txt first, llms.txt as a bonus. robots.txt is the only mechanism with broad bot compliance today, and every major AI crawler reads it. llms.txt is a markdown summary file aimed at LLMs as a hint about which content to prioritize. Useful, not yet load-bearing. Ship robots.txt correctly first, then add llms.txt at /llms.txt with your most citable pages and definitions linked in priority order.
How do I correlate a bot hit with a citation outcome?
Join your bot-hit log (URL, bot, timestamp) to your citation tracker output (URL, AI engine, prompt, timestamp). The signal pattern: a retrieval-class crawl (OAI-SearchBot, PerplexityBot, ChatGPT-User) within the 72-hour window before a citation means your content is being actively retrieved. Citations without a prior crawl mean the model is using stale training data, so switch to retrieval-friendly content patterns. Crawls without citations mean your passage shape is off, not your wiring.
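For teams that would rather keep the log-parsing answer above in the same codebase as the rest of the wiring, the awk one-liner can be approximated in TypeScript. This is a sketch under two assumptions: the access log uses the combined format, where the user-agent is the sixth double-quote-delimited field, and the function name `countAiBotHits` is hypothetical. Unlike the awk version, which counts distinct user-agent strings, it counts hits per bot token.

```typescript
// Lowercase tokens from the grep pattern in the answer above.
const AI_BOT_TOKENS = ['gptbot', 'chatgpt-user', 'oai-searchbot', 'claudebot', 'perplexitybot'];

// Count AI bot hits per token in combined-format access log text.
function countAiBotHits(logText: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of logText.split('\n')) {
    // Same field as awk -F'"' '{print $6}': index 5 after splitting on quotes.
    const userAgent = (line.split('"')[5] ?? '').toLowerCase();
    const token = AI_BOT_TOKENS.find((t) => userAgent.indexOf(t) !== -1);
    if (token !== undefined) counts[token] = (counts[token] ?? 0) + 1;
  }
  return counts;
}
```

Run it over the previous day's log in the nightly job and post the resulting counts, or the delta against the day before, to Slack.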