Most AI visibility tracking lives in dashboards. The dashboards lie about half the time, because they are showing you a derived signal (citations) without showing you the upstream one (crawls). If you cannot see GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot and PerplexityBot in your access logs, you are reasoning about ChatGPT’s behavior with one eye closed.
The other articles in this cluster cover what to measure, which tool to buy, and how to run a weekly cadence. This one is the plumbing. Server logs, robots.txt, IndexNow, and the crawl-to-citation join. If you ship the four blocks below, you have a real-time view of every AI engine that decides to fetch a page on your site, plus a 24 to 48 hour speedup on ChatGPT retrieval that costs you nothing.
3 classes of AI bots that matter
The user-agent zoo is large, but every bot you handle in 2026 falls into one of three classes. Get this taxonomy right and the rest of the wiring decisions become obvious.
- Training crawlers. Build the corpus the model is trained on (GPTBot, ClaudeBot, CCBot, Amazonbot, meta-externalagent). Effects show up 3 to 12 months later in default knowledge. Robots.txt compliance is universal among reputable vendors.
- Live retrieval crawlers. Fire when a user asks the AI a question (ChatGPT-User, OAI-SearchBot, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, DuckAssistBot, Googlebot for AI Mode). Block these and you disappear from the answer instantly. Always allow the retrieval class.
- Opt-out tokens. Send zero HTTP requests (Google-Extended, Applebot-Extended). They exist only as robots.txt directives. Allow to participate in training, disallow to opt out. They never appear in your logs because they never crawl.
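The taxonomy above can be sketched as a small lookup. This is a minimal illustration, not a complete registry: the bot names and classes come from the three bullets, while the `BotEntry` shape and the `classifyUserAgent` helper are hypothetical names introduced here.

```typescript
type BotClass = 'training' | 'retrieval' | 'opt-out-token';

interface BotEntry {
  token: string; // robots.txt token; for real crawlers also a user-agent substring
  botClass: BotClass;
}

// Names and classes are taken from the three bullets above.
const BOTS: BotEntry[] = [
  { token: 'GPTBot', botClass: 'training' },
  { token: 'ClaudeBot', botClass: 'training' },
  { token: 'CCBot', botClass: 'training' },
  { token: 'Amazonbot', botClass: 'training' },
  { token: 'meta-externalagent', botClass: 'training' },
  { token: 'ChatGPT-User', botClass: 'retrieval' },
  { token: 'OAI-SearchBot', botClass: 'retrieval' },
  { token: 'Claude-User', botClass: 'retrieval' },
  { token: 'Claude-SearchBot', botClass: 'retrieval' },
  { token: 'PerplexityBot', botClass: 'retrieval' },
  { token: 'Perplexity-User', botClass: 'retrieval' },
  { token: 'DuckAssistBot', botClass: 'retrieval' },
  { token: 'Googlebot', botClass: 'retrieval' },
  { token: 'Google-Extended', botClass: 'opt-out-token' },
  { token: 'Applebot-Extended', botClass: 'opt-out-token' },
];

// Returns the matching crawler for a logged user-agent string, or null.
// Opt-out tokens are skipped because they never send HTTP requests.
function classifyUserAgent(ua: string): BotEntry | null {
  const haystack = ua.toLowerCase();
  for (const bot of BOTS) {
    if (bot.botClass === 'opt-out-token') continue;
    if (haystack.indexOf(bot.token.toLowerCase()) !== -1) return bot;
  }
  return null;
}
```

Keeping the opt-out tokens in the same table but excluding them from log matching mirrors the point above: they matter for robots.txt, never for access logs.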
The 17-bot reference table
Every AI crawler and opt-out token a B2B SaaS or content publisher should handle explicitly in 2026. User-agent strings are stable substrings, and owners and functions are public. The robots.txt column is our recommended default, with the reasoning baked in.
| Bot | User-agent | Owner | Function | robots.txt default |
|---|---|---|---|---|
| GPTBot | GPTBot/1.2 | OpenAI | Training | Allow (train, opt out for IP) |
| ChatGPT-User | ChatGPT-User/1.0 | OpenAI | Live retrieval | Always Allow |
| OAI-SearchBot | OAI-SearchBot/1.0 | OpenAI | ChatGPT Search index | Always Allow |
| ClaudeBot | ClaudeBot/1.0 | Anthropic | Training | Allow (train, opt out for IP) |
| Claude-User | Claude-User/1.0 | Anthropic | Live retrieval | Always Allow |
| Claude-SearchBot | Claude-SearchBot | Anthropic | Retrieval index | Always Allow |
| PerplexityBot | PerplexityBot/1.0 | Perplexity | Retrieval index | Always Allow |
| Perplexity-User | Perplexity-User/1.0 | Perplexity | Live retrieval | Always Allow |
| Google-Extended | (no requests, token only) | Google | Gemini training opt-out | Allow (or Disallow to opt out) |
| GoogleOther | GoogleOther | Google | R&D / experimental fetch | Allow |
| Googlebot | Googlebot/2.1 | Google | Search + AI Mode + AI Overviews | Always Allow |
| CCBot | CCBot/2.0 | Common Crawl | Open web corpus (feeds many LLMs) | Disallow recommended |
| meta-externalagent | meta-externalagent/1.1 | Meta | Meta AI training | Allow (or Disallow to opt out) |
| Bytespider | Bytespider | ByteDance | Doubao LLM training (undocumented) | Disallow + WAF block |
| Amazonbot | Amazonbot/0.1 | Amazon | Alexa + Nova training | Allow + meta noarchive |
| Applebot-Extended | (no requests, token only) | Apple | Apple Intelligence training opt-out | Allow (or Disallow to opt out) |
| DuckAssistBot | DuckAssistBot/1.0 | DuckDuckGo | DuckDuckGo AI Assist | Always Allow |
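One way to keep the robots.txt defaults in this table in sync with what you deploy is to generate the file from data, so a training opt-out becomes a single flag. The sketch below is illustrative only: the bot lists follow the table, but `buildRobotsTxt`, `robotsGroup` and the `allowTraining` option are hypothetical names, and the opt-out tokens and path-level rules are left out for brevity.

```typescript
// Bot lists follow the table; helper and option names are hypothetical.
const RETRIEVAL_BOTS = [
  'ChatGPT-User', 'OAI-SearchBot', 'Claude-User', 'Claude-SearchBot',
  'PerplexityBot', 'Perplexity-User', 'DuckAssistBot', 'Googlebot',
];
const TRAINING_BOTS = ['GPTBot', 'ClaudeBot', 'meta-externalagent', 'Amazonbot'];
const BLOCKED_BOTS = ['Bytespider', 'CCBot'];

interface RobotsOptions {
  allowTraining: boolean; // false = block training crawlers, keep retrieval open
}

// One robots.txt group: a User-agent line followed by a site-wide rule.
function robotsGroup(agent: string, allowed: boolean): string {
  return 'User-agent: ' + agent + '\n' + (allowed ? 'Allow' : 'Disallow') + ': /';
}

function buildRobotsTxt(options: RobotsOptions): string {
  const groups = RETRIEVAL_BOTS.map((a) => robotsGroup(a, true))
    .concat(TRAINING_BOTS.map((a) => robotsGroup(a, options.allowTraining)))
    .concat(BLOCKED_BOTS.map((a) => robotsGroup(a, false)));
  // Blank lines between groups are conventional and keep the file readable.
  return groups.join('\n\n') + '\n';
}
```

Calling `buildRobotsTxt({ allowTraining: false })` flips only the four training crawlers to Disallow; the retrieval class stays allowed in both variants.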
Sample robots.txt (recommended default)
The maximum-AI-visibility template. This is what we ship to customers who want to be cited everywhere. The publisher template (block training, allow retrieval) is one toggle away: flip the GPTBot, ClaudeBot, meta-externalagent and Amazonbot lines from Allow to Disallow if your legal team requires training opt-out.
```
# === Live retrieval bots: NEVER block ===
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: DuckAssistBot
Allow: /
User-agent: Googlebot
Allow: /

# === Training crawlers ===
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: meta-externalagent
Allow: /
User-agent: Amazonbot
Allow: /

# === Opt-out tokens ===
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /

# === Disallow non-compliant or upstream-leaky ===
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /

# === Universal protections ===
# Note: a crawler matched by a named group above ignores this * group
# (RFC 9309). Repeat these Disallow lines inside a named group if that
# bot must also stay out of these paths.
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```

Bot logging at the origin
The single most useful 30 minutes of work in this entire playbook. Drop the middleware below, log every AI bot hit to a structured store, and you have a real-time view of which engines are reading which pages. Within 24 hours you have your first GPTBot data point.
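To make "structured store" concrete, here is a minimal sketch of the read side: rolling logged hits up into a per-bot, per-page view. The `BotHit` fields mirror the payload the middleware in this section sends (`bot`, `class`, `url`, `ts`); the `summarizeHits` helper and its in-memory aggregation are assumptions for illustration, and a real deployment would more likely run the equivalent query against a database or log pipeline.

```typescript
interface BotHit {
  bot: string;
  class: 'training' | 'retrieval';
  url: string;
  ts: number; // epoch milliseconds
}

interface PageSummary {
  count: number;    // how many times this bot fetched this path
  lastSeen: number; // most recent fetch, epoch milliseconds
}

// Groups hits by "<bot> <path>" so you can see which engine reads which page.
function summarizeHits(hits: BotHit[]): Record<string, PageSummary> {
  const summary: Record<string, PageSummary> = {};
  for (const hit of hits) {
    const key = hit.bot + ' ' + hit.url;
    const row = summary[key] ?? { count: 0, lastSeen: 0 };
    row.count += 1;
    row.lastSeen = Math.max(row.lastSeen, hit.ts);
    summary[key] = row;
  }
  return summary;
}
```

The `lastSeen` value per bot and path is the field the crawl-to-citation join later in this article depends on.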
Vercel Edge Middleware (Next.js)
Drop this in middleware.ts at the root of your Next.js app. It matches AI bot user agents, sends a fire-and-forget POST to your analytics endpoint, and does not block the response.
```ts
import { NextResponse } from 'next/server';
import type { NextFetchEvent, NextRequest } from 'next/server';

// Stable user-agent substrings, matched case-insensitively.
const AI_BOT_PATTERNS = [
  { name: 'GPTBot', re: /GPTBot\//i, class: 'training' },
  { name: 'ChatGPT-User', re: /ChatGPT-User\//i, class: 'retrieval' },
  { name: 'OAI-SearchBot', re: /OAI-SearchBot\//i, class: 'retrieval' },
  { name: 'ClaudeBot', re: /ClaudeBot\//i, class: 'training' },
  { name: 'Claude-User', re: /Claude-User\//i, class: 'retrieval' },
  { name: 'Claude-SearchBot', re: /Claude-SearchBot/i, class: 'retrieval' },
  { name: 'PerplexityBot', re: /PerplexityBot\//i, class: 'retrieval' },
  { name: 'Perplexity-User', re: /Perplexity-User\//i, class: 'retrieval' },
  { name: 'Googlebot', re: /Googlebot\//i, class: 'retrieval' },
  { name: 'Bingbot', re: /bingbot\//i, class: 'retrieval' },
  { name: 'CCBot', re: /CCBot\//i, class: 'training' },
  { name: 'Amazonbot', re: /Amazonbot\//i, class: 'training' },
  { name: 'Bytespider', re: /Bytespider/i, class: 'training' },
];

export function middleware(req: NextRequest, event: NextFetchEvent) {
  const ua = req.headers.get('user-agent') ?? '';
  const hit = AI_BOT_PATTERNS.find((b) => b.re.test(ua));
  if (hit) {
    const payload = {
      bot: hit.name,
      class: hit.class,
      url: req.nextUrl.pathname,
      ip: req.headers.get('x-real-ip') ?? null,
      sigAgent: req.headers.get('signature-agent') ?? null,
      ts: Date.now(),
    };
    // waitUntil keeps the POST alive after the response is returned; an
    // un-awaited fetch can be cancelled once the middleware finishes.
    event.waitUntil(
      fetch('https://collect.yourdomain.com/ai-bot-hit', {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(payload),
      }).catch(() => {}) // logging failures must never affect the page
    );
  }
  return NextResponse.next();
}

// Run on every path.
export const config = { matcher: '/:path*' };
```

Cloudflare Workers (any stack)
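Same logic in a Workers fetch handler. The bonus here: Cloudflare exposes a verified-bot category that validates IP and behavior, defeating UA spoofing automatically. In the Rules language the field is cf.verified_bot_category; in Workers it should surface on request.cf in camelCase as verifiedBotCategory, so the handler checks both spellings. Confirm the property names and plan availability against Cloudflare's current docs. Categories: AI Crawler, AI Search, AI Assistant.

```js
export default {
  async fetch(request, env, ctx) {
    const cf = request.cf || {};
    const ua = request.headers.get('user-agent') || '';
    // Assumption: Workers expose the category in camelCase; the snake_case
    // spelling is the Rules-language field name, kept as a fallback.
    const aiCategory = cf.verifiedBotCategory ?? cf.verified_bot_category;
    if (aiCategory && /AI/.test(aiCategory)) {
      const payload = {
        category: aiCategory,
        ua,
        url: new URL(request.url).pathname,
        sigAgent: request.headers.get('signature-agent') ?? null,
        // verifiedBot sits under botManagement on Bot Management plans.
        verifiedBot: cf.botManagement?.verifiedBot ?? cf.verifiedBot ?? false,
        ts: Date.now(),
      };
      // waitUntil lets the POST finish after the response is returned.
      ctx.waitUntil(
        fetch(env.COLLECT_ENDPOINT, {
          method: 'POST',
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify(payload),
        }).catch(() => {})
      );
    }
    // Pass the request through to the origin unchanged.
    return fetch(request);
  },
};
```

The IndexNow shortcut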
IndexNow is the fastest, cheapest tactic in 2026 GEO. Submit new or updated URLs to a single endpoint, Bing revisits within 5 to 30 minutes, and ChatGPT-User picks them up on its next retrieval window. Typical observed speedup over organic discovery on the Bing-to-ChatGPT chain: 24 to 48 hours.
- Endpoint: POST https://api.indexnow.org/indexnow
- Key file: any 8 to 128 hex chars, hosted at site root: https://yourdomain.com/<key>.txt
- Bulk limit: up to 10,000 URLs per request
- Free fan-out: a single submission propagates to Bing, Yandex, Seznam and Naver
```
POST https://api.indexnow.org/indexnow
Content-Type: application/json

{
  "host": "yourdomain.com",
  "key": "a1b2c3d4e5f6",
  "keyLocation": "https://yourdomain.com/a1b2c3d4e5f6.txt",
  "urlList": [
    "https://yourdomain.com/blog/post-1",
    "https://yourdomain.com/pricing",
    "https://yourdomain.com/vs-competitor"
  ]
}
```

Wire IndexNow into your CMS publish hook. Every time a page is created, updated, or its dateModified changes, fire the POST. For time-sensitive content (pricing pages, product announcements, comparison pages), this is the highest-impact technical lever in your stack.
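A publish hook usually has a list of changed URLs rather than a ready-made request body. The sketch below shows one way to turn that list into IndexNow payloads that respect the 10,000-URL limit. The payload fields match the request above; `buildIndexNowPayloads` is a hypothetical helper, and the same-host filter and de-duplication are assumptions about what a hook would want, not requirements quoted from the protocol.

```typescript
const MAX_URLS_PER_REQUEST = 10000; // bulk limit stated above

interface IndexNowPayload {
  host: string;
  key: string;
  keyLocation: string;
  urlList: string[];
}

// Splits changed URLs into one payload per 10,000 URLs.
function buildIndexNowPayloads(host: string, key: string, urls: string[]): IndexNowPayload[] {
  const origin = 'https://' + host;
  const seen: Record<string, boolean> = {};
  const eligible: string[] = [];
  for (const url of urls) {
    const onHost = url === origin || url.indexOf(origin + '/') === 0;
    if (!onHost || seen[url]) continue; // skip off-host URLs and duplicates
    seen[url] = true;
    eligible.push(url);
  }
  const payloads: IndexNowPayload[] = [];
  for (let start = 0; start < eligible.length; start += MAX_URLS_PER_REQUEST) {
    payloads.push({
      host,
      key,
      keyLocation: origin + '/' + key + '.txt',
      urlList: eligible.slice(start, start + MAX_URLS_PER_REQUEST),
    });
  }
  return payloads;
}
```

Each returned payload can be serialized with `JSON.stringify` and sent as the body of the POST shown above, one request per payload.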
5-step technical wiring playbook
Run these in order and allow about half a day end to end. After step 5 you have the only thing in this entire stack that no dashboard tool reconstructs for you: a defensible link from crawl event to citation outcome.
1. Install bot logging
2. Configure robots.txt
3. Add IndexNow
4. Set up the dashboard
5. Tie crawl to citation
Tying crawl to citation
The join everyone misses. Your bot log says OAI-SearchBot fetched /pricing at 14:32 on Tuesday. Your citation tracker says ChatGPT cited /pricing at 09:14 on Thursday. The roughly 43-hour gap is the retrieval signal. Four patterns matter, each with a different action.
| Pattern | What it means | Action |
|---|---|---|
| Crawl in last 72h, then citation | Active retrieval. Your content is being fetched and chosen. | Double down on the page shape. Replicate on adjacent pages. |
| No crawl, but citation appears | The model is using stale training data; your passage was indexed long ago. | Switch the page to retrieval-friendly patterns (FAQ schema, dated changelog block, named sources). Push via IndexNow. |
| Crawl in last 72h, no citation | The model fetched but did not pick. Your passage shape is the bottleneck, not your wiring. | Rewrite the first 80 words of the H2 most likely targeted by the prompt. Re-measure in 7 days. |
| Neither crawl nor citation | The page is invisible. Either the URL is not indexable or the prompt panel does not include this category. | Check robots.txt, sitemap, and Bing Webmaster. Verify the category is in your prompt panel. |
Citation without crawl is luck. Crawl without citation is a rewrite. Both signals beat the dashboard average, because they tell you what to do, not just what is happening.
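The four rows of the table can be sketched as a small join function. This is an illustration under stated assumptions: the pattern labels, the `joinCrawlToCitation` name and the choice to measure the 72-hour window back from the citation (or from "now" when there is no citation) are all introduced here, not taken from any particular tool.

```typescript
type Pattern =
  | 'active-retrieval'     // crawl in window, then citation
  | 'stale-training-data'  // citation without a recent crawl
  | 'fetched-not-picked'   // crawl in window, no citation
  | 'invisible';           // neither

interface JoinResult {
  pattern: Pattern;
  gapHours: number | null; // hours from latest in-window crawl to citation
}

const HOUR_MS = 60 * 60 * 1000;

// crawlTimes and citationTime are epoch milliseconds for a single URL.
function joinCrawlToCitation(
  crawlTimes: number[],
  citationTime: number | null,
  now: number,
  windowHours: number = 72
): JoinResult {
  // Assumption: the window ends at the citation, or at "now" without one.
  const reference = citationTime ?? now;
  const windowStart = reference - windowHours * HOUR_MS;
  const inWindow = crawlTimes.filter((t) => t >= windowStart && t <= reference);
  if (citationTime !== null && inWindow.length > 0) {
    const latestCrawl = Math.max(...inWindow);
    return { pattern: 'active-retrieval', gapHours: (citationTime - latestCrawl) / HOUR_MS };
  }
  if (citationTime !== null) return { pattern: 'stale-training-data', gapHours: null };
  if (inWindow.length > 0) return { pattern: 'fetched-not-picked', gapHours: null };
  return { pattern: 'invisible', gapHours: null };
}

// The /pricing example: fetched Tuesday 14:32, cited Thursday 09:14.
// The calendar dates are illustrative; only the weekday offsets matter.
const exampleCrawl = Date.UTC(2026, 0, 6, 14, 32);
const exampleCitation = Date.UTC(2026, 0, 8, 9, 14);
const exampleResult = joinCrawlToCitation([exampleCrawl], exampleCitation, exampleCitation);
```

Running this per URL over your bot log and citation tracker export gives each page one of the four labels, which maps directly onto the Action column.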
Where to go deeper
Wiring is one block of the broader GEO stack. The companion articles cover what to measure, the tool comparison, and the weekly cadence the wiring feeds into.
- GEO Tools and Analytics: The Complete Measurement Guide sits one level above this article. Pick your metrics first, then come back here to wire the data layer.
- Best GEO Tools 2026: Honest Teardown of 9 Platforms covers the dashboard layer that consumes this wiring. Profound and Scrunch have the strongest bot-crawl visibility; other tools rely on you wiring it yourself.
- How to Measure GEO Performance: The Weekly Operator’s Playbook runs the weekly cadence that benchmarks this data against citation outcomes. Wire first, measure second.
- How to Optimize for ChatGPT Search: The Bing-First Playbook goes deeper on the ChatGPT-Bing chain that IndexNow exploits, with the 7-tactic playbook for ChatGPT specifically.
Tools tell you what happened. Wiring tells you why. Both are useful. Only one is reproducible without a vendor invoice.







