The most expensive GEO mistake of 2025: silent AI crawler blocks. Cloudflare changed its default in 2024 to block AI bots. Vercel ships strict bot rules. WordPress security plugins block aggressively. 22 of the 60+ B2B SaaS sites we audited had at least one AI crawler blocked without anyone on the team realizing. Every page rewrite, schema build and Reddit reply is wasted if the engine cannot fetch your URL.
Below: the 8 AI bots you need to know, the training vs retrieval distinction (block one, allow the other), the exact robots.txt allowlist, the 5-minute audit, and the Cloudflare / Vercel / WordPress traps that catch most teams.
The 8 AI bots you need to know
| Bot user-agent | Owner | Purpose | Block? |
|---|---|---|---|
| GPTBot | OpenAI | Training (ChatGPT) | Optional |
| OAI-SearchBot | OpenAI | Retrieval (live ChatGPT search) | Never block |
| ChatGPT-User | OpenAI | User-triggered fetches | Allow |
| ClaudeBot | Anthropic | Training (Claude) | Optional |
| Claude-User | Anthropic | User-triggered fetches | Allow |
| Claude-SearchBot | Anthropic | Retrieval (Claude search) | Never block |
| PerplexityBot | Perplexity | Retrieval and indexing | Never block |
| Google-Extended | Google | Training (Gemini, AI Overviews) | Optional |
Three additional bots worth knowing: Bingbot (feeds ChatGPT via the OpenAI/Microsoft partnership, never block), Applebot-Extended (Apple Intelligence training), CCBot (Common Crawl).
The training vs retrieval distinction
The single most important concept to internalize.
- Training bots collect your content to train the model. The model then “knows” about your content. Blocking training stops your content from being baked into the model’s parameters, but the model can still cite you via retrieval.
- Retrieval bots fetch your URL in real-time when a user query needs current information. Blocking retrieval makes you invisible in AI search even if you’re well-known to the model.
You can block training without losing citations, as long as you allow retrieval. The reverse is not true.
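As a concrete sketch, here is a robots.txt that takes that trade: it opts out of training by disallowing the training bots by name while keeping the retrieval bots allowed (user-agent names from the table above; an illustrative fragment, not a recommendation to block training):

```
# Opt out of training, stay citable via retrieval
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
```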
The exact robots.txt allowlist
Drop-in for B2B SaaS, agencies and most content sites. Audited against 2026 bot user-agents.
```
# Allow all major AI crawlers (training + retrieval)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

# Default for all other crawlers
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```

Place this at https://yoursite.com/robots.txt. Validate with `curl https://yoursite.com/robots.txt`. Verify each user-agent is reachable with `curl -A "OAI-SearchBot" https://yoursite.com/`.
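You can also check the file programmatically instead of eyeballing it. A minimal sketch using Python's standard-library `urllib.robotparser`; the allowlist is abridged and inlined as a string for illustration — point `set_url()` at your live robots.txt to test production:

```python
from urllib.robotparser import RobotFileParser

# Abridged version of the allowlist above, inlined for illustration.
ALLOWLIST = """\
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""

AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Google-Extended",
    "Applebot-Extended", "CCBot",
]

parser = RobotFileParser()
parser.parse(ALLOWLIST.splitlines())
# Against the live file instead:
#   parser.set_url("https://yoursite.com/robots.txt"); parser.read()

for agent in AI_AGENTS:
    ok = parser.can_fetch(agent, "https://yoursite.com/any-page")
    print(f"{agent:18} {'allowed' if ok else 'BLOCKED'}")
```

Agents without a named group (e.g. CCBot here) fall through to the `User-agent: *` record, which is why the catch-all `Allow: /` at the bottom matters.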
The Cloudflare / Vercel / WordPress traps
Three platforms account for 80% of accidental AI crawler blocks.
Cloudflare's default block (2024 change)
Cloudflare's 2024 default blocks known AI crawlers; review your bot settings and add allow rules for the user-agents in the table above.
Vercel's strict bot rules
Vercel ships strict bot rules that can return 403s to AI crawlers; add a vercel.json exception for the 8 user-agents.
WordPress security plugins (Wordfence, Sucuri)
Wordfence and Sucuri block unrecognized crawlers aggressively; allowlist the AI user-agents in each plugin's firewall settings.
The cheapest debugging tool for all three: `curl -A "OAI-SearchBot" https://yoursite.com/`. If the response is 403 or 429, you're blocked. If it's 200 with HTML, you're good.
The 5-minute crawlability audit
1. Test each AI bot from the command line: `curl -A "OAI-SearchBot" https://yoursite.com/`. Repeat for ClaudeBot, PerplexityBot, GPTBot and ChatGPT-User. All should return 200 OK.
2. Check robots.txt accessibility: `curl https://yoursite.com/robots.txt`. Should return 200 with content.
3. Validate the user-agents in your robots.txt against the allowlist above.
4. Check sitemap accessibility: `curl https://yoursite.com/sitemap.xml`. AI crawlers use sitemaps to discover new content.
5. Verify your Cloudflare / Vercel / WordPress settings haven't reintroduced a block.
Total time: 5 minutes. Catches 90% of accidental blocks.
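The whole audit can be scripted. A sketch using only the Python standard library (the agent list and `yoursite.com` are placeholders — substitute your own domain; a 403 or 429 status means a firewall rule is in the way):

```python
import urllib.error
import urllib.request

AI_AGENTS = [
    "OAI-SearchBot", "ChatGPT-User", "GPTBot",
    "ClaudeBot", "Claude-SearchBot", "PerplexityBot",
]

def check(url: str, user_agent: str) -> int:
    """Fetch url identifying as user_agent; return the HTTP status code."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # 403 / 429 here usually means the bot is blocked

def audit(base_url: str) -> dict:
    """Run the 5-minute audit: key URLs plus one fetch per AI user-agent."""
    results = {}
    for path in ("/", "/robots.txt", "/sitemap.xml"):
        results[path] = check(base_url + path, "OAI-SearchBot")
    for agent in AI_AGENTS:
        results[agent] = check(base_url + "/", agent)
    return results

# Example usage (network call, so commented out here):
#   for name, status in audit("https://yoursite.com").items():
#       print(f"{name:20} {status} {'OK' if status == 200 else 'BLOCKED?'}")
```

Anything that prints a non-200 status is worth chasing down in your CDN or plugin settings before touching content.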
What’s next
For the page-level optimizations that pay back once your site is crawlable, read How to Get Cited by AI Search Engines.
For the ChatGPT-specific setup including Bing Webmaster Tools and IndexNow, read How to Optimize for ChatGPT Search.
For the 12-week sprint, read How to Do GEO in 2026.
You can spend 12 weeks engineering perfect content for AI search and lose 100% of the citations because Cloudflare’s default blocks the crawler. Audit the foundation before optimizing the content.







