
How to Make Your Site AI-Crawlable: The 8-Bot robots.txt Strategy for 2026

Hugo Debrabandere

Co-founder · Clairon

Apr 29, 2026

The most expensive GEO mistake of 2025: silent AI crawler blocks. Cloudflare changed its default in 2024 to block AI bots. Vercel ships strict bot rules. WordPress security plugins block aggressively. 22 of the 60+ B2B SaaS sites we audited had at least one AI crawler blocked without anyone on the team realizing it. Every page rewrite, schema build and Reddit reply is wasted if the engine cannot fetch your URL.

Below: the 8 AI bots you need to know, the training vs retrieval distinction (block one, allow the other), the exact robots.txt allowlist, the 5-minute audit, and the Cloudflare / Vercel / WordPress traps that catch most teams.

The 8 AI bots you need to know

8 AI bot user-agents and what to do with them
Bot user-agent    | Owner      | Purpose                          | Block?
------------------|------------|----------------------------------|------------
GPTBot            | OpenAI     | Training (ChatGPT)               | Optional
OAI-SearchBot     | OpenAI     | Retrieval (live ChatGPT search)  | Never block
ChatGPT-User      | OpenAI     | User-triggered fetches           | Allow
ClaudeBot         | Anthropic  | Training (Claude)                | Optional
Claude-User       | Anthropic  | User-triggered fetches           | Allow
Claude-SearchBot  | Anthropic  | Retrieval (Claude search)        | Never block
PerplexityBot     | Perplexity | Retrieval and indexing           | Never block
Google-Extended   | Google     | Training (Gemini, AIO)           | Optional

Three additional bots worth knowing: Bingbot (feeds ChatGPT via the OpenAI/Microsoft partnership, never block), Applebot-Extended (Apple Intelligence training), CCBot (Common Crawl).

The training vs retrieval distinction

The single most important concept to internalize.

  • Training bots collect your content to train the model. The model then “knows” about your content. Blocking training stops your content from being baked into the model’s parameters, but the model can still cite you via retrieval.
  • Retrieval bots fetch your URL in real-time when a user query needs current information. Blocking retrieval makes you invisible in AI search even if you’re well-known to the model.

You can block training without losing citations, as long as you allow retrieval. The reverse is not true.
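For example, to opt out of OpenAI training while staying citable in ChatGPT search, a minimal robots.txt fragment looks like this (the full allowlist in the next section takes the allow-everything route instead):

# Block OpenAI training, allow OpenAI retrieval and user fetches
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /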

The exact robots.txt allowlist

Drop-in for B2B SaaS, agencies and most content sites. Audited against 2026 bot user-agents.

robots.txt
# Allow all major AI crawlers (training + retrieval)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: CCBot
Allow: /

# Default for all other crawlers
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Place this at https://yoursite.com/robots.txt. Validate with curl https://yoursite.com/robots.txt. Verify each user-agent is reachable with curl -A "OAI-SearchBot" https://yoursite.com/.
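To verify all ten in one pass, a minimal bash sketch (assumes curl is installed; replace yoursite.com with your domain):

#!/usr/bin/env bash
# Print the HTTP status code each AI crawler user-agent gets from the homepage.
SITE="https://yoursite.com/"
for UA in GPTBot OAI-SearchBot ChatGPT-User ClaudeBot Claude-User \
          Claude-SearchBot PerplexityBot Google-Extended \
          Applebot-Extended CCBot; do
  printf '%-18s %s\n' "$UA" "$(curl -s -o /dev/null -w '%{http_code}' -A "$UA" "$SITE")"
done

Every line should read 200. A 403 or 429 means that bot is blocked somewhere between the edge and your origin.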

The Cloudflare / Vercel / WordPress traps

In the sites we audited, three platforms accounted for roughly 80% of accidental AI crawler blocks.

Cloudflare's default block (2024 change)

In late 2024, Cloudflare flipped its default Bot Management settings to block AI crawlers, including GPTBot, ClaudeBot and OAI-SearchBot. Sites that never edited their bot settings are blocking AI crawlers right now. Fix: log into Cloudflare → Security → Bots → Bot Fight Mode → Configure AI bot rules → Allow the 8 user-agents above.

Vercel's strict bot rules

Vercel applies aggressive bot detection on Pro plans. AI crawlers without browser-like headers can get rate-limited or blocked. Fix: add a vercel.json exception for the 8 user-agents.

WordPress security plugins (Wordfence, Sucuri)

Default rules block any user-agent that doesn't match a curated allowlist, and AI bots are often missing from it. Fix: explicitly add the 8 user-agents to the plugin's bot allowlist.

The cheapest debugging tool for all 3: curl -A "OAI-SearchBot" https://yoursite.com/. If the response is 403 or 429, you're blocked. If 200 with HTML, you're good.
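To see which layer is answering, dump the response headers on a real GET (a sketch; a Cloudflare challenge, for example, typically arrives with a server: cloudflare header next to the 403):

curl -s -D - -o /dev/null -A "OAI-SearchBot" https://yoursite.com/ | head -5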

The 5-minute crawlability audit

Test each AI bot from the command line

curl -A "OAI-SearchBot" https://yoursite.com/. Repeat for ClaudeBot, PerplexityBot, GPTBot, ChatGPT-User. All should return 200 OK.

Check robots.txt accessibility

curl https://yoursite.com/robots.txt. Should return 200 with content.

Validate the user-agents in your robots.txt

Read your robots.txt and confirm none of the AI bots are in a Disallow rule.
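One way to eyeball this without reading the whole file, a grep sketch:

curl -s https://yoursite.com/robots.txt | grep -iE '^(user-agent|allow|disallow)'

If any of the 8 bots appears directly above a Disallow: / line, that bot is blocked.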

Check sitemap accessibility

curl https://yoursite.com/sitemap.xml. Should return 200 with XML; AI crawlers use sitemaps to discover new content.

Verify Cloudflare / Vercel / WordPress settings

Log into each layer and confirm the AI bot allowlist matches the 8 above. Audit quarterly.

Total time: 5 minutes. Catches 90% of accidental blocks.
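Steps 2 through 4 as one bash pass (a sketch; the user-agent loop shown earlier covers step 1, and step 5 stays manual):

SITE="https://yoursite.com"
# 2. robots.txt should return 200
echo "robots.txt: $(curl -s -o /dev/null -w '%{http_code}' "$SITE/robots.txt")"
# 3. flag any Disallow rules for a manual read
curl -s "$SITE/robots.txt" | grep -i 'disallow' || echo "no Disallow rules"
# 4. sitemap should return 200
echo "sitemap.xml: $(curl -s -o /dev/null -w '%{http_code}' "$SITE/sitemap.xml")"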

What’s next

For the page-level optimizations that pay back once your site is crawlable, read How to Get Cited by AI Search Engines.

For the ChatGPT-specific setup including Bing Webmaster Tools and IndexNow, read How to Optimize for ChatGPT Search.

For the 12-week sprint, read How to Do GEO in 2026.

You can spend 12 weeks engineering perfect content for AI search and lose 100% of the citations because Cloudflare’s default blocks the crawler. Audit the foundation before optimizing the content.

Frequently asked questions

Should I block GPTBot to prevent OpenAI from training on my content?
Trade-off, your call. Blocking GPTBot prevents future training. The cost: you may show up less in ChatGPT's organic answers. For most B2B SaaS teams, allowing GPTBot is net-positive.
What about llms.txt? Do I need it?
llms.txt is an emerging standard. It does not replace robots.txt. As of Q2 2026, support is partial across engines. Worth adding for forward compatibility, not yet a critical lever.
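If you do add one, a minimal sketch following the proposed llmstxt.org format (served at /llms.txt; the title, summary and links below are placeholders):

# Yoursite
> One-sentence summary of what Yoursite does and who it is for.

## Docs
- [Getting started](https://yoursite.com/docs/start): setup in five minutes
- [API reference](https://yoursite.com/docs/api): endpoints and authentication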
How often should I audit AI crawlability?
Monthly at minimum, with a comprehensive quarterly pass: run the curl tests from a cloud VM, since some firewalls treat data-center IPs differently than office or home IPs. Cloudflare and Vercel change defaults silently.
What if I want to allow some AI engines and not others?
Possible but rarely useful. The maintenance burden is high and the citation cost of blocking any retrieval bot is significant. Most teams either allow all or block all training.
Does my CDN's IP allowlist matter?
Yes. AI crawler IPs change. AWS, GCP and Azure publish their IP ranges, and AI providers rotate within those ranges. Either allow data-center IP ranges broadly, or filter on the official user-agent strings and don't filter by IP at all.
How do I know my robots.txt is being respected by AI crawlers?
The major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) publicly commit to honoring robots.txt. Smaller or unknown AI bots may not; for those, use IP-level blocking.