Robots.txt · AI Crawlers · GEO · AEO · SEO · GPTBot · ClaudeBot

Robots.txt for AI Crawlers: The 2026 Guide (GPTBot, ClaudeBot, etc.)

How to block (or allow) AI crawlers in robots.txt — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot. Training vs live-answer bots, the right tradeoffs.

Nitish Yadav · May 16, 2026

In 2026 your robots.txt is doing a job it didn't have in 2022: deciding whether AI labs can train on your content, and whether AI assistants can quote you in their answers. The two decisions are different, and most sites get them tangled.

This guide cuts through it. You'll learn how to block AI crawlers (or allow them strategically), which AI crawlers actually exist in 2026, the critical difference between training crawlers and live-answer crawlers, and the recommended robots.txt configuration for three common scenarios — public marketing site, paywalled content, and SaaS product.

What is robots.txt?

Robots.txt is a plain-text file at the root of your website (https://yoursite.com/robots.txt) that tells web crawlers which paths they may or may not visit. It's a voluntary protocol — well-behaved crawlers (Google, Bing, OpenAI, Anthropic, Perplexity) honor it; malicious crawlers ignore it. Robots.txt is not security; sensitive paths still need authentication. It's a coordination layer between site owners and the bots that crawl them.

The format is simple:

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

That tells OpenAI's training crawler to stay out, lets all other crawlers in, and points to the sitemap.

Why does robots.txt matter for AI in 2026?

Three reasons specific to the AI era.

1. Training data control. AI labs (OpenAI, Anthropic, Google, Meta) crawl the public web to train future model versions. If you don't want your content in the next ChatGPT or Claude, robots.txt is the standard mechanism for opting out. There's no formal legal regime yet — robots.txt is the de facto consent layer.

2. AI search visibility. AI assistants now answer queries by fetching live web content at query time. If a user asks ChatGPT "what's the best CRM for SMBs?", ChatGPT may fetch a handful of sites and cite them in its answer. Block those crawlers and you're invisible in AI search.

3. The two are different crawlers. This is the part most sites get wrong. The bot that trains the model is not the same as the bot that fetches your page when a user asks a live question. You can — and probably should — allow one and block the other.

Which AI crawlers exist in 2026?

Here's the canonical list of major AI crawlers as of mid-2026, grouped by purpose:

Training crawlers (build the next model)

User-Agent | Operator | Purpose
GPTBot | OpenAI | Trains ChatGPT models
Google-Extended | Google | Trains Gemini and Vertex AI
ClaudeBot | Anthropic | Trains Claude models
PerplexityBot | Perplexity | Trains Perplexity's models
CCBot | Common Crawl | Public corpus used by many AI labs (open dataset)
Meta-ExternalAgent | Meta | Trains Llama / Meta AI

Blocking these stops your content from being used in the next training cycle. It does not affect whether AI assistants can quote you live.

Live-answer crawlers (fetch your page when a user asks)

User-Agent | Operator | Purpose
ChatGPT-User | OpenAI | ChatGPT live-browse on user request
OAI-SearchBot | OpenAI | ChatGPT search results
Claude-User | Anthropic | Claude.ai live-browse on user request
Claude-SearchBot | Anthropic | Claude.ai search results
Perplexity-User | Perplexity | Perplexity live-browse on user request
Blocking these makes you invisible in AI assistant answers. Most sites should leave these allowed — they're how your content reaches users of AI search products.

How do I block AI crawlers in robots.txt?

For each crawler you want to block, add a User-agent: line and a Disallow: / line. To block all training crawlers but allow live-answer crawlers (the recommended default for marketing sites):

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow everyone else (including AI live-answer crawlers and Google search)
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

That's the "block training, keep live-answer visibility" pattern. Free training data for AI labs goes off the table; your content stays reachable when a user asks ChatGPT or Claude a question that should cite you.

If you want to go further and block everything AI:

User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: ClaudeBot
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: CCBot
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Don't want to write this by hand? Our free Robots.txt Generator does it with first-class toggles for each AI crawler and three presets (allow all, block training only, block all AI). Plus standard rules (Disallow paths, Crawl-delay, Sitemap link).

Should I block AI crawlers? The honest tradeoff

This is where almost every blog gets it wrong. The honest answer depends on what your site is.

Public marketing site (most SaaS, agencies, content sites)

Recommendation: allow everything.

If your goal is awareness, leads, and sales, AI search is a free distribution channel. Blocking AI crawlers doesn't keep your content out of training sets forever: labs can fall back on Common Crawl (which is harder to block selectively), buy training data, or scrape via headless browsers that don't identify themselves. Meanwhile, you've made yourself invisible in ChatGPT, Claude, Perplexity, and Google AI Overviews. The tradeoff is bad.

Paywalled / subscription content (NYT, financial newsletters, premium SaaS docs)

Recommendation: block training crawlers, keep live-answer crawlers allowed (with caveats).

You don't want AI labs to incorporate your premium content into a model that competes with your product. But blocking live-answer crawlers means AI assistants can't cite you, which reduces inbound interest. The compromise: block the training crawlers (GPTBot, Google-Extended, ClaudeBot, PerplexityBot, CCBot, Meta-ExternalAgent), allow the live-answer crawlers (ChatGPT-User, Claude-User, etc.), and keep the genuinely premium content behind login, where no crawler can reach it.
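A minimal robots.txt sketch of that compromise, using the stacked-group syntax from earlier and a hypothetical /members/ path for the paywalled section:

User-agent: GPTBot
User-agent: Google-Extended
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
Disallow: /

# Everyone else (search engines, AI live-answer crawlers) can read public
# pages but not the paywalled path
User-agent: *
Disallow: /members/
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Remember that robots.txt is not access control; the Disallow on /members/ is a courtesy signal, and the login wall is what actually protects the content.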

B2B SaaS product UI (app subdomain, dashboard)

Recommendation: block everything from app.yoursite.com.

Your product UI shouldn't be in AI training data — it's not useful content, it might leak user data via screenshots in training corpora, and it has no SEO value. Block all crawlers from app subdomains with their own robots.txt containing User-agent: * and Disallow: / (shown below). Keep your marketing site (yoursite.com) open to crawlers.
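The entire file served at app.yoursite.com/robots.txt (a placeholder for your real app host) is two lines:

User-agent: *
Disallow: /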

Where do I put robots.txt?

Robots.txt must live at the root of your domain at /robots.txt. Specifically:

  • https://yoursite.com/robots.txt — works
  • https://yoursite.com/seo/robots.txt — ignored
  • https://www.yoursite.com/robots.txt — governs only www.yoursite.com; if your canonical domain is yoursite.com, it's ignored there

Each subdomain needs its own robots.txt. app.yoursite.com/robots.txt is read independently of yoursite.com/robots.txt.

Test your file after upload:

curl -I https://yoursite.com/robots.txt
# Should return 200 OK with Content-Type: text/plain

curl https://yoursite.com/robots.txt
# Should display your rules

In Next.js (App Router), generate robots.txt from the app/robots.ts metadata file so it's dynamic and version-controlled. In WordPress, plugins like Yoast SEO or All in One SEO manage it. On Cloudflare or Vercel, you can also serve robots rules from edge functions.
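For example, here's a minimal app/robots.ts sketch of the "block training, keep live-answer" pattern (yoursite.com is a placeholder):

import type { MetadataRoute } from 'next'

// Next.js serves this object as /robots.txt
export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // AI training crawlers: blocked
      {
        userAgent: [
          'GPTBot',
          'Google-Extended',
          'ClaudeBot',
          'PerplexityBot',
          'CCBot',
          'Meta-ExternalAgent',
        ],
        disallow: '/',
      },
      // Everyone else, including AI live-answer crawlers: allowed
      { userAgent: '*', allow: '/' },
    ],
    sitemap: 'https://yoursite.com/sitemap.xml',
  }
}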

What about the noai, noimageai meta tags?

Some sites use these meta tags instead of (or in addition to) robots.txt:

<meta name="robots" content="noai, noimageai">

Honest status as of 2026: these are mostly aspirational. Google honors some equivalent signals through its Google-Extended user agent and the nosnippet directive. Most other AI labs ignore the noai tag. They're a signal of intent but not a working block.
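By contrast, the nosnippet directive is one Google documents and enforces, limiting how your text can be reused in Google's snippets and AI surfaces:

<meta name="robots" content="nosnippet">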

Robots.txt remains the standard mechanism that actually works (for crawlers that voluntarily comply, which the major AI labs do).

How do I verify a crawler is who it says it is?

Any bot can spoof a user-agent string. To verify that a crawler claiming to be GPTBot or ClaudeBot really belongs to OpenAI or Anthropic:

  1. Reverse DNS lookup. Run a reverse DNS lookup on the requesting IP. Real GPTBot requests come from OpenAI's published IP ranges; the reverse lookup resolves to a hostname under openai.com.
  2. Forward DNS confirmation. The resolved hostname should forward-resolve back to the original IP.
  3. Check published IP ranges. OpenAI publishes its crawler IP list. Anthropic, Perplexity, Google all do too.

For most sites this is overkill — well-behaved crawlers identify themselves correctly. But if you see abusive crawl patterns that claim to be GPTBot, do the reverse-DNS check before reaching out to OpenAI to complain.
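A quick sketch with dig (203.0.113.50 is a made-up documentation IP; substitute one from your own access logs):

# Step 1: reverse DNS on the suspicious IP
dig -x 203.0.113.50 +short
# Real GPTBot traffic resolves to a hostname under openai.com

# Step 2: forward-confirm the hostname from step 1
dig <hostname-from-step-1> +short
# Should print the original IP back; if it doesn't, the request was spoofed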

Pair this with our LLM-Friendly Website Score tool, which audits robots.txt rules, llms.txt presence, and other technical signals AI crawlers care about, and with the AEO Score tool, which audits whether your content is built to be cited (FAQ schema, definitional paragraphs, etc.).

FAQ

Does robots.txt actually stop AI crawlers?

Yes, for crawlers that voluntarily comply with the protocol — which all the major AI crawlers (OpenAI, Anthropic, Google, Perplexity, Meta) do. It does not stop crawlers that ignore it (some smaller AI startups, most malicious scrapers). For absolute control over training data inclusion, you'd need server-side IP blocking or a paywall.

What's the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI's training crawler — it harvests pages to train future ChatGPT models. ChatGPT-User is OpenAI's live-answer crawler — it fetches a page on demand when a ChatGPT user explicitly asks for live information. Blocking GPTBot stops training; blocking ChatGPT-User makes ChatGPT unable to cite your page when answering a user's question. Most sites should block GPTBot but keep ChatGPT-User allowed.

Should I block Google-Extended?

It depends on whether you want your content to train Google Gemini. Google-Extended is Google's training crawler for Gemini and Vertex AI; it is separate from Googlebot (which crawls for search). Blocking Google-Extended does not affect Google search rankings. For most public marketing sites, allowing Google-Extended is fine — the same content already ranks on Google search.

What is CCBot?

CCBot is the Common Crawl bot, run by a nonprofit that publishes a free crawl of the web which many AI labs use as a training corpus (OpenAI, Anthropic, and Meta have all used Common Crawl data). Blocking CCBot prevents your content from entering this open corpus. Note: even if you block CCBot, AI labs can still crawl you directly via their own bots (GPTBot, ClaudeBot, etc.).

Do AI crawlers honor Crawl-delay?

Mixed. Bing and Yandex honor Crawl-delay. Google's main crawler ignores it (Google asks you to set crawl rate in Search Console). The major AI crawlers (GPTBot, ClaudeBot, PerplexityBot) currently do NOT honor Crawl-delay reliably. To control AI crawler load, use server-side rate limiting at the load balancer or WAF layer.
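If you're on Next.js, a rough middleware.ts sketch of that idea (the limits are made up, and the in-memory Map resets per server instance, so production setups rate-limit at the load balancer, WAF, or a shared store like Redis instead):

import { NextRequest, NextResponse } from 'next/server'

const WINDOW_MS = 10_000 // 10-second window
const MAX_HITS = 20      // max requests per crawler per window
const hits = new Map<string, { count: number; start: number }>()

export function middleware(req: NextRequest) {
  const ua = req.headers.get('user-agent') ?? ''
  const bot = /GPTBot|ClaudeBot|PerplexityBot/i.exec(ua)?.[0]
  if (!bot) return NextResponse.next() // not an AI crawler: pass through

  const now = Date.now()
  const entry = hits.get(bot)
  if (!entry || now - entry.start > WINDOW_MS) {
    hits.set(bot, { count: 1, start: now }) // start a fresh window
    return NextResponse.next()
  }
  entry.count += 1
  if (entry.count > MAX_HITS) {
    // Ask the crawler to back off instead of serving the page
    return new NextResponse('Too Many Requests', {
      status: 429,
      headers: { 'Retry-After': '10' },
    })
  }
  return NextResponse.next()
}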

How often do AI crawlers update their list of user-agents?

OpenAI, Anthropic, and Perplexity have added new user-agents in 2024-2025 (the live-answer crawlers are newer than the training crawlers). Check the operator's official docs annually. Our Robots.txt Generator is updated whenever a major AI lab announces a new user-agent.

What's "noindex" vs robots.txt Disallow?

Disallow: in robots.txt tells crawlers not to fetch the page. <meta name="robots" content="noindex"> (and the equivalent HTTP header) tells crawlers they may fetch the page but shouldn't index it in search results. They serve different purposes and shouldn't be confused. To remove a page from Google search, use noindex; Disallow won't remove an already-indexed page, because Google can't see the noindex directive on a page it's blocked from fetching.

Where can I see if my robots.txt is being honored?

Check your server logs for the user-agent strings of the bots you've blocked. If you see GPTBot requests after Disallow-ing it, OpenAI is either (a) ignoring your robots.txt (unlikely — they comply), (b) the requests are spoofed (verify via reverse DNS), or (c) you haven't deployed the file yet (curl https://yoursite.com/robots.txt to confirm).
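A quick sketch, assuming nginx-style access logs in the default combined format at a typical path (adjust for your stack):

# Count GPTBot hits since the block went live
grep -c "GPTBot" /var/log/nginx/access.log

# See which paths it's still requesting
grep "GPTBot" /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head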

TL;DR

  • Robots.txt at /robots.txt controls which crawlers can read your site, including the 11 major AI crawlers in 2026
  • AI crawlers split into training (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Meta-ExternalAgent) and live-answer (ChatGPT-User, OAI-SearchBot, Claude-User, Claude-SearchBot, Perplexity-User)
  • The best default for most marketing sites: allow everything (don't kill your AI search visibility)
  • For paywalled content: block training, keep live-answer
  • For app subdomains: block all AI crawlers

Use the Robots.txt Generator — three presets, all 11 major AI crawlers as toggles, output ready to paste. Pair it with the LLM-Friendly Website Score to audit how AI crawlers see your site as a whole.

Ready to try InsiteChat?

Create an AI chatbot trained on your website in minutes.

Get started free