Three jobs, three kinds of crawler

Every major engine runs more than one crawler, and they do not do the same thing. Sorting them is the whole point, because allowing the wrong one gets you nothing and blocking the wrong one can cost you citations you wanted.

A training crawler collects content that may feed a model's training. Allowing or blocking it is a stance on whether your work is used to train AI; it does not decide whether you get cited in answers. A search crawler builds the index the engine reads from when it answers a live question, so this is the one that must be able to reach you for a citation to be possible. A user-initiated fetcher visits a specific page when a person's request sends the engine there, and because a user triggered the visit, these fetchers generally ignore robots.txt.

So "allow AI crawlers" really means one thing for visibility: make sure the search crawlers can read you. The rest is a separate decision.

Which crawler to allow, by engine

These are the search crawlers - the user-agents that earn citations - confirmed against each provider's own documentation:

  • OpenAI / ChatGPT: allow OAI-SearchBot. It is "used to surface websites in search results in ChatGPT's search features," and opted-out sites "will not be shown in ChatGPT search answers." GPTBot is training-only; ChatGPT-User is the user fetch.
  • Anthropic / Claude: allow Claude-SearchBot, which Anthropic describes as navigating the web "to improve search result quality" and "the relevance and accuracy of search responses." ClaudeBot is the training crawler; Claude-User is the user fetch.
  • Perplexity: allow PerplexityBot, which is "designed to surface and link websites in search results on Perplexity" and which Perplexity says respects robots.txt. Perplexity-User is the live, user-triggered fetch and generally ignores robots.txt.
  • Google: allow Googlebot, the regular search crawler. Google's AI Overviews and AI Mode draw from the same index, so there is no separate "AI" crawler to allow. The Google-Extended token only governs training use of your content and has no effect on Search or AI Overviews.

Microsoft Copilot leans on Bing's index, so if you already allow Bingbot for Bing search, you are covered there too. For newer engines the same shape holds: find the search crawler and allow it; the training crawler is optional.
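
One way to put the sorting above to work is to tag crawler hits in your server logs by role. The sketch below is a minimal illustration, not an official list: the tokens and roles are the ones named in this section, while CRAWLER_ROLES and classify_user_agent are hypothetical names, and matching by substring is an assumption about how the tokens appear inside full user-agent strings. Google-Extended is described here as a robots.txt token, so it is left out of this user-agent table.

```python
from typing import Optional, Tuple

# Token -> (provider, role), taken from the list in this section.
# "search" is the role that has to reach you for a citation to be possible.
CRAWLER_ROLES = {
    "OAI-SearchBot": ("OpenAI", "search"),
    "GPTBot": ("OpenAI", "training"),
    "ChatGPT-User": ("OpenAI", "user"),
    "Claude-SearchBot": ("Anthropic", "search"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-User": ("Anthropic", "user"),
    "PerplexityBot": ("Perplexity", "search"),
    "Perplexity-User": ("Perplexity", "user"),
    "Googlebot": ("Google", "search"),
    "Bingbot": ("Microsoft", "search"),
}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (provider, role) if the string contains a known token, else None."""
    lowered = user_agent.lower()
    for token, info in CRAWLER_ROLES.items():
        # Assumption: the token appears verbatim (case-insensitively)
        # somewhere in the full user-agent string.
        if token.lower() in lowered:
            return info
    return None
```

Calling classify_user_agent on each user-agent string in an access log gives a quick count of how many hits come from search crawlers versus training crawlers or user fetches.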

A robots.txt that allows the search crawlers

In robots.txt, anything not disallowed is already allowed, so most sites do not need to add anything - they need to make sure they are not blocking these bots. The usual culprit is a blanket rule like this, which turns everyone away, AI search crawlers included:

User-agent: *
Disallow: /
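
Both claims, that an empty policy allows everyone and that the blanket rule blocks everyone, can be checked locally before you touch a live file. The sketch below uses Python's standard urllib.robotparser as a stand-in for the crawlers' own parsers, which is an assumption: real crawlers may resolve edge cases differently. allowed_crawlers is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# The four search crawler tokens discussed in this article.
SEARCH_CRAWLERS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot"]


def allowed_crawlers(robots_txt: str, path: str = "/") -> dict:
    """Map each search crawler token to whether robots_txt lets it fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in SEARCH_CRAWLERS}


# The blanket rule shown above.
BLANKET_BLOCK = "User-agent: *\nDisallow: /\n"

blocked = allowed_crawlers(BLANKET_BLOCK)   # every value is False
open_by_default = allowed_crawlers("")      # every value is True
```

Pasting your own robots.txt text into allowed_crawlers shows at a glance whether a wildcard group is catching the search crawlers.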

If you want to state intent explicitly, or you need to carve the search crawlers out of a broader block, name them and allow them:

# Allow the AI search crawlers - these decide whether you can be cited
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

If your goal is to allow citations while opting out of training, keep the Allow rules above and add a separate Disallow group for the training crawlers (GPTBot, ClaudeBot) plus a Google-Extended opt-out. The two choices do not interfere with each other, because each crawler follows only the group that names its own user-agent token.
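
The split policy described here, search crawlers allowed and training tokens disallowed, can be generated and sanity-checked in a few lines. This is a sketch under stated assumptions: build_robots_txt and policy_report are hypothetical helpers, the output is one possible layout rather than a required one, and Python's urllib.robotparser only approximates how each provider's crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

SEARCH_CRAWLERS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot"]
TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended"]


def build_robots_txt(allow, disallow):
    """Render one group per token: Allow: / for allow, Disallow: / for disallow."""
    groups = [f"User-agent: {token}\nAllow: /" for token in allow]
    groups += [f"User-agent: {token}\nDisallow: /" for token in disallow]
    # A blank line between groups keeps each token in its own group.
    return "\n\n".join(groups) + "\n"


def policy_report(robots_txt, tokens, path="/"):
    """Map each token to whether the rendered policy lets it fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in tokens}


ROBOTS_TXT = build_robots_txt(SEARCH_CRAWLERS, TRAINING_TOKENS)
REPORT = policy_report(ROBOTS_TXT, SEARCH_CRAWLERS + TRAINING_TOKENS)
```

In REPORT the four search crawlers come back allowed and the three training tokens come back disallowed, which is the independence the paragraph above describes.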

robots.txt is permission, not access

You can allow every search crawler in robots.txt and still be unreachable, because robots.txt only states a policy - it does not open the door. A CDN bot rule, a web-application firewall, a 403 or 406 response, an aggressive rate limit, or a JavaScript challenge can all turn a crawler away after robots.txt has said yes. In practice that edge block is the more common reason an AI engine cannot read a site, and it is invisible if you only read your robots.txt file.

So verify access rather than assuming it. Request your own page with each search crawler's user-agent string and confirm that a 200 comes back rather than a block page. The full method and the rest of the technical floor are in AI crawler readiness, and our AI crawler checklist walks the live-response test step by step.
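
The live-response check can be sketched with Python's standard library. Treat this as a rough first pass under several assumptions: the user-agent strings below are illustrative placeholders built around each product token (each provider documents its exact string), fetch_status and check_crawler_access are hypothetical helper names, and a request from your own machine may be treated differently from the real crawler if your edge also checks source IP addresses.

```python
import urllib.error
import urllib.request

# Placeholder user-agent strings: only the product tokens come from this
# article. Replace them with the exact strings from each provider's docs.
CRAWLER_USER_AGENTS = {
    "OAI-SearchBot": "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Claude-SearchBot": "Mozilla/5.0 (compatible; Claude-SearchBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "Googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1)",
}


def fetch_status(url, user_agent, timeout=10.0, urlopen=urllib.request.urlopen):
    """Return the HTTP status code the server sends to this user-agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the status code is the signal we want.
        return error.code


def check_crawler_access(url, urlopen=urllib.request.urlopen):
    """Map each crawler token to the status code its user-agent receives."""
    return {
        token: fetch_status(url, agent, urlopen=urlopen)
        for token, agent in CRAWLER_USER_AGENTS.items()
    }
```

Running check_crawler_access against one of your own pages returns a status code per crawler; anything other than 200 is worth tracing to the CDN or firewall rule behind it. A 200 alone is not proof of access, since a JavaScript challenge can arrive with a 200 status, so it is worth reading the response body as well.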

Allowing them is the floor, not the finish

Allowing the crawlers makes a citation possible; it does not make it happen. Once an engine can read you, it still only cites you if your page hands it a clean answer and trusted sources already name you. Being perfectly crawlable and still uncited is common - it is the whole finding of our Invisible 10 study, where ten funded vendors with readable sites drew zero citations across 600 responses on the four largest engines. Crawlability is the precondition; what turns it into citations is covered in AI search visibility, and the ChatGPT-specific version is in how to get cited by ChatGPT.

How Web Cited helps

Allowing the right crawlers is a five-minute fix once you know which ones matter and where the real block usually hides. Our AI crawler checklist covers the robots.txt and live-response checks, and the free 10-minute AI search audit shows you where you stand right now. To see whether the engines actually reach and cite you once access is open, the Free Snapshot gives a current read, and the SXO Audit runs 25 buyer prompts across six engines with three trials each over time so you can watch your citation share move.


By the Web Cited Editorial Research Team. Last updated 1 June 2026.