Why this matters now

Gartner projects that 25% of traditional search volume migrates to AI-assistant interfaces by the end of 2026. The Invisible 10 study Web Cited ran in May 2026 measured what that means in practice for one category: 10 funded B2B compliance vendors tested against 600 LLM responses across ChatGPT, Claude, Gemini, and Perplexity returned zero citations of any of the named brands. The same 600 responses repeatedly named a different set of vendors entirely: TalentLMS at 107 mentions, LogicGate at 73, AuditBoard at 69.

One reason for the gap is content. Another reason, and the easier one to fix, is that some sites in the study were not readable to the AI crawlers in the first place. One of the 10 vendors blocked our discovery user-agent at the homepage entirely. The other nine each had between 3 and 15 detectable homepage fixes before their content even had a chance to be indexed. The technical floor is closer to a CDN-rule change than a six-month content roadmap.

This checklist is the floor: the things to fix before any content strategy. If your homepage fails any of these, the AI engines whose buyers you want to reach are reading a 403 page instead of your product positioning.

The four bots that matter

The list of bots that crawl the web on behalf of AI engines is longer than four, but four cover the buyer-facing surface area for most B2B brands as of mid-2026:

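  • GPTBot (OpenAI) - crawls pages that may be used to train OpenAI's models. User-agent contains GPTBot; documented at platform.openai.com/docs/gptbot. OpenAI documents separate agents for ChatGPT's live search and browsing (OAI-SearchBot and ChatGPT-User), so an allow rule for GPTBot alone does not cover them.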
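  • ClaudeBot (Anthropic) - collects web content that may contribute to training Anthropic's models; Anthropic lists separate agents (Claude-User, Claude-SearchBot) for user-triggered fetches and search. User-agent contains ClaudeBot; documented at docs.claude.com.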
  • PerplexityBot (Perplexity) - feeds Perplexity's live web answers. User-agent contains PerplexityBot; documented at docs.perplexity.ai.
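  • Google-Extended (Google) - a robots.txt token rather than a separate crawler: it is the opt-in/out signal for Gemini (formerly Bard) training and grounding, and it has no user-agent string of its own. Google's regular crawler (Googlebot) still indexes for AI Overviews regardless. Documented in Google's crawler overview.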

Two more worth knowing about: OAI-SearchBot (a separate OpenAI agent for ChatGPT search results, formerly SearchGPT, distinct from GPTBot) and anthropic-ai (an older Anthropic identifier some sites still allow-list for backward compatibility). Both appear in real logs. OAI-SearchBot is worth allowing explicitly, since OpenAI documents it as the agent that surfaces sites in ChatGPT search.

The 10 checks

Each one is testable in under a minute. Each one is a real-world failure mode we have seen in the field.

1. Your homepage returns 200 to GPTBot, ClaudeBot, and PerplexityBot

The check that matters most. Open a terminal and run:

curl -sI -A "GPTBot/1.1 (+https://openai.com/gptbot)" https://your-domain.com/
curl -sI -A "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)" https://your-domain.com/
curl -sI -A "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)" https://your-domain.com/

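Every response should start with HTTP/2 200 or HTTP/1.1 200. If you see 403, 406, 429, or a redirect to a CAPTCHA challenge, your CDN or WAF is probably blocking the bot, and that is the most expensive default to leave in place. One caveat: these requests come from your own IP with a spoofed user-agent, so a WAF that verifies bots by IP range may reject you while admitting the real crawler. Confirm against your server logs before changing rules.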
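The same check can be scripted. What follows is a minimal sketch in Python's standard library, not a definitive tool: the function names (classify_status, head_status, check_homepage) and the verdict labels are made up for illustration, and the user-agent strings are simplified versions that only need to contain the bot token.

```python
import urllib.error
import urllib.request

# Simplified user-agent strings; each contains the token the bot is matched on.
BOT_USER_AGENTS = {
    "GPTBot": "GPTBot/1.1 (+https://openai.com/gptbot)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0; "
                     "+https://perplexity.ai/perplexitybot)",
}

# Status codes the checklist treats as a CDN or WAF block.
BLOCK_CODES = {403, 406, 429}


def classify_status(status: int) -> str:
    """Map an HTTP status code to a verdict (labels are illustrative)."""
    if status == 200:
        return "ok"
    if status in BLOCK_CODES:
        return "blocked"
    if 300 <= status < 400:
        return "redirect"
    return "other"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Surface 3xx responses instead of silently following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def head_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send a HEAD request (like curl -I) and return the status code."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx and 5xx responses arrive here


def check_homepage(url: str) -> dict:
    """Return {bot name: verdict} for each user-agent above."""
    return {
        bot: classify_status(head_status(url, ua))
        for bot, ua in BOT_USER_AGENTS.items()
    }
```

Calling check_homepage("https://your-domain.com/") returns one verdict per bot; anything other than "ok" deserves a look at the CDN or WAF logs.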

2. Your robots.txt does not Disallow them

Fetch https://your-domain.com/robots.txt and look for any of these:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

If you see Disallow: / against any of those user-agents, you have opted out of being read by the corresponding engine. Some teams add these defensively after a security advisory, then forget. Others have them set by a CMS plugin's default-on setting. Either way, the file is the contract; the engines respect it.

A Crawl-delay directive set high (anything above 10 seconds) is also worth a look. It is not a block, and it is a non-standard directive that some crawlers ignore, but a crawler that honors it slows down enough that a freshly published page may take days to enter the engine's index.
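Reading a long robots.txt by eye is error-prone, so here is a small sketch that runs the file through Python's urllib.robotparser. The function name audit_robots and the report layout are illustrative, and Python's matching rules may differ in edge cases from the parser each crawler actually uses, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser

# The user-agent tokens listed in the robots.txt example above.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "anthropic-ai", "Google-Extended"]


def audit_robots(robots_txt: str, url: str = "https://your-domain.com/",
                 max_delay: int = 10) -> dict:
    """Report, per AI user-agent, whether `url` is allowed and how large Crawl-delay is."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    report = {}
    for agent in AI_AGENTS:
        delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
        report[agent] = {
            "allowed": parser.can_fetch(agent, url),
            "crawl_delay": delay,
            "slow": delay is not None and delay > max_delay,
        }
    return report
```

Feed it the text of your robots.txt (for example, the output of curl -s https://your-domain.com/robots.txt) and look for any agent with "allowed" set to False or "slow" set to True.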

3. Your product positioning is server-rendered into the homepage HTML

AI crawlers are not browsers. They fetch HTML and parse text. If your homepage's headline, product description, and pricing live inside a React or Vue component that hydrates after page load, the bot's view of your homepage is the empty shell.

Test it the same way the crawler does:

curl -sA "GPTBot/1.1" https://your-domain.com/ | grep -i "your-product-name"

If your brand name does not appear in the curl output, the bot does not see it. Fix: server-side render the above-the-fold content, or pre-render at build time.
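One weakness of the grep test is that it also matches a brand name buried inside a script tag, for example in a JSON state blob that only becomes visible text after hydration. A stricter sketch, assuming the HTML has already been fetched, ignores script, style, and template content; the class and function names are illustrative.

```python
from html.parser import HTMLParser

# Elements whose contents are not visible text to a non-rendering crawler.
SKIPPED_TAGS = {"script", "style", "template"}


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def brand_in_static_html(html: str, brand: str) -> bool:
    """True if `brand` appears in text a crawler can read without running JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return brand.lower() in " ".join(parser.chunks).lower()
```

A False result on HTML where grep found a match usually means the name only exists inside client-side data, which is the empty-shell case described above.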

4. You have a meaningful page title and meta description

This sounds like 2010 SEO advice because it is. AI engines lean on the same metadata SEO has used for two decades: the <title>, the meta description, the h1. If your homepage title is Home | Acme, that is a title that does no work in any search context, AI or otherwise. Title under 60 characters, description under 160, h1 that matches the title intent.
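The length limits are easy to check mechanically. This is a rough sketch that assumes the homepage HTML is already in a string; the names HomepageMetadata and metadata_problems are illustrative, and whether the h1 matches the title's intent still needs a human read.

```python
from html.parser import HTMLParser


class HomepageMetadata(HTMLParser):
    """Collect the first <title>, the first <h1>, and the meta description."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "h1": "", "description": ""}
        self._open = None   # tag whose text is currently being collected
        self._seen = set()  # only the first title and h1 are recorded

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in ("title", "h1") and tag not in self._seen:
            self._open = tag
            self._seen.add(tag)
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.fields["description"] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open:
            self.fields[self._open] += data


def metadata_problems(html: str, max_title: int = 60,
                      max_description: int = 160) -> list:
    """Return a list of problems; an empty list means the basics are in place."""
    parser = HomepageMetadata()
    parser.feed(html)
    parser.close()
    title = " ".join(parser.fields["title"].split())
    description = parser.fields["description"]
    h1 = " ".join(parser.fields["h1"].split())

    problems = []
    if not title:
        problems.append("title missing")
    elif len(title) >= max_title:  # the checklist asks for under 60 characters
        problems.append("title too long")
    if not description:
        problems.append("description missing")
    elif len(description) >= max_description:
        problems.append("description too long")
    if not h1:
        problems.append("h1 missing")
    return problems
```

The check only measures presence and length. A title like Home | Acme passes it while still doing no work, so read the values as well as counting them.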

5. You have an Organization schema in JSON-LD

Add one <script type="application/ld+json"> block to your <head> that declares your organization name, URL, logo, and contact point using Schema.org's Organization type. The fields are well documented. Validate with Google's Rich Results Test.

This is what helps a crawler reconcile "the brand we just crawled" with "the entity ChatGPT already has facts about." Without it, the crawler has to guess from your domain and page content. With it, the link is explicit.
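For illustration, here is one way to generate that block. The organization details passed in are placeholders, the function name organization_jsonld is made up, and the property set (name, url, logo, contactPoint) is a minimal subset of what Schema.org's Organization type allows.

```python
import json


def organization_jsonld(name: str, url: str, logo: str, contact_email: str) -> str:
    """Render a minimal Schema.org Organization block for the page <head>."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "logo": logo,
        "contactPoint": {
            "@type": "ContactPoint",
            "contactType": "sales",  # placeholder; use the role that applies
            "email": contact_email,
        },
    }
    # Escape "</" so a stray "</script>" in a value cannot end the block early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

Paste the returned string into the homepage template, then run the page through a structured-data validator to confirm it parses.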

6. Your sitemap.xml exists, lives at /sitemap.xml, and is current

If you have a sitemap, the crawler does not have to walk your link graph from the homepage. If your sitemap is six months stale, the crawler indexes six-month-old URLs. The fastest check:

curl -sI https://your-domain.com/sitemap.xml | head -1
curl -s https://your-domain.com/sitemap.xml | grep -o "<loc>" | wc -l

Expect HTTP/2 200 on the first command and a non-zero count on the second. Bonus check: regenerate the sitemap when content changes, not on a fixed cron.
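Counting entries says nothing about freshness. As a sketch, assuming the sitemap XML has already been downloaded and uses the standard sitemaps.org namespace, the newest lastmod value gives a rough staleness signal; the function name and the 180-day threshold are illustrative.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace used by standard sitemap and sitemap-index files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_summary(xml_text: str, today: date, max_age_days: int = 180) -> dict:
    """Count <loc> entries and flag the sitemap as stale if its newest <lastmod> is old."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]

    lastmods = []
    for el in root.findall(".//sm:lastmod", NS):
        if not el.text:
            continue
        try:
            # Keep only the YYYY-MM-DD part of a full timestamp.
            lastmods.append(date.fromisoformat(el.text.strip()[:10]))
        except ValueError:
            continue  # skip coarser or malformed dates such as "2025"

    newest = max(lastmods) if lastmods else None
    stale = newest is not None and (today - newest).days > max_age_days
    return {"url_count": len(locs), "newest_lastmod": newest, "stale": stale}
```

A sitemap with no lastmod values at all comes back with newest_lastmod set to None and stale set to False, which means "unknown" rather than "fresh".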

7. You have an llms.txt file in the root

llms.txt is a 2024 proposal (llmstxt.org) for a Markdown file at the site root that gives LLMs a curated index of the site's most important pages. Adoption is still partial. The cost of adding one is small (a static file with links and one-line summaries), and crawlers that respect it skip directly to the canonical pages instead of guessing from the navigation. Worth adding even before broad adoption stabilizes.
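To make the format concrete, here is a small generator sketch that follows the layout the llmstxt.org proposal describes: an H1 with the site name, a blockquote summary, then H2 sections containing link lists. The function name and the sample page data are hypothetical.

```python
def render_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render llms.txt Markdown.

    `sections` maps a section heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(render_llms_txt(
        "Acme Comply",  # placeholder brand
        "Compliance training for B2B teams.",
        {"Product": [("Pricing", "https://your-domain.com/pricing", "Plans and limits")]},
    ))
```

Save the output as /llms.txt at the site root and regenerate it whenever the list of canonical pages changes.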

8. Your homepage is reachable without a JavaScript challenge

Cloudflare's "Bot Fight Mode" and "Super Bot Fight Mode", AWS WAF's bot-control rule, Akamai's Bot Manager, Imperva, and most commercial WAF products ship with rules that flag non-browser user-agents and serve a JavaScript challenge or an outright block. The default state of those rules on a fresh deployment is often "challenge anything that does not look like a real browser." AI crawlers do not run JavaScript challenges. They get the challenge response and move on.

Cloudflare specifically: check the Bot Management section of your zone settings. The "Verified Bots" allow-list as of mid-2026 includes GPTBot, ClaudeBot, and PerplexityBot by default in most plans, but the allow-list only applies if Verified Bots is enabled. Enable it. (Google-Extended is a robots.txt token rather than a crawler, so it does not appear in bot lists; Google's AI features fetch as Googlebot.)
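A challenge page can be recognized from the response itself. The sketch below is a heuristic under stated assumptions: Cloudflare documents a cf-mitigated: challenge response header on its challenge pages, while the body markers are guesses at common challenge wording and will not match every WAF vendor. The function name is illustrative.

```python
# Assumed phrases that often appear on challenge or block pages (heuristic only).
BODY_MARKERS = ("captcha", "enable javascript", "checking your browser")


def looks_like_challenge(status: int, headers: dict, body: str = "") -> bool:
    """Guess whether a response is a bot challenge rather than the real page."""
    lowered = {key.lower(): str(value).lower() for key, value in headers.items()}
    # Header Cloudflare sets on challenge responses.
    if lowered.get("cf-mitigated") == "challenge":
        return True
    # Otherwise fall back to status code plus body wording.
    if status in (403, 429, 503):
        text = body.lower()
        return any(marker in text for marker in BODY_MARKERS)
    return False
```

Run it on the status, headers, and body returned for each AI user-agent; a True result means the bot is seeing a challenge where a human would see the homepage.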

9. Your homepage does not paywall, gate, or redirect on first visit

Cookie banners that block content with an opaque modal, geo-redirects that send first-time visitors to a country-specific subdomain, and login-walls on what should be public marketing pages all degrade what the crawler sees. A bot that does not click "Accept All" reads the un-dismissed banner as the page's primary content. A bot that does not have a US IP gets sent to the EU page, and the EU page might have different content. If your homepage requires an interaction to expose the actual content, the AI engines will index whatever is visible without the interaction.

10. Your homepage's primary content is the same across the AI crawlers' user-agents

Some sites serve different content to crawlers than to humans. Sometimes accidentally (a CDN cache key includes the user-agent), sometimes intentionally (an old anti-scraping rule), and sometimes as a side effect of A/B-testing infrastructure. Fetch the homepage as a regular browser and as each of the AI crawler user-agents, and diff the returned HTML. If the diff is large, you are cloaking in effect, whether or not anyone intended it. Cloaking degrades indexing across every engine.
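"Large" can be given a number. This sketch compares two already-fetched HTML strings line by line with difflib; the function names and the 0.9 threshold are arbitrary starting points, not a standard, and pages with per-request tokens or timestamps will never score exactly 1.0.

```python
import difflib


def content_similarity(browser_html: str, bot_html: str) -> float:
    """Return a 0.0 to 1.0 similarity ratio between two HTML responses, line by line."""
    browser_lines = [ln.strip() for ln in browser_html.splitlines() if ln.strip()]
    bot_lines = [ln.strip() for ln in bot_html.splitlines() if ln.strip()]
    return difflib.SequenceMatcher(None, browser_lines, bot_lines).ratio()


def differs_too_much(browser_html: str, bot_html: str, threshold: float = 0.9) -> bool:
    """Flag a bot response that diverges from the browser response (threshold is a guess)."""
    return content_similarity(browser_html, bot_html) < threshold
```

Compare the browser response against each bot response in turn; a flagged pair is worth opening in a real diff tool to see whether the difference is a block page, a cache artifact, or an experiment variant.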

The common defaults that quietly block these bots

We see the same handful of root causes in practice. None of them are obscure; all of them are defaults someone enabled and forgot about.

  • Cloudflare Bot Fight Mode set to "Super Bot Fight Mode" or "Bot Fight Mode" without "Verified Bots" enabled. The Cloudflare-defined "Verified Bots" category covers the major AI crawlers, but only when Verified Bots is on. Without it, the bots are challenged or blocked alongside scrapers.
  • WAF rule sets from AWS, Akamai, or Imperva that include a "non-browser user-agent" or "missing-cookies" rule with a default action of BLOCK or CHALLENGE. AI crawlers send a non-browser User-Agent header and have no session cookies. They fail this rule by design.
  • CMS plugins (especially WordPress security plugins like Wordfence, iThemes Security, and others) that ship with a "block known bots" list configured to include AI crawlers. The list is updated by the plugin vendor; we have seen vendors add AI crawlers to the default block-list after a customer complaint about training data, then never remove them.
  • Geographic IP blocks set to deny non-US or non-EU traffic. AI crawlers run from a small set of cloud-provider IP ranges; if the range is in a denied geography, the bot does not reach your origin.
  • SPA-only rendering with no SSR / pre-render. The bot fetches the HTML, sees an empty <div id="app">, and indexes the loading placeholder or nothing. Search-engine SEO advice from 2018 already covered this; the AI-crawler implication is identical.
  • An old robots.txt with Disallow: / for one of the AI user-agents, added as a defensive measure during a 2023 training-data debate and never removed.

What this checklist does not cover

This is the technical floor. It gets the AI crawler reading your site. It does not get you cited.

Citation is a separate layer. To be cited, your brand needs to be referenced by content the AI engines have already indexed: third-party comparison posts, Wikipedia mentions, podcast transcripts, YouTube captions, review sites, analyst writeups. A site that aces this checklist but has zero third-party mentions is a readable site that AI engines simply do not have a reason to mention. That layer is what Web Cited's SXO Audit measures end-to-end - the homepage technical baseline plus the citation footprint plus the per-buyer-prompt scoreboard across the six largest LLM engines.

The technical floor is still the first thing to fix. If the bot cannot read the homepage, no amount of citation work helps.

Want this run against your site?

Web Cited's SXO Audit runs all 10 checks plus another ~25 on the homepage, plus measures whether your brand actually gets cited across ChatGPT, Claude, Gemini, Perplexity, Grok, and DeepSeek for the buyer prompts you care about. Often delivered within hours. $5,000 fixed. No sales call.

Order an audit

Or run the curl commands above yourself. The point of publishing the checklist is that the technical floor is not proprietary.