Why robots.txt Matters More Than Ever in 2026
For over 30 years, robots.txt has been the standard way websites communicate with web crawlers. It is the first file any well-behaved bot checks before crawling your site. But in 2026, the robots.txt landscape has fundamentally changed.
Alongside Googlebot and Bingbot, there are now over a dozen major AI crawlers actively scanning the web. OpenAI's GPTBot, Anthropic's ClaudeBot, Perplexity's PerplexityBot, and others are constantly indexing content to power AI search experiences used by hundreds of millions of people.
The stakes are real: if your robots.txt blocks these AI crawlers, your content will not appear in ChatGPT Search results, Perplexity answers, Google AI Overviews, or Claude responses. That means you are missing out on a rapidly growing source of referral traffic and brand visibility.
On the other hand, if you want to protect premium or paywalled content from being used for AI training, your robots.txt is still your primary tool for controlling access. This guide will help you make the right decisions for your website, with concrete code examples you can copy and paste today.
What Are AI Crawlers?
AI crawlers are automated bots deployed by AI companies to fetch web content. Unlike traditional search engine crawlers that primarily index pages for link-based search results, AI crawlers collect content that is used to:
Power real-time AI search answers (ChatGPT Search, Perplexity, Google AI Overviews)
Train and fine-tune large language models
Enable AI-powered browsing features when users ask an AI to visit a URL
Build knowledge graphs used for AI reasoning and fact-checking
Here are the major AI crawlers you should know about in 2026, along with their User-Agent strings:
| Crawler | Company | User-Agent | Purpose |
|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Powers ChatGPT Search and trains OpenAI models |
| ChatGPT-User | OpenAI | ChatGPT-User | ChatGPT browsing mode when users ask it to visit pages |
| OAI-SearchBot | OpenAI | OAI-SearchBot | Dedicated search crawler for OpenAI search features |
| ClaudeBot | Anthropic | ClaudeBot | Claude AI information gathering and search |
| PerplexityBot | Perplexity | PerplexityBot | Powers Perplexity Search -- a major AI search engine |
| Google-Extended | Google | Google-Extended | Gemini AI and Google AI Overviews training data |
| Bytespider | ByteDance | Bytespider | TikTok AI and ByteDance model training |
| Amazonbot | Amazon | Amazonbot | Alexa answers, Amazon AI, and product discovery |
| Applebot-Extended | Apple | Applebot-Extended | Apple Intelligence, Siri, and Safari AI features |
| CCBot | Common Crawl | CCBot | Open dataset used by many AI companies for training data |
| Meta-ExternalAgent | Meta | Meta-ExternalAgent | Meta AI, Llama models, and Facebook/Instagram AI features |
Key insight: Each of these crawlers respects the robots.txt standard independently. That means you can allow some AI crawlers while blocking others -- giving you granular control over which AI platforms can access your content.
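Because each crawler is matched against its own User-agent block, you can verify a policy programmatically. Here is a minimal sketch using Python's standard urllib.robotparser (the rules and URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: welcome GPTBot, block Bytespider entirely.
robots_txt = """\
User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))      # True
print(parser.can_fetch("Bytespider", "https://example.com/blog/post"))  # False
# A crawler with no matching block (and no User-agent: * rule) is allowed by default:
print(parser.can_fetch("ClaudeBot", "https://example.com/blog/post"))   # True
```

The last line demonstrates the default-allow behavior: a User-Agent with no rules of its own, and no wildcard block to fall back on, is permitted.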
How to Allow AI Crawlers
If you want AI search engines to index your content (recommended for most websites), you need to explicitly allow their crawlers. Here is how to configure your robots.txt:
Option 1: Allow all AI crawlers
In fact, no rules are needed at all: if your robots.txt does not specifically block a User-Agent, it is allowed by default. Explicitly allowing each crawler, however, makes your intent unambiguous and is better practice:
# Allow all AI crawlers (explicit)
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Meta-ExternalAgent
Allow: /

Option 2: Allow with selective restrictions
You can allow AI crawlers access to your public content while protecting specific directories:
# Allow AI crawlers with selective restrictions
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Allow: /products/
Disallow: /members/
Disallow: /admin/
Disallow: /api/
User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Allow: /products/
Disallow: /members/
Disallow: /admin/
Disallow: /api/

Pro tip: If you are already allowing Googlebot to crawl your site with a User-agent: * wildcard, AI crawlers are technically already allowed. But adding explicit rules for each AI crawler makes your intent clear and gives you finer control.
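Before deploying path-level rules like these, you can sanity-check them locally. A quick sketch with Python's urllib.robotparser (a simplified implementation that applies rules in file order rather than Google's longest-match semantics, so keep test cases unambiguous):

```python
from urllib.robotparser import RobotFileParser

# A trimmed copy of the selective rules above, for local verification.
rules = """\
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /members/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

for path in ("/blog/hello-world", "/docs/api", "/members/profile", "/admin/login"):
    verdict = "allowed" if parser.can_fetch("GPTBot", f"https://example.com{path}") else "blocked"
    print(f"{path}: {verdict}")
```

Expected result: the /blog/ and /docs/ paths come back allowed, /members/ and /admin/ come back blocked.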
How to Block Specific AI Crawlers
There are legitimate reasons to block certain AI crawlers. Common scenarios include:
You run a premium content site and don't want your articles used to train AI models
You want to allow AI search engines but block training-only crawlers
You have concerns about a specific company's data practices
Your content licensing agreements restrict AI training usage
You want to reduce server load from aggressive crawlers
Block all AI crawlers
# Block all known AI crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /

Block training crawlers, allow search crawlers
A popular middle-ground approach is to block crawlers that primarily train models while allowing those that power search features:
# Allow AI search crawlers
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Google-Extended
Disallow: /

Important: Blocking GPTBot will prevent your content from appearing in ChatGPT Search results, but it does not retroactively remove data already collected. The distinction between "training" and "search" crawlers is evolving. OpenAI now offers OAI-SearchBot as a separate search-only crawler.
The Recommended robots.txt for Most Websites
For the majority of websites -- blogs, SaaS products, e-commerce stores, documentation sites -- you want maximum visibility. Here is the robots.txt template we recommend:
# ============================================
# robots.txt - Optimized for AI + Traditional SEO
# Generated with SEOScanHQ (seoscanhq.com)
# ============================================
# Default: allow all crawlers
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /private/
# ---- AI Search Crawlers (allow) ----
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
# ---- Training-only crawlers (optional: block) ----
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
# ---- Sitemap ----
Sitemap: https://yoursite.com/sitemap.xml

This template follows a clear philosophy:
Allow all search-oriented AI crawlers so your content appears in AI-powered search results
Explicitly block private directories (admin, API, user accounts) from all crawlers
Optionally block training-only crawlers (CCBot, Bytespider) that do not provide search visibility in return
Include your sitemap URL so all crawlers can discover your full page inventory
Common Mistakes
Using Disallow: / for all bots without exceptions
A blanket rule ('User-agent: *' followed by 'Disallow: /') blocks every crawler, including Googlebot and all AI bots. This makes your site invisible to both traditional and AI search engines. Instead, use specific User-agent rules for each crawler you want to block.
Wrong syntax or typos in User-Agent names
The Robots Exclusion Protocol (RFC 9309) specifies case-insensitive User-Agent matching, but not every crawler implements the spec faithfully, so 'User-agent: gptbot' may not match GPTBot everywhere. Always use the exact User-Agent string published by each AI company (e.g., GPTBot, ClaudeBot, PerplexityBot).
Forgetting about Crawl-delay
Some AI crawlers can be aggressive. While not all bots respect Crawl-delay, adding it can help protect your server. A value of 10 (seconds between requests) is reasonable for most sites.
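For example (Crawl-delay is non-standard, and some major crawlers, including Googlebot, ignore it entirely):

```
User-agent: Bytespider
Crawl-delay: 10
```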
Placing robots.txt in the wrong directory
robots.txt must be at the root of your domain (yoursite.com/robots.txt). Placing it at yoursite.com/public/robots.txt or in a subdirectory will not work.
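If your framework does not serve static files from the domain root, you can expose the file at the root path explicitly. A minimal sketch with Python's standard http.server, purely to illustrate the idea (a production site would normally configure its web server or framework instead):

```python
import http.server
import threading
import urllib.request

ROBOTS_TXT = b"User-agent: *\nAllow: /\n"

class RobotsHandler(http.server.BaseHTTPRequestHandler):
    """Serve robots.txt at the domain root, regardless of where it lives on disk."""

    def do_GET(self):
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(ROBOTS_TXT)))
            self.end_headers()
            self.wfile.write(ROBOTS_TXT)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Quick self-check: start the server on a free port and fetch the file.
server = http.server.HTTPServer(("127.0.0.1", 0), RobotsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/robots.txt"
with urllib.request.urlopen(url) as resp:
    status, ctype = resp.status, resp.headers["Content-Type"]
print(status, ctype)  # 200 text/plain; charset=utf-8

server.shutdown()
```

Note the handler answers at /robots.txt exactly; a file reachable only at /public/robots.txt would never be requested by crawlers.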
Not including a Sitemap directive
Always include your Sitemap URL at the bottom of robots.txt. This helps AI crawlers discover all your pages, not just those they find through links.
Blocking AI crawlers but expecting AI search visibility
If you block GPTBot, your content will not appear in ChatGPT Search results. If you block PerplexityBot, Perplexity cannot cite your pages. Blocking and then complaining about low AI visibility is a common contradiction.
Testing Your robots.txt
After updating your robots.txt, verify everything works correctly using these steps:
1. Browser check: Visit yoursite.com/robots.txt in your browser. You should see the raw text content, not an HTML page or a 404 error.
2. Syntax validation: Ensure there are no typos in User-agent names. Each block needs a User-agent line followed by at least one Allow or Disallow line.
3. Google Search Console: Use the robots.txt report in Google Search Console (which replaced the retired robots.txt Tester) to verify Google can fetch and parse your file.
4. HTTP headers check: Use curl -I yoursite.com/robots.txt to verify the file returns HTTP 200 with content-type text/plain. If it returns 301 or 404, crawlers may not read it.
5. AI-specific scan with SEOScanHQ: Run a comprehensive AI SEO scan that checks your robots.txt configuration for all 11 major AI crawlers, identifies missing rules, and validates your overall AI-readiness.
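The robots.txt portion of such a scan can be sketched in a few lines with Python's standard urllib.robotparser, checking every crawler from the table above against your rules (fetching, syntax linting, and the other checkpoints are out of scope here):

```python
from urllib.robotparser import RobotFileParser

# User-Agent strings from the crawler table earlier in this guide.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
    "Google-Extended", "Bytespider", "Amazonbot", "Applebot-Extended",
    "CCBot", "Meta-ExternalAgent",
]

def audit(robots_txt: str, site: str = "https://example.com") -> dict[str, bool]:
    """Return {crawler: allowed_at_root} for each known AI crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, f"{site}/") for bot in AI_CRAWLERS}

# Check a trimmed copy of the recommended template's training-crawler blocks:
template = """\
User-agent: *
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

for bot, allowed in audit(template).items():
    print(f"{bot:20s} {'allowed' if allowed else 'blocked'}")
```

Running this against the recommended template should report every search-oriented crawler as allowed and only CCBot and Bytespider as blocked.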
curl -I https://yoursite.com/robots.txt
# Expected:
# HTTP/2 200
# content-type: text/plain; charset=utf-8

robots.txt vs llms.txt
With the rise of AI search, there is now a second file you should be aware of: llms.txt. While they sound similar, they serve very different purposes.
| Aspect | robots.txt | llms.txt |
|---|---|---|
| Purpose | Access control -- allow or block crawlers | Content description -- help AI understand your site |
| Format | Directive-based (User-agent, Allow, Disallow) | Markdown with headings, descriptions, and links |
| Audience | All web crawlers | AI language models specifically |
| Analogy | Like a bouncer at the door ("you may/may not enter") | Like a tour guide ("here is what we have inside") |
| Required? | Strongly recommended (crawlers assume full access if it is missing) | Recommended for AI visibility |
Bottom line: Use robots.txt to control who can crawl your site, and llms.txt to explain what your site is about. They are complementary. For maximum AI search visibility, you need both. Read our complete llms.txt guide to get started.
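To make the contrast concrete, here is a minimal illustrative llms.txt sketch (the names and URLs are placeholders; see the linked guide for the full format):

```
# YourSite
> A one-sentence summary of what your site offers.

## Docs
- [Getting started](https://yoursite.com/docs/start): installation and setup
- [API reference](https://yoursite.com/docs/api): endpoints and parameters
```

Where robots.txt would only say whether a crawler may enter, this file tells an AI model what it will find once inside.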
Frequently Asked Questions
Should I block or allow AI crawlers in robots.txt?
For most websites, allowing AI crawlers is recommended. Being indexed by AI search engines like ChatGPT Search and Perplexity means more visibility and traffic. Only block AI crawlers if you have specific concerns about content being used for AI training without compensation, or if you run a premium content site behind a paywall.
What is GPTBot and how do I control it with robots.txt?
GPTBot is OpenAI's web crawler that collects data for ChatGPT Search and model training. You can control it by adding 'User-agent: GPTBot' followed by 'Allow: /' (to permit crawling) or 'Disallow: /' (to block it) in your robots.txt file. You can also selectively allow or block specific directories.
Does blocking AI crawlers in robots.txt affect my Google ranking?
Blocking AI-specific crawlers like GPTBot, ClaudeBot, or PerplexityBot does not affect your Google search rankings. These are separate from Googlebot. However, blocking Google-Extended will prevent your content from appearing in Google AI Overviews, which can reduce your overall search visibility.
How do I test if my robots.txt is correctly blocking or allowing AI crawlers?
You can test your robots.txt by visiting yoursite.com/robots.txt in a browser to check the syntax, use Google Search Console's robots.txt tester for Googlebot rules, and use SEOScanHQ to run a comprehensive scan that validates your AI crawler configuration alongside 43 other AI-readiness checkpoints.
What is the difference between robots.txt and llms.txt for AI crawlers?
robots.txt is an access control file that tells crawlers which pages they can or cannot visit. llms.txt is a content description file that helps AI language models understand what your site is about. robots.txt controls access; llms.txt provides context. You should use both for optimal AI search visibility.
Related Resources
Complete llms.txt Guide
Everything you need to know about the llms.txt standard for AI-readable websites.
AI SEO Guide (2026)
Complete guide to optimizing your website for AI search engines.
Structured Data for AI SEO
How to use JSON-LD and Schema.org to boost your AI search visibility.
More Articles
Browse all our AI SEO guides and insights on the blog.