AI bots now account for a quarter of verified bot traffic. What should marketers do?
Published June 12, 2026 · Quratic Team · 11 min read
GPTBot, ClaudeBot, and PerplexityBot are crawling sites at unprecedented scale. Here is how to read the traffic, learn from it, or block it without breaking AI visibility.
Something shifted in your server logs in 2025–2026, even if your analytics dashboard still looks flat. AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Applebot, Meta-ExternalAgent) are now a major share of automated web traffic. According to Cloudflare Radar data cited by Digital Applied, AI-related bots accounted for roughly 26.7% of verified bot traffic in May 2026. Cloudflare CEO Matthew Prince noted in June 2026 that bots overall now generate 57.5% of HTML web traffic on Cloudflare’s network, the first time automated requests exceeded human requests.
That is not the same as AI referral traffic in Google Analytics. Crawl traffic is a separate signal, and the more important one for marketing teams to understand.
Crawl traffic vs referral traffic: do not confuse them
Two metrics get conflated constantly:
| Metric | What it measures | Typical scale |
|---|---|---|
| AI bot crawl traffic | Automated requests to your pages (indexing, training, retrieval) | Large and growing — billions of requests globally |
| AI referral traffic | Human visitors who clicked through from an AI answer | Small today — often 0.1% to 1% of total site traffic |
OnCrawl’s analysis of AI bot behaviour finds that although every site it works with sees some AI-generated traffic in analytics, the median share is only around 0.4% of total visits. Crawl volume is orders of magnitude higher than click-through volume.
This gap matters strategically:
- High crawl, low referral — AI is reading your site but not citing or linking to you yet. That is a content structure and authority problem, not a traffic problem.
- Low crawl, low referral — you may be blocked, hard to parse, or absent from the indexes AI search uses.
- Rising crawl, rising referral — the compounding loop is working.
Marketing teams that only watch GA4 miss the first signal entirely. Server logs and bot analytics show it weeks or months earlier.
How fast AI bot traffic is growing
The trend line is steep:
OpenAI roughly tripled its crawl rate. Botify’s analysis of 7 billion log files (November 2024 – March 2026) found OpenAI’s total crawl activity increased 2.9x for GPTBot and 3.5x for OAI-SearchBot since August 2025. OpenAI’s crawl volume, measured as a share of Google’s, rose from 1.38% to 4% in one year: still small next to Google, but closing fast.
Training crawlers reached roughly half of AI bot traffic. Presenc AI’s Cloudflare Radar analysis reports training crawlers reached 49.9% of all AI bot traffic in Q1 2026, a milestone hit one quarter earlier than predicted. GPTBot, ClaudeBot, Meta-ExternalAgent, and CCBot scrape content for model training, whereas OAI-SearchBot, PerplexityBot, and Claude-SearchBot index content for live AI answers.
New entrants are reshuffling share monthly. Applebot surged after Apple’s Intelligence push (Presenc AI). GPTBot’s share declined two consecutive months in early 2026 as other crawlers grew faster — rankings flip month to month, so a single snapshot misleads.
Publishers are pushing back. OnCrawl notes that according to Cloudflare Radar, GPTBot was the most blocked bot in the world in 2025 — ahead of Googlebot and Bingbot. Blocking is widespread, but often applied with blunt rules that create unintended visibility gaps.
What AI bots are actually doing on your site
Not all AI bot visits serve the same purpose. Digital Applied’s 2026 access control matrix and LumenGEO’s robots.txt guide distinguish three classes:
| Class | Examples | Purpose | Block impact |
|---|---|---|---|
| Training scrapers | GPTBot, ClaudeBot, CCBot, Google-Extended, Meta-ExternalAgent | Feed future model training | No direct citation impact — content excluded from training datasets |
| Retrieval / search crawlers | OAI-SearchBot, PerplexityBot, Claude-SearchBot, Bingbot | Build indexes for AI search answers | Removes you from that engine’s AI answers |
| User-triggered fetchers | ChatGPT-User, Perplexity-User, Claude-User | Fetch a page when a user asks a live question | Removes you from live citations for that session |
The critical mistake: treating all AI bots as one category. Blocking GPTBot (training) does not remove you from ChatGPT search. Blocking OAI-SearchBot or PerplexityBot (retrieval) does.
OpenAI, Anthropic, Google, and Amazon now run separate user-agents for training vs search. SEOLint’s bot reference documents the split — and warns that deprecated strings like Claude-Web no longer match active crawlers.
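To make the three classes concrete, here is a minimal Python sketch that sorts a raw User-Agent header into one of them. The function name `classify_user_agent` and the `BOT_CLASSES` mapping are hypothetical, the token list covers only the bots named in this article, and the sample header strings are illustrative rather than copied from vendor documentation. Google-Extended is left out on purpose: it is a robots.txt control token, not a string that appears in request headers. Bear in mind that User-Agent headers can be spoofed, so treat the result as a first-pass label, not proof of identity.

```python
# Hypothetical mapping of crawler tokens (as named in this article) to the
# three classes in the table above. Extend it as vendors add user-agents.
BOT_CLASSES = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent"],
    "retrieval": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "bingbot"],
    "user_triggered": ["ChatGPT-User", "Perplexity-User", "Claude-User"],
}


def classify_user_agent(user_agent: str) -> str:
    """Return 'training', 'retrieval', 'user_triggered', or 'other'."""
    ua = user_agent.lower()
    for bot_class, tokens in BOT_CLASSES.items():
        # Case-insensitive substring match on the crawler token.
        if any(token.lower() in ua for token in tokens):
            return bot_class
    return "other"


if __name__ == "__main__":
    # Illustrative header strings, not verbatim vendor values.
    for header in (
        "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
        "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0",
    ):
        print(classify_user_agent(header), "<-", header)
```

Tagging each log line with a class before counting makes it obvious whether a spike comes from training scrapers you may want to block or from retrieval bots you probably want to keep.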
How to use AI bot traffic to learn
If AI bots are visiting your site, that traffic is free intelligence — if you know how to read it.
1. Audit server logs, not just analytics
Google Analytics does not show bot crawl volume reliably. Your CDN or origin server logs do.
Look for:
- Which bots hit your site (GPTBot, ClaudeBot, PerplexityBot, Applebot)
- Which URLs they request most (product pages, blog posts, docs, pricing)
- Crawl frequency over time (is indexing accelerating after a content update?)
- Status codes returned (404s and 5xxs waste crawl budget and signal broken pages)
OnCrawl recommends filtering logs for 200 and 304 responses to measure actual content consumption — not just bot hits that bounce off blocked paths.
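The audit above can be sketched in a few lines of Python. This is a minimal example under stated assumptions: your logs use the common "combined" access-log format, and the names `LOG_LINE`, `AI_BOTS`, `audit_log`, and `SAMPLE_LOG` are hypothetical. The sample lines are invented for illustration. Adjust the regular expression if your CDN exports a different layout.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# ip - - [time] "METHOD path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Applebot"]


def audit_log(lines):
    """Count AI bot hits by status, and successfully served pages by path."""
    hits = Counter()      # (bot, status) -> requests
    consumed = Counter()  # (bot, path) -> requests answered with 200 or 304
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ua = match["ua"].lower()
        bot = next((b for b in AI_BOTS if b.lower() in ua), None)
        if bot is None:
            continue  # not one of the bots we are auditing
        status = int(match["status"])
        hits[(bot, status)] += 1
        if status in (200, 304):
            consumed[(bot, match["path"])] += 1
    return hits, consumed


# Invented sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [12/May/2026:10:00:00 +0000] "GET /pricing/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.5 - - [12/May/2026:10:00:02 +0000] "GET /old-page HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '198.51.100.7 - - [12/May/2026:10:01:00 +0000] "GET /compare/ HTTP/1.1" 304 0 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.9 - - [12/May/2026:10:02:00 +0000] "GET /pricing/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]

if __name__ == "__main__":
    hits, consumed = audit_log(SAMPLE_LOG)
    print(hits)
    print(consumed)
```

Keeping `hits` and `consumed` separate mirrors the distinction above: the first shows everything the bots attempted, including 404s, while the second shows only the content they actually received.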
2. Map crawled pages to your GEO prompt library
Cross-reference the URLs AI bots crawl most with the prompts you care about in AI search:
- If PerplexityBot crawls your /compare/ page weekly but never your /pricing/ page, your comparison content may surface in answers while pricing queries do not.
- If blog posts get crawled but product pages do not, your information architecture may be funnelling bots toward top-of-funnel content only.
This is a cheaper diagnostic than running 500 prompts manually — server logs tell you what bots already consider worth indexing.
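One way to run this cross-reference is sketched below. It assumes you already have a list of crawled paths for one bot (for example, from a log audit) and a prompt library that maps each prompt to the site section you would expect to answer it. The function name `crawl_coverage`, the prompts, and the paths are all hypothetical.

```python
from collections import Counter


def crawl_coverage(crawled_paths, prompt_library):
    """For each prompt, count crawl hits on the section expected to answer it.

    crawled_paths: iterable of URL paths requested by one bot.
    prompt_library: dict mapping a prompt to a path prefix such as '/pricing/'.
    A count of 0 flags a prompt whose target section the bot never fetched.
    """
    section_hits = Counter()
    sections = set(prompt_library.values())
    for path in crawled_paths:
        for section in sections:
            if path.startswith(section):
                section_hits[section] += 1
    return {prompt: section_hits[section] for prompt, section in prompt_library.items()}


if __name__ == "__main__":
    # Hypothetical data for illustration.
    crawled = ["/compare/acme-vs-rival", "/compare/acme-vs-other", "/blog/launch-post"]
    library = {
        "acme vs rival comparison": "/compare/",
        "how much does acme cost": "/pricing/",
    }
    for prompt, count in crawl_coverage(crawled, library).items():
        print(f"{count:>3} crawl hits for prompt: {prompt}")
```

Prompts with zero hits are the ones to investigate first: the target page may be blocked, unlinked, or simply missing.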
3. Correlate crawl patterns with AI visibility scores
When crawl frequency on a URL cluster increases, check whether your share of voice in AI answers moves on related prompts 2–4 weeks later. Index updates lag crawls — the correlation is not instant, but teams that track both see leading indicators earlier than competitors watching referrals alone.
4. Identify parseability problems
AI crawlers favour pages that are easy to extract: clear headings, direct answers in opening paragraphs, structured data, clean HTML. If bots crawl a URL repeatedly but you never appear in AI answers for related queries, the page may be indexed but failing retrieval quality gates — the same dynamic described in our Perplexity visibility guide.
Fix the content structure before blocking the bot.
5. Track referral growth separately
SE Ranking data reported by Search Engine Journal shows Claude referral traffic grew 386% from January to April 2026 — still tiny in absolute terms (0.014% of tracked traffic), but the fastest-growing AI source in their dataset. AI platforms combined accounted for 0.33% of traffic in April 2026, up from 0.20% a year earlier.
Set up GA4 segments for chatgpt.com, perplexity.ai, gemini.google.com, and similar referrers. Crawl volume tells you what AI reads; referral traffic tells you what AI sends.
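If you export referrer URLs from analytics or logs, the same segmentation can be sketched in Python. This is a minimal example: the `AI_REFERRER_HOSTS` mapping covers only the three domains named above, and the function names `ai_source` and `ai_referral_share` are hypothetical. Add further hosts as you confirm them in your own data.

```python
from urllib.parse import urlparse

# Referrer domains named in this article; extend with others you observe.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
}


def ai_source(referrer_url):
    """Return the AI platform name for a referrer URL, or None if it is not one."""
    host = (urlparse(referrer_url).hostname or "").lower()
    for domain, name in AI_REFERRER_HOSTS.items():
        # Match the domain itself or any subdomain such as www.perplexity.ai.
        if host == domain or host.endswith("." + domain):
            return name
    return None


def ai_referral_share(referrer_urls):
    """Fraction of visits whose referrer is a known AI platform."""
    urls = list(referrer_urls)
    if not urls:
        return 0.0
    return sum(1 for url in urls if ai_source(url) is not None) / len(urls)
```

Matching on the parsed hostname rather than a plain substring avoids counting lookalike domains as AI referrals.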
How to block AI bots — without breaking AI visibility
Blocking is a legitimate choice — especially for paywalled content, licensed IP, or sites that do not want their content used for model training. But the configuration must be deliberate.
Option A: Block training, allow search (recommended default for most brands)
Block content from being scraped for model training. Stay visible in AI search answers.
# Block training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# Allow AI search and retrieval
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: Claude-User
Allow: /
This follows the approach documented by LumenGEO, Digital Applied, and Molixa.
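Before deploying rules like these, it helps to check that they say what you intend. The sketch below uses Python's standard `urllib.robotparser` on an abbreviated version of the rules. The helper name `check_access` and the `example.com` URL are placeholders. Note that Python's parser is a simple matcher, so it confirms the intent of your file rather than guaranteeing how each vendor's crawler will interpret it.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the rules above, with blank lines between groups.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def check_access(robots_txt, url, agents):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]
    # example.com is a placeholder; substitute one of your own URLs.
    for agent, allowed in check_access(ROBOTS_TXT, "https://example.com/pricing/", agents).items():
        print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

An agent with no matching group and no wildcard group, such as ChatGPT-User in this abbreviated file, is allowed by default, which is why listing it explicitly mainly documents intent.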
Option B: Block everything AI-related
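Use this only if you genuinely want zero visibility in these assistants: no ChatGPT citations, no Perplexity answers, no Claude citations. Google AI Overviews are a Search feature governed by Googlebot and snippet controls, so the rules below do not remove your content from them.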
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
Understand the trade-off: you are opting out of the fastest-growing discovery channel in search. For B2B brands in Asia where buyers increasingly start research in AI assistants, this is a strategic decision — not a technical default.
Option C: Selective path blocking
Allow bots site-wide but protect specific sections:
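# Keep account, API, and internal paths off-limits; everything else stays crawlable
User-agent: GPTBot
Disallow: /account/
Disallow: /api/
Disallow: /internal/
User-agent: PerplexityBot
Disallow: /account/
Disallow: /api/
Disallow: /internal/
Allow: /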
Useful for SaaS products that want marketing pages indexed but not authenticated app content.
Check your CDN — robots.txt may not be enough
Henry David Photography’s crawler audit documents a common failure mode: Cloudflare’s AI Crawl Control blocks bots at the edge even when robots.txt allows them. Over one million sites activated restrictive AI blocking without realising it, per Cloudflare Radar reporting cited in that analysis.
If you use Cloudflare, Vercel, or another CDN with bot management:
- Check Security → Bots → AI Crawl Control (Cloudflare) or equivalent
- Verify edge rules match your robots.txt intent
- Test with a fetch tool or log review; robots.txt alone does not guarantee access
Blocking at the CDN layer silently removes you from AI search even when your robots file says Allow.
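A log review for this failure mode can be sketched as a comparison between what robots.txt intends and what the edge actually returned. The example below is a rough heuristic under stated assumptions: you have per-bot status-code counts from CDN or edge logs (origin logs may never see requests that were blocked upstream), the function name `suspected_edge_blocks` is hypothetical, and both the 50% threshold and the choice of 401, 403, and 429 as "blocked" statuses are illustrative defaults to tune for your setup.

```python
BLOCK_STATUSES = (401, 403, 429)  # assumed indicators of an edge-level block


def suspected_edge_blocks(robots_allows, status_counts, threshold=0.5):
    """List bots that robots.txt allows but that mostly receive block responses.

    robots_allows: dict of bot name -> True if robots.txt allows it.
    status_counts: dict of bot name -> {status_code: request_count} from edge logs.
    """
    flagged = []
    for bot, allowed in robots_allows.items():
        counts = status_counts.get(bot, {})
        total = sum(counts.values())
        if not allowed or total == 0:
            # Intentionally blocked bots are expected to fail. Bots with no
            # logged requests need a manual check; absence proves nothing.
            continue
        blocked = sum(n for status, n in counts.items() if status in BLOCK_STATUSES)
        if blocked / total >= threshold:
            flagged.append(bot)
    return flagged


if __name__ == "__main__":
    # Invented figures for illustration.
    intent = {"PerplexityBot": True, "OAI-SearchBot": True, "GPTBot": False}
    observed = {
        "PerplexityBot": {403: 90, 200: 10},
        "OAI-SearchBot": {200: 50, 304: 5},
        "GPTBot": {403: 100},
    }
    print(suspected_edge_blocks(intent, observed))
```

Any bot this flags is a prompt to open the CDN's bot settings and compare them with your robots.txt, not a confirmed misconfiguration.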
What this means for teams in Asia
Three Asia-specific considerations:
1. English-first crawling, multilingual buyers. Siri AI and most AI crawlers index English content first. If your Singapore or Hong Kong site has English pages crawled heavily but Japanese or Korean localised pages ignored, your country-level AI visibility data will diverge from crawl logs. Track both.
2. China is a separate ecosystem. Mainland China blocks or restricts many Western AI crawlers. Baidu, Doubao, and domestic LLMs use their own bot infrastructure. A robots.txt tuned for GPTBot and ClaudeBot does not address Chinese AI search — a different strategy entirely.
3. Crawl ≠ local answer. AI bots may crawl your global .com site from US infrastructure while buyers in Tokyo or Singapore see different answers from local IP retrieval. Browser-based collection from residential IPs remains the only reliable way to see what local buyers get — bot logs show what was indexed, not what was served.
A decision framework for marketing leaders
| Your goal | Recommended action |
|---|---|
| Maximise AI search visibility | Allow retrieval crawlers; optimise content for extractability; monitor SOV |
| Protect content from training | Block GPTBot, ClaudeBot, CCBot, Google-Extended; keep retrieval bots allowed |
| Reduce server load / costs | Block training crawlers (largest volume); rate-limit via CDN |
| Opt out of AI entirely | Block all AI user-agents + verify CDN settings |
| Understand before deciding | Audit server logs for 30 days; map crawled URLs to prompt library |
Most B2B and consumer brands selling in Asia should start with audit → allow retrieval → block training → measure visibility. Blocking everything by default is the equivalent of adding noindex to your site in 2010 because you did not understand Google yet.
FAQ
Will blocking GPTBot hurt my Google rankings?
No. GPTBot is separate from Googlebot. Blocking GPTBot, ClaudeBot, or CCBot has no effect on traditional Google search rankings. Blocking Google-Extended affects Gemini training data, not Search rankings. Google AI Overviews are part of Search, so they are governed by Googlebot and snippet controls rather than by a separate AI user-agent; different bots, different outcomes. See SEOLint’s breakdown.
How do I see AI bot traffic in Google Analytics?
You generally cannot — GA4 filters most bot traffic. Use server logs, Cloudflare Analytics, or log analysis tools (OnCrawl, Botify, server-side dashboards) for crawl data. Use GA4 referral segments for AI click-through traffic only.
Is AI bot traffic causing my hosting bill to spike?
It can, especially on high-traffic sites with heavy training crawler activity. Blocking GPTBot and ClaudeBot (training) reduces volume significantly while preserving search visibility. CDN caching also helps — bots hitting cached pages cost less than origin requests.
Should I add an llms.txt file?
llms.txt is a proposed convention: a Markdown file at your site root that gives AI systems a curated summary of your site and links to its most important content. Unlike robots.txt, it does not control access. It is not universally honoured yet, but Digital Applied’s access matrix recommends pairing it with robots.txt as bots begin supporting it. Low effort, potentially useful, and not a substitute for robots.txt rules.
If bots crawl my site but I never appear in AI answers, what is wrong?
Usually one of: content not extractable (buried answers, poor structure), missing third-party citations AI trusts, wrong pages being crawled (blog vs product), or retrieval bots blocked at CDN level. Run visibility checks from your target country before assuming blocking is the fix.
Track whether AI search actually mentions your brand — not just whether bots crawl your pages. Start a free trial across six Asian markets.