AIseo Optimizer
SEO & GEO Expert
Expert in AI platforms, GEO optimization, and digital marketing. Shares content strategies and technical SEO practices for AI search engines in blog posts.
AI crawlers have reached 20% of site traffic, and GPTBot has grown 305%. How do you control AI bots with robots.txt allow/block rules, crawl-delay optimization, rate limiting, and server load management?
The AI crawler landscape in 2025: GPTBot holds a 30% share (up from 5% in 2024), PerplexityBot requests are up 157,490%, ClaudeBot requests are down 46%. Technical requirements: Core Web Vitals are critical (INP ≤200ms, LCP ≤2.5s, CLS ≤0.1), AI crawlers don't render JavaScript (80%+ of bots), server-side rendering is mandatory. Crawl budget: AI bots crawl roughly 60% less than traditional Googlebot, so sitemap optimization is critical. This guide covers: robots.txt configuration, AI-specific schema markup, site speed optimization, crawl budget mastery, and measurement.
Related Resources: AI Bot Crawler Management Guide | Schema Markup & AI Search Guide | ChatGPT Citation Optimization | Multi-Platform GEO Optimization | B2B SaaS GEO Case Study
AI Bot Traffic Distribution (May 2024 vs May 2025):
Major AI Crawlers (Traffic Share):
GPTBot (OpenAI):
- May 2024: 5%
- May 2025: 30% (+500% share increase)
- Behavior: Training crawler (static data harvest for GPT-5, GPT-6 training)
- Crawl frequency: Weekly full site crawl (high-priority sites), monthly (low-priority)
Meta-ExternalAgent (Meta AI):
- May 2024: Not tracked (negligible)
- May 2025: 19% (strong entry, Meta AI/Llama investment)
- Behavior: Training crawler (Meta Llama 4 training data)
- Crawl frequency: Bi-weekly
ClaudeBot (Anthropic):
- May 2024: 11.7%
- May 2025: 5.4% (-54% share decline, -46% request count)
- Why decline? Anthropic shifted to selective crawling (quality > quantity), partnerships (licensed data)
- Behavior: Index refinement (not heavy training, more curated)
- Crawl frequency: Monthly
PerplexityBot (Perplexity AI):
- May 2024: <1% (tiny)
- May 2025: 8% (+157,490% raw request increase!)
- Why explosion? Perplexity growth (22M users), real-time web search dependency
- Behavior: RAG crawler (real-time retrieval for search answers, not training)
- Crawl frequency: Daily (fresh content sites), weekly (evergreen)
Google-Extended (Google AI/Bard/Gemini):
- May 2024: 18%
- May 2025: 22% (+22% growth)
- Behavior: Training crawler (separate from Googlebot, Gemini/Bard training)
- Crawl frequency: Weekly
Bingbot + Microsoft AI:
- May 2024: 12%
- May 2025: 10% (-16% decline, Bing losing to ChatGPT)
- Behavior: Mixed (Bing search index + Copilot training)
Bytespider (ByteDance/TikTok AI):
- May 2024: 8%
- May 2025: 6% (-25%)
- Behavior: TikTok content recommendations, Douyin AI (China market focus)
Training Crawlers vs RAG Crawlers:
Training Crawlers (Static Data Harvest):
Purpose: Build LLM's world knowledge (pre-training, fine-tuning)
Crawl pattern: Broad, deep (entire site, all pages)
Frequency: Monthly/quarterly (not real-time)
Examples: GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent
What they want:
- Comprehensive content (long-form articles, guides, documentation)
- Evergreen knowledge (not news, time-sensitive data)
- Clean HTML (no JavaScript dependency)
- Structured data (schema markup for entity extraction)
RAG Crawlers (Real-Time Retrieval):
Purpose: Pull fresh content for live search/chat answers
Crawl pattern: Targeted, frequent (high-value pages, fresh content)
Frequency: Daily/hourly (real-time dependency)
Examples: PerplexityBot, OAI-SearchBot (ChatGPT search feature), Claude-User
What they want:
- Fresh content (news, statistics, recent updates)
- Fast loading (speed critical, timeout = skip page)
- Direct answers (FAQ format, clear structure)
- Citation-friendly (clear authorship, publish dates)
Technical Implication:
Training crawlers: Optimize entire site (deep crawl, all pages valuable)
RAG crawlers: Optimize high-value pages (priority: homepage, pillar content, fresh articles)
Budget allocation:
- Training crawlers: 60% effort (broad site health)
- RAG crawlers: 40% effort (targeted page optimization)
OpenAI Bots:
User-agent: GPTBot
# Purpose: Model training (GPT-4, GPT-5, future models)
# Respects robots.txt: Yes
# JavaScript rendering: No
User-agent: OAI-SearchBot
# Purpose: ChatGPT search feature (real-time web search)
# Respects robots.txt: Yes
# JavaScript rendering: Partial (limited)
User-agent: ChatGPT-User
# Purpose: User-driven browsing (when user asks ChatGPT to visit URL)
# Respects robots.txt: Yes
# JavaScript rendering: Yes (headless browser)
Allow Strategy (Maximum Visibility):
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
Anthropic Bots:
User-agent: ClaudeBot
# Purpose: Index refinement (not heavy training, curated crawl)
# Respects robots.txt: Yes
# JavaScript rendering: No
User-agent: Claude-Web
# Purpose: Web-focused crawl (broader than ClaudeBot)
# Respects robots.txt: Yes
User-agent: Claude-User
# Purpose: On-demand fetch (when user asks Claude to visit URL)
# Respects robots.txt: Yes
# JavaScript rendering: Yes (headless browser)
Allow Strategy:
User-agent: ClaudeBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Claude-User
Allow: /
Perplexity Bots:
User-agent: PerplexityBot
# Purpose: Perplexity AI search index (RAG, real-time search)
# Respects robots.txt: Yes (officially, controversy exists)
# JavaScript rendering: No
# Crawl rate: High (daily for fresh content)
Allow Strategy:
User-agent: PerplexityBot
Allow: /
Note: Perplexity faced controversy (2024) for allegedly bypassing robots.txt blocks
Official statement: "We respect robots.txt" (May 2024 update)
Google Bots:
User-agent: Googlebot
# Purpose: Google Search index (traditional SEO)
# Respects robots.txt: Yes
# JavaScript rendering: Yes (full rendering since 2019)
User-agent: Google-Extended
# Purpose: Google AI training (Bard, Gemini, future models)
# Respects robots.txt: Yes
# JavaScript rendering: No (separate from Googlebot)
Important Distinction:
- Blocking Google-Extended: Blocks AI training (Gemini), but Google Search still indexes (Googlebot allowed)
- Blocking Googlebot: Blocks Google Search + AI (both blocked)
Strategy (Maximum Visibility):
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Allow: /
Strategy (Block AI Training, Allow Search):
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Disallow: /
# Why? Some publishers want Google Search traffic but not AI training
# Trade-off: Google AI Overviews may not cite you (Google-Extended blocked)
Complete Robots.txt Template (AI-Friendly):
# Robots.txt for Maximum AI Visibility (2025)
# Allow all major AI crawlers
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Disallow low-value pages (all crawlers)
User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /api/
Disallow: /private/
# Allow high-value content
Allow: /blog/
Allow: /guides/
Allow: /resources/
Allow: /products/
# Crawl-delay (optional, reduce server load)
# Note: GPTBot and ClaudeBot don't support Crawl-delay (they ignore it)
# Googlebot respects it; for AI bots, rate limit at the server instead (see the sketch after this template)
User-agent: *
Crawl-delay: 1
# Sitemap (critical for AI crawlers, discovery)
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
Sitemap: https://example.com/product-sitemap.xml
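Because GPTBot and ClaudeBot ignore Crawl-delay, server load from AI bots has to be managed at the application or reverse-proxy layer instead. Below is a minimal sketch of per-bot rate limiting, assuming an Express server; the bot list, the one-request-per-second budget, and the 429 + Retry-After response are illustrative choices, not official crawler requirements.

// Per-bot rate limiting sketch (Express). Assumption: requests arriving faster than
// once per second from the same AI crawler get a 429 with a Retry-After hint.
const express = require('express');
const app = express();

const AI_BOTS = ['GPTBot', 'ClaudeBot', 'PerplexityBot', 'Google-Extended', 'Meta-ExternalAgent'];
const lastHit = new Map(); // bot name -> timestamp of last allowed request

app.use((req, res, next) => {
  const ua = req.get('user-agent') || '';
  const bot = AI_BOTS.find(name => ua.includes(name));
  if (!bot) return next(); // regular users are never throttled

  const now = Date.now();
  if (now - (lastHit.get(bot) || 0) < 1000) {
    res.set('Retry-After', '1'); // ask the crawler to come back later instead of erroring
    return res.status(429).send('Rate limit exceeded');
  }
  lastHit.set(bot, now);
  next();
});

app.listen(3000);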
Selective Blocking (Protecting Proprietary Content):
Scenario: Block AI training but allow search indexing
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
Why?
- Proprietary content (e-books, courses, research)
- Competitive advantage (don't want AI models trained on your unique data)
- Legal concerns (copyright, licensing)
Trade-off:
- AI platforms (ChatGPT, Claude, Gemini) won't cite you (no training data)
- Google/Bing Search still indexes (organic traffic preserved)
Who does this?
- NYT, WSJ (blocked GPTBot 2023, then negotiated licensing deals)
- Reddit (blocked GPTBot, then $60M Google deal)
- Stack Overflow (blocked GPTBot 2023, protecting Q&A data)
INP Basics:
What is INP?
Interaction to Next Paint = responsiveness metric
Measures: Time from user interaction (click, tap, keyboard) to next visual update
Replaced: FID (First Input Delay) in March 2024
Why? FID only measured first interaction, INP measures all interactions
Target:
- Good: ≤200ms
- Needs improvement: 200-500ms
- Poor: >500ms
Why AI Platforms Care:
- RAG crawlers (PerplexityBot, OAI-SearchBot) simulate user interaction (click "Read More", expand FAQ)
- Poor INP = crawler timeout, page skip (citation loss)
- Google AI Overviews: INP affects Google Search ranking → affects AI Overview inclusion
Measurement:
- Chrome DevTools → Performance tab → Interactions
- PageSpeed Insights (Google): INP score
- Web Vitals Extension (Chrome): Real-time INP
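For field data, the open-source web-vitals library exposes an onINP callback that reports the page's INP from real user interactions. A minimal sketch, assuming the library is installed and the page is built as an ES module; the /analytics endpoint is a placeholder.

import { onINP } from 'web-vitals';

// Reports the page's INP (worst qualifying interaction) once it is finalized
onINP((metric) => {
  console.log('INP:', metric.value, 'ms');
  // Placeholder endpoint: swap in your own analytics collector
  navigator.sendBeacon('/analytics', JSON.stringify({ name: metric.name, value: metric.value }));
});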
INP Optimization Tactics:
1. Reduce JavaScript Execution Time:
Problem: Heavy JavaScript blocks main thread, delays interaction response
Example:
User clicks FAQ accordion → JavaScript runs 800ms → Accordion expands
INP: 800ms (Poor!)
Solution: Code splitting, defer non-critical JS
Before (Poor INP):
<script src="bundle.js"></script> <!-- 2MB bundle, blocks main thread -->
After (Good INP):
<script src="critical.js"></script> <!-- 50KB, essential only -->
<script defer src="non-critical.js"></script> <!-- Loads after page interactive -->
Result: INP 800ms → 120ms (-85% improvement)
2. Optimize Event Handlers:
Problem: Heavy event listeners (click, scroll) delay response
Bad code:
document.querySelector('.button').addEventListener('click', function() {
// Synchronous, heavy operation (200ms)
processLargeDataset();
updateUI();
});
Good code:
document.querySelector('.button').addEventListener('click', async function() {
// Show immediate feedback (instant)
showLoadingSpinner();
// Yield to the browser so the interaction can paint before the heavy work starts
await new Promise(resolve => setTimeout(resolve, 0));
// Heavy operation now runs after the next paint (or move it to a Web Worker, see below)
processLargeDataset();
updateUI();
hideLoadingSpinner();
});
Result: User sees instant response (spinner), perceived INP <50ms
3. Web Workers (Offload Heavy Computation):
Problem: JavaScript computation blocks main thread
Solution: Web Workers (separate thread)
// Main thread (blocks INP)
function heavyCalculation() {
// 500ms computation
let result = complexAlgorithm(largeData);
updateUI(result);
}
// Web Worker (doesn't block INP)
// main.js
const worker = new Worker('worker.js');
worker.postMessage(largeData);
worker.onmessage = function(e) {
updateUI(e.data); // Update UI with result
};
// worker.js
onmessage = function(e) {
let result = complexAlgorithm(e.data);
postMessage(result);
};
Result: Main thread free, INP unaffected
4. Debounce/Throttle Input Handlers:
Problem: Search autocomplete fires 10 events/second (typing "artificial intelligence")
Bad (no debounce):
input.addEventListener('input', function(e) {
// Fires 10x/second, API call each time
fetchSearchResults(e.target.value);
});
Good (debounced):
import debounce from 'lodash/debounce';
input.addEventListener('input', debounce(function(e) {
// Fires once after 300ms pause
fetchSearchResults(e.target.value);
}, 300));
Result: 10 API calls → 1 API call, INP improved
LCP (Largest Contentful Paint) Optimization:
Target: ≤2.5 seconds
What is LCP?
Largest visible element load time (hero image, H1 heading, video)
AI Platform Impact:
- Slow LCP = crawlers timeout (GPTBot, PerplexityBot wait max 5-8 seconds)
- Google AI Overviews: LCP affects ranking → affects citation
Common LCP Elements:
- Hero image (homepage banner, article header image)
- H1 heading with background image
- Video (above-the-fold)
- Large text block (article content)
Optimization Tactics:
1. Image Optimization:
Before (Poor LCP):
<img src="hero.jpg" width="1920" height="1080"> <!-- 2.5MB, uncompressed -->
After (Good LCP):
<img
src="hero.webp"
srcset="hero-480w.webp 480w, hero-960w.webp 960w, hero-1920w.webp 1920w"
sizes="(max-width: 480px) 480px, (max-width: 960px) 960px, 1920px"
width="1920"
height="1080"
loading="eager"
fetchpriority="high"
alt="Hero image">
<!-- WebP format (70% smaller), responsive images, priority hints -->
Result: 2.5MB → 400KB, LCP 4.2s → 1.8s
2. Preload Critical Resources:
<head>
<!-- Preload hero image (highest priority) -->
<link rel="preload" as="image" href="hero.webp" fetchpriority="high">
<!-- Preload critical fonts (avoid FOIT/FOUT) -->
<link rel="preload" as="font" href="fonts/inter-bold.woff2" type="font/woff2" crossorigin>
<!-- Preconnect to external domains (CDN, API) -->
<link rel="preconnect" href="https://cdn.example.com">
</head>
Result: Browser prioritizes LCP resources, faster render
3. Remove Render-Blocking Resources:
Before (Blocks LCP):
<head>
<link rel="stylesheet" href="styles.css"> <!-- Blocks rendering -->
<script src="analytics.js"></script> <!-- Blocks rendering -->
</head>
After (Non-Blocking):
<head>
<!-- Inline critical CSS (above-the-fold styles) -->
<style>
/* Critical CSS (header, hero, navigation) */
.hero { background: url('hero.webp'); height: 600px; }
h1 { font-size: 48px; color: #000; }
</style>
<!-- Defer non-critical CSS -->
<link rel="preload" as="style" href="styles.css" onload="this.onload=null;this.rel='stylesheet'">
<!-- Defer JavaScript -->
<script defer src="analytics.js"></script>
</head>
Result: LCP elements render immediately (no blocking)
4. CDN + Caching:
Without CDN:
- Server: US East (Virginia)
- User: Tokyo
- Latency: 280ms (round-trip)
- LCP image download: 1.8s
With CDN (Cloudflare, AWS CloudFront):
- Edge server: Tokyo
- Latency: 15ms
- LCP image download: 380ms
Result: LCP 3.2s → 1.4s (-56% improvement)
CDN Setup:
1. Sign up: Cloudflare (free tier), AWS CloudFront, Fastly
2. Configure DNS (point to CDN)
3. Set cache headers:
# .htaccess (Apache) or nginx.conf
<IfModule mod_expires.c>
ExpiresActive On
ExpiresByType image/webp "access plus 1 year"
ExpiresByType text/css "access plus 1 month"
ExpiresByType application/javascript "access plus 1 month"
</IfModule>
4. Purge cache when content updates
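Step 4 (purging on content updates) can be automated from the publish pipeline. A sketch using Cloudflare's purge_cache endpoint as one example; the zone ID, API token, and URLs are placeholders, and other CDNs expose equivalent purge APIs.

// Purge specific URLs from the CDN edge cache after a content update (Node 18+, built-in fetch)
const ZONE_ID = 'your-zone-id';       // placeholder
const API_TOKEN = 'your-api-token';   // placeholder

async function purgeUrls(urls) {
  const res = await fetch(`https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ files: urls }),
  });
  const data = await res.json();
  console.log('Purge succeeded:', data.success);
}

// Example: purge an updated article after republishing it
purgeUrls(['https://example.com/blog/chatgpt-optimization']);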
CLS (Cumulative Layout Shift) Optimization:
Target: ≤0.1
What is CLS?
Layout stability = elements don't shift unexpectedly during page load
AI Platform Impact:
- Layout shifts confuse crawlers (element positions change, extraction errors)
- Poor UX signal (Google ranking factor, affects AI Overview inclusion)
Common CLS Causes:
1. Images Without Dimensions:
Bad (Causes CLS):
<img src="article.jpg" alt="Article image">
<!-- Browser doesn't know size, reserves no space, shifts when loaded -->
Good (No CLS):
<img src="article.jpg" alt="Article image" width="800" height="600">
<!-- Browser reserves 800×600 space, no shift when image loads -->
Or use aspect-ratio CSS:
<img src="article.jpg" alt="Article image" style="aspect-ratio: 4/3; width: 100%;">
2. Ads/Embeds Without Reserved Space:
Bad:
<div id="ad-slot"></div>
<!-- Ad loads dynamically, pushes content down (CLS!) -->
Good:
<div id="ad-slot" style="min-height: 250px;">
<!-- Reserved space, content doesn't shift -->
</div>
3. Web Fonts (FOIT/FOUT):
Problem: Font loads, text re-renders, layout shifts
Solution: font-display: swap + preload
<head>
<link rel="preload" as="font" href="inter.woff2" type="font/woff2" crossorigin>
<style>
@font-face {
font-family: 'Inter';
src: url('inter.woff2') format('woff2');
font-display: swap; /* Show fallback font immediately, swap when loaded */
}
</style>
</head>
4. Dynamic Content Injection:
Bad:
<!-- Content loads, banner injected at top, pushes everything down -->
<div id="top-banner"></div>
<main>Article content...</main>
<script>
// Injects banner after page load (CLS!)
document.getElementById('top-banner').innerHTML = '<div>Newsletter signup!</div>';
</script>
Good:
<!-- Reserve space with skeleton/placeholder -->
<div id="top-banner" style="min-height: 100px;">
<div class="skeleton-loader"></div> <!-- Placeholder -->
</div>
<script>
// Replaces skeleton, no layout shift
document.getElementById('top-banner').innerHTML = '<div>Newsletter signup!</div>';
</script>
CLS Measurement:
- Chrome DevTools → Performance → Experience → Layout Shifts (red bars)
- PageSpeed Insights: CLS score + screenshot of shifting elements
- Web Vitals Extension: Real-time CLS
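The same layout-shift data those tools report can be logged directly in the browser with a PerformanceObserver. A minimal sketch that keeps a simplified running total (the official CLS metric groups shifts into session windows, which this skips).

// Log individual layout shifts and a simplified running total, ignoring shifts caused by user input
let clsTotal = 0;
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (!entry.hadRecentInput) {
      clsTotal += entry.value;
      console.log('Layout shift:', entry.value.toFixed(4), '| running total:', clsTotal.toFixed(4));
    }
  }
});
observer.observe({ type: 'layout-shift', buffered: true });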
The Problem:
Reality Check (2025):
- 80%+ AI crawlers don't render JavaScript (GPTBot, ClaudeBot, PerplexityBot, Google-Extended)
- Only user-driven bots render JS (ChatGPT-User, Claude-User when user asks to visit URL)
- Googlebot renders JS (full rendering since 2019), but Google-Extended (AI training) doesn't
Implication:
- React/Vue/Angular SPA (Single Page Application): AI crawlers see empty page
- Client-side content loading: Invisible to AI
- JavaScript-based navigation: Broken links for AI
Example (React SPA):
HTML delivered to crawler:
<!DOCTYPE html>
<html>
<head><title>My Blog</title></head>
<body>
<div id="root"></div>
<script src="bundle.js"></script>
</body>
</html>
What GPTBot sees:
Empty <div id="root"> (no content!)
What should be there (after JS executes):
<div id="root">
<h1>10 Best AI Tools for 2025</h1>
<p>Artificial intelligence is transforming...</p>
...
</div>
Result: GPTBot sees no content, skips page, no citation
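You can approximate what a non-rendering crawler receives by fetching the raw HTML with an AI bot user-agent and checking whether the main content is present. A minimal Node.js (18+) sketch; the URL, the simplified user-agent string, and the content heuristics are placeholders to adjust per site.

// Fetch raw HTML the way a non-rendering crawler would (no JavaScript executes)
const url = 'https://example.com/blog/ai-tools-2025'; // placeholder URL

(async () => {
  const res = await fetch(url, { headers: { 'User-Agent': 'GPTBot/1.0' } });
  const html = await res.text();
  // Rough heuristics: look for a heading and a non-trivial amount of markup
  const hasContent = html.includes('<h1') && html.length > 5000;
  console.log(hasContent
    ? 'Raw HTML contains content: crawlable without JavaScript'
    : 'Raw HTML looks empty: SSR/SSG or dynamic rendering is needed');
})();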
Solutions:
1. Server-Side Rendering (SSR):
What is SSR?
Server generates HTML (fully rendered) before sending to browser/crawler
Frameworks with SSR:
- Next.js (React): Industry standard for React SSR
- Nuxt.js (Vue): Vue equivalent
- SvelteKit (Svelte): Lightweight SSR
- Angular Universal (Angular): Angular SSR
Example (Next.js):
// pages/blog/[slug].js (Next.js SSR)
export async function getServerSideProps(context) {
// Fetch data on server (not client)
const post = await fetchBlogPost(context.params.slug);
return {
props: { post } // Pass to component
};
}
export default function BlogPost({ post }) {
return (
<article>
<h1>{post.title}</h1>
<div dangerouslySetInnerHTML={{ __html: post.content }} />
</article>
);
}
What crawler sees:
<!DOCTYPE html>
<html>
<head><title>10 Best AI Tools for 2025</title></head>
<body>
<article>
<h1>10 Best AI Tools for 2025</h1>
<div>
<p>Artificial intelligence is transforming...</p>
...
</div>
</article>
</body>
</html>
Result: Full content in HTML, GPTBot/ClaudeBot/PerplexityBot can crawl
SSR Benefits for AI:
✅ Full content accessible (no JavaScript required)
✅ Faster initial load (content rendered server-side)
✅ SEO + GEO friendly (all crawlers can index)
SSR Drawbacks:
❌ Server cost (CPU usage higher, more server rendering)
❌ Complexity (SSR setup more complex than client-side)
❌ TTFB (Time to First Byte) higher (server processing time)
2. Static Site Generation (SSG):
What is SSG?
Pre-render all pages at build time (HTML files generated)
Best for:
- Blogs, documentation, marketing sites (content doesn't change frequently)
- 10-10,000 pages (feasible to pre-render)
Frameworks:
- Next.js (getStaticProps, getStaticPaths)
- Gatsby (React)
- Hugo (Go-based, fastest)
- Jekyll (Ruby, GitHub Pages)
- Astro (modern, multi-framework)
Example (Next.js SSG):
// pages/blog/[slug].js
export async function getStaticProps({ params }) {
const post = await fetchBlogPost(params.slug);
return { props: { post } };
}
export async function getStaticPaths() {
const posts = await fetchAllBlogPosts();
const paths = posts.map(post => ({ params: { slug: post.slug } }));
return { paths, fallback: false };
}
export default function BlogPost({ post }) {
return <article><h1>{post.title}</h1>...</article>;
}
Build process:
npm run build
→ Generates static HTML files:
out/blog/ai-tools-2025.html
out/blog/seo-trends.html
...
Deploy: Upload HTML files to CDN (Vercel, Netlify, AWS S3 + CloudFront)
SSG Benefits:
✅ Fastest performance (static HTML, no server rendering)
✅ Cheapest hosting (static files, CDN-only)
✅ Perfect for crawlers (full HTML, instant)
SSG Drawbacks:
❌ Build time (1,000 pages = 5-10 minutes build)
❌ Not real-time (content updates require rebuild)
❌ Dynamic content limitation (user-specific content difficult)
3. Dynamic Rendering (Hybrid Approach):
What is Dynamic Rendering?
Serve SSR/SSG HTML to crawlers, client-side JS to users
How it works:
1. Detect user-agent (is it a crawler?)
2. If crawler → Serve pre-rendered HTML
3. If user → Serve client-side JS app
Implementation:
// middleware.js (Next.js)
import { NextResponse } from 'next/server';
export function middleware(request) {
const userAgent = request.headers.get('user-agent') || '';
// List of crawler user-agents
const crawlers = ['Googlebot', 'GPTBot', 'ClaudeBot', 'PerplexityBot', 'Google-Extended'];
const isCrawler = crawlers.some(bot => userAgent.includes(bot));
if (isCrawler) {
// Serve pre-rendered HTML (SSR or cached HTML)
return NextResponse.rewrite(new URL('/prerendered' + request.nextUrl.pathname, request.url));
}
// Serve client-side app (React SPA)
return NextResponse.next();
}
Tools for Dynamic Rendering:
- Prerender.io ($20-200/mo): Crawler detection + pre-rendering service
- Rendertron (Google, open-source): Headless Chrome rendering
- Puppeteer + Cache: DIY solution (render with Puppeteer, cache HTML; see the sketch after this list)
Pros:
✅ Best of both worlds (fast SPA for users, full HTML for crawlers)
✅ No rebuild needed (real-time content updates)
Cons:
❌ Complexity (middleware, cache management)
❌ Cost (pre-rendering service or server resources)
❌ Cloaking risk (Google penalizes showing different content to crawlers vs users, but acceptable if content equivalent)
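The Puppeteer + Cache DIY option from the list above can be sketched as a small Express layer in front of the SPA; the bot list, cache lifetime, and origin URL are illustrative assumptions, and a production setup would add cache invalidation and error handling.

// DIY dynamic rendering: render pages with headless Chrome for crawlers, cache the HTML
const express = require('express');
const puppeteer = require('puppeteer');

const app = express();
const cache = new Map(); // url -> { html, renderedAt }
const CACHE_TTL_MS = 60 * 60 * 1000; // re-render at most once per hour per URL

const AI_BOTS = ['Googlebot', 'GPTBot', 'ClaudeBot', 'PerplexityBot', 'Google-Extended'];

async function renderPage(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' }); // wait for client-side rendering to settle
  const html = await page.content(); // fully rendered HTML
  await browser.close();
  return html;
}

app.use(async (req, res, next) => {
  const ua = req.get('user-agent') || '';
  if (!AI_BOTS.some(bot => ua.includes(bot))) return next(); // humans get the normal SPA

  const url = 'https://example.com' + req.originalUrl; // placeholder origin
  const cached = cache.get(url);
  if (cached && Date.now() - cached.renderedAt < CACHE_TTL_MS) {
    return res.send(cached.html);
  }
  const html = await renderPage(url);
  cache.set(url, { html, renderedAt: Date.now() });
  res.send(html);
});

app.listen(3000);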
4. Progressive Enhancement (Fallback Approach):
Concept: Start with semantic HTML, enhance with JavaScript
Example (FAQ Accordion):
HTML (no JavaScript):
<section>
<h2>Frequently Asked Questions</h2>
<details>
<summary>What is GEO?</summary>
<p>Generative Engine Optimization (GEO) is the practice of optimizing content for AI platforms like ChatGPT, Perplexity, and Claude...</p>
</details>
<details>
<summary>How is GEO different from SEO?</summary>
<p>SEO focuses on ranking in Google search results, while GEO focuses on getting cited by AI platforms...</p>
</details>
</section>
JavaScript Enhancement (for users):
<script>
// Convert <details> to fancy accordion (animated, styled)
document.querySelectorAll('details').forEach(detail => {
// Add animations, custom styling, analytics tracking
enhanceAccordion(detail);
});
</script>
What crawlers see:
Full content in <details> tags (HTML semantic, no JS required)
What users see:
Enhanced accordion (animations, better UX)
Progressive Enhancement Benefits:
✅ Crawlers get full content (HTML fallback)
✅ Users get enhanced experience (JavaScript enhancements)
✅ Resilient (JavaScript fails? HTML fallback works)
When to use:
- Interactive components (accordions, tabs, modals)
- Form enhancements (validation, autocomplete)
- Navigation (dropdowns, mega-menus)
Crawl Budget Basics:
What is Crawl Budget?
Number of pages a crawler will fetch from your site in given time
Googlebot Crawl Budget:
- Large site (1M+ pages): 10K-100K pages/day
- Medium site (10K-100K pages): 1K-10K pages/day
- Small site (<10K pages): 100-1K pages/day
AI Crawler Crawl Budget (Estimated):
- GPTBot: 40% of Googlebot rate (slower, more selective)
- PerplexityBot: 60% of Googlebot (faster, RAG dependency)
- ClaudeBot: 30% of Googlebot (very selective, quality over quantity)
- Google-Extended: 50% of Googlebot
Why Lower?
- AI crawlers focus on quality (not comprehensive indexing like Google)
- Server costs (AI companies smaller than Google, less infrastructure)
- Selective crawling (training data curation, not every page valuable)
Implication:
- Prioritize high-value pages (AI crawlers may not crawl entire site)
- Reduce waste (duplicate pages, low-quality content consume budget)
Crawl Budget Optimization Tactics:
1. XML Sitemap Optimization:
Priority: Tell crawlers which pages matter most
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<!-- High-priority pages (pillar content, guides) -->
<url>
<loc>https://example.com/complete-geo-guide</loc>
<lastmod>2025-01-09</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority> <!-- Highest priority -->
</url>
<!-- Medium-priority pages (blog posts) -->
<url>
<loc>https://example.com/blog/chatgpt-optimization</loc>
<lastmod>2025-01-05</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<!-- Low-priority pages (tag pages, archives) -->
<url>
<loc>https://example.com/tag/seo</loc>
<lastmod>2024-12-01</lastmod>
<changefreq>yearly</changefreq>
<priority>0.3</priority> <!-- Low priority -->
</url>
</urlset>
Sitemap Best Practices:
✅ Split large sitemaps (max 50K URLs per sitemap, max 50MB file size)
✅ Update lastmod accurately (crawlers prioritize recently updated pages)
✅ Use priority wisely (not all pages 1.0, differentiate value)
✅ Multiple sitemaps:
- sitemap-blog.xml (blog posts)
- sitemap-guides.xml (pillar content)
- sitemap-products.xml (product pages)
Submit to:
- Google Search Console (Googlebot + Google-Extended)
- Bing Webmaster Tools (Bingbot)
- robots.txt (all crawlers discover)
Sitemap: https://example.com/sitemap-index.xml
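Keeping lastmod accurate is easier when sitemaps are generated from the content source instead of edited by hand. A minimal Node.js sketch; the posts array stands in for a CMS or database query, and the priority values mirror the ones above.

// Generate sitemap-blog.xml from a content list so <lastmod> always matches the real update date
const fs = require('fs');

const posts = [ // placeholder for your CMS/database query
  { slug: 'complete-geo-guide', updatedAt: '2025-01-09', priority: '1.0' },
  { slug: 'chatgpt-optimization', updatedAt: '2025-01-05', priority: '0.8' },
];

const urls = posts.map(post => `
  <url>
    <loc>https://example.com/blog/${post.slug}</loc>
    <lastmod>${post.updatedAt}</lastmod>
    <priority>${post.priority}</priority>
  </url>`).join('');

const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">${urls}
</urlset>`;

fs.writeFileSync('sitemap-blog.xml', sitemap);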
2. Remove Duplicate Content:
Problem: Duplicate pages waste crawl budget
Common duplicates:
- URL parameters: /product?color=red vs /product?color=blue (same content)
- HTTP vs HTTPS: http://example.com vs https://example.com
- www vs non-www: www.example.com vs example.com
- Trailing slash: /about vs /about/
- Mobile vs desktop: m.example.com vs example.com
Solutions:
a) Canonical tags (tell crawlers which version is primary):
<link rel="canonical" href="https://example.com/product">
b) 301 redirects (redirect duplicates to canonical):
# .htaccess (Apache)
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]
c) robots.txt (block low-value parameter pages):
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
3. Fix Crawl Errors:
Monitor Google Search Console → Coverage report:
- 404 errors: Remove broken links, or 301 redirect to relevant page
- 500 errors: Server errors (fix ASAP, crawlers deprioritize error-prone sites)
- Soft 404s: Pages returning 200 but have no content (fix or remove)
AI Crawler Specific:
- Check server logs for GPTBot, PerplexityBot, ClaudeBot 4xx/5xx errors
- Timeout errors (page loads >8s, crawler gives up): Optimize speed
4. Reduce Low-Value Pages:
Audit: Which pages get crawled but no traffic/citations?
Examples:
- Tag pages (100 tags = 100 low-value pages)
- Author archives (10 authors = 10 pages, minimal content)
- Date archives (/2024/01/, /2024/02/, ... /2024/12/ = 12 pages)
- Pagination (page 50 of blog = zero value)
Solutions:
- Noindex low-value pages: <meta name="robots" content="noindex, follow">
- Consolidate (remove excessive tags, keep top 10-20 most used)
- Robots.txt block (Disallow: /author/, Disallow: /tag/)
Result: Crawl budget focused on high-value content (pillar posts, guides)
5. Internal Linking Optimization:
Crawlers prioritize pages with more internal links (signal of importance)
Hub-and-Spoke Model:
- Hub (pillar content): "Complete GEO Guide 2025"
→ Spoke 1: "ChatGPT Optimization"
→ Spoke 2: "Perplexity Optimization"
→ Spoke 3: "Claude Optimization"
Pillar page: 50 internal links pointing to it (high priority)
Spoke pages: 10 internal links each (medium priority)
Implementation:
- Navigation menu: Link to pillar pages (every page links to them)
- Contextual links: In-content links to related articles
- Footer links: "Popular Resources" section (links to top 5-10 pages)
Tool: Screaming Frog SEO Spider → Internal links report (identify orphan pages, low link count pages)
Article Schema (Comprehensive):
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Technical SEO for AI Platforms 2025: Complete Guide",
"description": "AI crawler optimization: GPTBot, PerplexityBot, ClaudeBot crawl optimization, Core Web Vitals, schema markup.",
"image": [
"https://example.com/images/technical-seo-ai-1x1.jpg",
"https://example.com/images/technical-seo-ai-4x3.jpg",
"https://example.com/images/technical-seo-ai-16x9.jpg"
],
"datePublished": "2025-01-09T08:00:00+00:00",
"dateModified": "2025-01-09T08:00:00+00:00",
"author": {
"@type": "Person",
"name": "John Doe",
"url": "https://example.com/author/john-doe",
"sameAs": [
"https://twitter.com/johndoe",
"https://linkedin.com/in/johndoe"
]
},
"publisher": {
"@type": "Organization",
"name": "AIseo Optimizer",
"logo": {
"@type": "ImageObject",
"url": "https://example.com/logo.png",
"width": 600,
"height": 60
}
},
"mainEntityOfPage": {
"@type": "WebPage",
"@id": "https://example.com/blog/technical-seo-ai-platforms"
},
"articleSection": "Technical SEO",
"keywords": "Technical SEO, AI crawlers, GPTBot, PerplexityBot, Core Web Vitals, schema markup",
"wordCount": 8500,
"inLanguage": "tr-TR",
"copyrightYear": 2025,
"copyrightHolder": {
"@type": "Organization",
"name": "AIseo Optimizer"
}
}
Why Each Field Matters (AI Platforms):
headline: ChatGPT/Perplexity extract as title in citation
description: Used in AI answer summaries
image: Multi-aspect ratio = Gemini multi-modal search, ChatGPT visual answers
datePublished: Perplexity freshness ranking (recent = higher priority)
dateModified: Updates signal (frequent updates = active content)
author: E-E-A-T signal (credentials verification)
publisher: Brand recognition (established publisher = trust)
articleSection: Topic clustering (AI understands content category)
keywords: Semantic signals (AI topic modeling)
wordCount: Depth signal (longer = comprehensive, but quality matters)
HowTo Schema (Step-by-Step Guides):
{
"@context": "https://schema.org",
"@type": "HowTo",
"name": "How to Optimize robots.txt for AI Crawlers",
"description": "Step-by-step guide to configure robots.txt for GPTBot, PerplexityBot, ClaudeBot, and Google-Extended.",
"image": "https://example.com/images/robots-txt-guide.jpg",
"totalTime": "PT30M",
"estimatedCost": {
"@type": "MonetaryAmount",
"currency": "USD",
"value": "0"
},
"tool": [
{
"@type": "HowToTool",
"name": "Text editor (VS Code, Sublime Text, Notepad++)"
},
{
"@type": "HowToTool",
"name": "FTP client or hosting control panel"
}
],
"step": [
{
"@type": "HowToStep",
"name": "Create robots.txt file",
"text": "Create a plain text file named robots.txt in your website's root directory.",
"image": "https://example.com/images/step1-create-file.jpg",
"url": "https://example.com/blog/technical-seo-ai-platforms#step1"
},
{
"@type": "HowToStep",
"name": "Add AI crawler user-agents",
"text": "Add User-agent directives for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended.",
"image": "https://example.com/images/step2-add-agents.jpg",
"url": "https://example.com/blog/technical-seo-ai-platforms#step2"
},
{
"@type": "HowToStep",
"name": "Configure Allow/Disallow rules",
"text": "Set Allow: / to permit crawling, or Disallow: / to block. Add specific path rules as needed.",
"url": "https://example.com/blog/technical-seo-ai-platforms#step3"
},
{
"@type": "HowToStep",
"name": "Upload robots.txt",
"text": "Upload the file to your site's root (https://yourdomain.com/robots.txt).",
"url": "https://example.com/blog/technical-seo-ai-platforms#step4"
},
{
"@type": "HowToStep",
"name": "Test robots.txt",
"text": "Use Google Search Console Robots.txt Tester or visit https://yourdomain.com/robots.txt to verify.",
"url": "https://example.com/blog/technical-seo-ai-platforms#step5"
}
]
}
Why HowTo Schema Works (AI Citations):
ChatGPT: Extracts step-by-step instructions for conversational guidance
Perplexity: Cites HowTo content for procedural queries ("how to optimize robots.txt")
Claude: Values structured instructions (technical accuracy preference)
Google AI Overviews: HowTo rich results (featured snippets, visual steps)
Implementation: JSON-LD in <head> or <script> tag at bottom of <body>
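A minimal sketch of that implementation in a Next.js/React page, consistent with the SSR examples earlier; the component name and props are illustrative, and the schema prop would hold one of the JSON objects shown above.

// Embed a schema object as JSON-LD; dangerouslySetInnerHTML keeps the JSON from being HTML-escaped
export default function GuidePage({ schema, children }) {
  return (
    <>
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: JSON.stringify(schema) }}
      />
      {children}
    </>
  );
}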
FAQ Schema (Voice Search + AI Conversational Queries):
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is GPTBot?",
"acceptedAnswer": {
"@type": "Answer",
"text": "GPTBot is OpenAI's web crawler that collects data for training GPT models (GPT-4, GPT-5, future versions). It respects robots.txt and can be allowed or blocked using User-agent: GPTBot directives."
}
},
{
"@type": "Question",
"name": "Should I block Google-Extended?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Blocking Google-Extended prevents your content from being used to train Google AI models (Gemini, Bard) but doesn't affect Google Search indexing (Googlebot is separate). Block if you want to protect proprietary content while maintaining search visibility."
}
},
{
"@type": "Question",
"name": "How often does PerplexityBot crawl?",
"acceptedAnswer": {
"@type": "Answer",
"text": "PerplexityBot crawls daily for fresh content sites (news, blogs) and weekly for evergreen content. It's a RAG crawler (real-time retrieval) so it prioritizes recently updated pages for Perplexity AI search answers."
}
}
]
}
Apache/Nginx Log Analysis:
# Apache access.log example
grep "GPTBot" /var/log/apache2/access.log | wc -l
# Output: 1,247 (GPTBot requests in log file)
# Analyze most crawled pages (GPTBot)
grep "GPTBot" /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
# Output:
# 342 /blog/chatgpt-optimization
# 215 /blog/perplexity-seo-guide
# 187 /complete-geo-guide
# ...
# Check response codes (4xx, 5xx errors)
grep "GPTBot" /var/log/apache2/access.log | grep " 404 " | wc -l
# Output: 23 (GPTBot encountered 23 404 errors, fix these!)
# Crawl frequency (GPTBot requests per day)
grep "GPTBot" /var/log/apache2/access.log | awk '{print $4}' | cut -d: -f1 | uniq -c
# Output:
# 45 [09/Jan/2025
# 52 [08/Jan/2025
# 38 [07/Jan/2025
Log Analysis Tools:
Free Tools:
- GoAccess (open-source, real-time web log analyzer)
Installation: sudo apt-get install goaccess
Usage: goaccess /var/log/apache2/access.log -o report.html --log-format=COMBINED
- AWStats (classic web analytics, bot detection)
Paid Tools:
- Splunk ($15-150/mo): Enterprise log management
- Loggly ($79+/mo): Cloud-based log analysis
- Datadog ($15+/host/mo): Infrastructure monitoring + logs
What to Track:
✅ Crawler visits per day (trend: increasing = good)
✅ Pages crawled (which content AI bots prioritize?)
✅ Errors (404, 500 = fix immediately, waste crawl budget)
✅ Crawl depth (how deep into site structure?)
✅ Bandwidth usage (AI crawlers consuming resources?)
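To cover the per-day trend and bandwidth items above without hand-written grep pipelines, a small Node.js script can summarize the access log. A sketch assuming the Apache/Nginx combined log format; the log path, bot list, and field positions are assumptions to adjust per server.

// Summarize AI crawler requests and bandwidth per bot per day from a combined-format access log
const fs = require('fs');
const readline = require('readline');

const BOTS = ['GPTBot', 'ClaudeBot', 'PerplexityBot', 'Google-Extended', 'Bingbot'];
const stats = {}; // "bot|date" -> { requests, bytes }

const rl = readline.createInterface({ input: fs.createReadStream('/var/log/apache2/access.log') });

rl.on('line', (line) => {
  const bot = BOTS.find(name => line.includes(name));
  if (!bot) return;
  const date = (line.match(/\[(\d{2}\/\w{3}\/\d{4})/) || [])[1] || 'unknown';
  const bytes = parseInt(line.split(' ')[9], 10) || 0; // response size field in combined format
  const key = `${bot}|${date}`;
  stats[key] = stats[key] || { requests: 0, bytes: 0 };
  stats[key].requests += 1;
  stats[key].bytes += bytes;
});

rl.on('close', () => {
  for (const [key, { requests, bytes }] of Object.entries(stats)) {
    console.log(`${key}: ${requests} requests, ${(bytes / 1024 / 1024).toFixed(1)} MB`);
  }
});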
Google Search Console (Googlebot + Google-Extended):
Settings → Crawl Stats:
Metrics:
- Total crawl requests (requests/day)
- Total download size (KB/day)
- Average response time (ms)
What to monitor:
- Crawl requests dropping? (Check robots.txt, server errors)
- Response time increasing? (Site speed issues, server overload)
- Download size spiking? (Large files, unoptimized images)
Googlebot vs Google-Extended:
- Google Search Console shows Googlebot only (not Google-Extended separately)
- To track Google-Extended: Server logs (grep "Google-Extended")
Phase 1: Foundation (Days 1-20)
Week 1: Audit
✅ Core Web Vitals audit:
- PageSpeed Insights (desktop + mobile)
- Identify LCP, INP, CLS issues
- Benchmark: Current scores
✅ JavaScript rendering test:
- View source (curl https://yourdomain.com)
- If empty <body> → SSR/SSG needed
✅ Robots.txt audit:
- Check current robots.txt (https://yourdomain.com/robots.txt)
- Are AI crawlers allowed/blocked?
✅ Schema audit:
- Google Rich Results Test (search.google.com/test/rich-results)
- Missing schemas? Errors?
Week 2-3: Core Web Vitals Fixes
✅ LCP optimization:
- Image compression (WebP format)
- CDN setup (Cloudflare free tier)
- Preload critical resources
✅ INP optimization:
- Defer non-critical JavaScript
- Code splitting (if using React/Vue)
- Debounce input handlers
✅ CLS fixes:
- Add dimensions to all images
- Reserve space for ads/embeds
- Font-display: swap
Target: All Core Web Vitals "Good" (green)
Budget: $100-300 (CDN, image optimization tools)
Phase 2: AI Crawler Optimization (Days 21-40)
Week 4: Robots.txt + Sitemap
✅ Update robots.txt:
- Allow GPTBot, ClaudeBot, PerplexityBot, Google-Extended
- Disallow low-value paths (/admin/, /cart/)
- Add sitemap reference
✅ XML sitemap optimization:
- Split by content type (blog, guides, products)
- Set priorities (1.0 for pillar content, 0.3 for tags)
- Submit to Google Search Console
Week 5-6: JavaScript Rendering
✅ If SPA (React/Vue/Angular):
Option A: Migrate to Next.js/Nuxt.js (SSR)
Option B: Static site generation (Gatsby, Hugo)
Option C: Dynamic rendering (Prerender.io, $20/mo)
✅ If WordPress/traditional CMS:
- Ensure content in HTML (not lazy-loaded via JS)
- Remove unnecessary JavaScript
Budget: $500-2,000 (SSR migration) or $20-200/mo (dynamic rendering service)
Phase 3: Schema + Monitoring (Days 41-60)
Week 7: Schema Implementation
✅ Article schema (all blog posts)
✅ HowTo schema (step-by-step guides)
✅ FAQ schema (FAQ pages, Q&A content)
✅ Organization schema (homepage)
✅ Breadcrumb schema (navigation)
Tools:
- Schema.org markup generator (free)
- Google Tag Manager (inject schema dynamically)
- WordPress plugins (Yoast, Rank Math auto-generate)
Week 8: Monitoring Setup
✅ Server log analysis:
- Install GoAccess or AWStats
- Track GPTBot, PerplexityBot, ClaudeBot visits
✅ Google Search Console:
- Monitor Core Web Vitals report (monthly)
- Track coverage issues (404s, 500s)
✅ Quarterly review:
- Re-run PageSpeed Insights
- Check AI crawler visits (increasing trend?)
- Citation audit (are optimizations working?)
Budget: $0-50 (log analysis tools, most are free)
60-Day Total Budget: $600-2,500 (depending on SSR migration choice)
Key Takeaways:
ROI Projection:
Investment: $600-2,500 (60 days, SSR migration included)
Expected Technical Improvements:
- Core Web Vitals: "Poor" → "Good" (100%+ score increase)
- Page load: 4.5s → 1.8s (-60% improvement)
- AI crawler access: JavaScript blocking removed, full content access
SEO + GEO Impact:
- Google ranking: +5-15 positions (Core Web Vitals improvement)
- AI citation rate: +40-70% (improved content accessibility)
- Crawl budget efficiency: +120% (low-quality pages eliminated)
Revenue Impact (Example: B2B SaaS):
- Baseline organic traffic: 15,000/month
- After technical optimization: 22,500/month (+50%, ranking improvement)
- Conversion rate: 3.2%
- New leads: 240/month
- Lead value: $300 (average)
- Monthly revenue impact: $72,000
ROI: ($72,000 - $2,500) / $2,500 = 2,780% (first month)
Ongoing: Technical debt eliminated, compound growth begins
First Step:
Run a Core Web Vitals audit today (PageSpeed Insights, 5 minutes). If you see "Poor" scores → immediate priority. Technical SEO is the foundation; GEO content is built on top of it.
Technical excellence + great content = AI platform dominance.
Last Updated: January 9, 2025
This technical guide is written for developers, DevOps engineers, and technical SEO specialists. It includes code samples, command-line examples, and actionable implementation tactics. For non-technical readers: the core concepts are accessible; get developer support for the implementation.
Disclaimer:
Technical SEO results vary with site size, existing technical debt, and infrastructure complexity. ROI projections are calculated using average B2B SaaS metrics. Server costs and developer time (if outsourced) may require additional budget.