How Crawling and Indexing Affect AI Search
A West Columbia parent, worried about a slowly changing mole on her teenager, opens ChatGPT and asks, "Dermatologist in West Columbia SC who sees teenagers, accepts BCBS, can do same-week skin checks, and treats moles on darker skin tones." Two clinics appear in the answer. One of the clinics that goes unnamed has actually built substantive content on evaluating moles on darker skin tones, but its JavaScript-heavy site renders that content only after extensive client-side processing, which most AI crawlers don't execute. The AI doesn't know the content exists. Crawling and indexing, long treated as technical SEO concerns, are now foundational to AI search as well.
This article explains how AI crawlers work, how indexing differs from traditional search indexing, and the practical steps to ensure your content gets seen.
The Crawl Pipeline Reality
~10-30%
Estimated share of small-business websites where significant portions of content are effectively invisible to AI crawlers — through JavaScript rendering, robots.txt blocks, server issues, or other technical barriers. Most owners are unaware until they audit specifically.
The AI Crawlers You Need to Know About
Several distinct crawlers feed AI surfaces. Each behaves slightly differently:
GPTBot (OpenAI)
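OpenAI's primary crawler. Per OpenAI's documentation, it collects content that may be used to train OpenAI's models; live search citations are handled by OAI-SearchBot (below). Respects robots.txt; most small-business sites that want AI visibility should allow it.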
OAI-SearchBot (OpenAI)
Specifically focused on ChatGPT Search functionality. Reads pages to support live search citations.
ClaudeBot (Anthropic)
Anthropic's crawler for Claude. Similar function to GPTBot.
PerplexityBot (Perplexity)
Perplexity's crawler. Critical for inclusion in Perplexity's source-cited answers.
Google-Extended
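Google's AI-specific control token. It is a robots.txt token rather than a separate crawler: Google's existing crawlers do the fetching, and Google-Extended controls whether Google's AI products (Gemini, formerly Bard) can use your content. It does not affect inclusion in Google Search. Default is allowed unless explicitly blocked.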
Applebot-Extended
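Apple's AI-specific control token. It does not crawl pages itself; it governs whether content fetched by Applebot can be used for Apple's AI features, including Apple Intelligence and Siri's AI capabilities.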
Googlebot, Bingbot
Traditional search-engine crawlers. Their indexes still feed AI surfaces (as discussed in the previous article on search engines feeding AI models).
What Each Crawler Looks For
Details differ, but most AI crawlers share a common baseline of capabilities:
- HTML parsing (good).
- Limited JavaScript execution (worse than Googlebot in many cases).
- Basic CSS understanding for layout context (limited).
- Schema.org JSON-LD extraction (good).
- Image alt-text reading (good).
- Link following, generally honoring nofollow hints on links and noindex directives on pages (good).
- PDF parsing (variable — generally less reliable than HTML).
What AI crawlers typically don't do well:
- Heavy JavaScript rendering (most are limited compared to Googlebot).
- Form interactions or click-to-reveal content.
- Authenticated content behind logins.
- Content rendered only after extensive user interaction.
The Crawl-to-Index Pipeline
For your West Columbia dermatology clinic, the practical pipeline:
- Crawler arrival. An AI crawler visits a URL on your site (often discovered through sitemap, links from other sites, or direct submission).
- Server response. Your server returns HTML (or a 404, 503, or redirect).
- Parsing. The crawler parses the HTML, extracting text, structure, links, schema.
- Storage. Parsed content is stored in the AI vendor's internal representation (typically embedded for later retrieval).
- Retrieval. When a relevant query arrives, the AI retrieves from this internal representation and uses the content in answering.
Breakdowns at any stage cause your content to be effectively invisible. The most common failure points:
Failure: Robots.txt blocks
An overly strict robots.txt blocks AI crawlers. Some platforms block by default; some site owners explicitly block to "protect content." Either way, the AI can't read what's blocked.
Failure: JavaScript-only rendering
If your content (especially provider bios, service descriptions, FAQ) only appears after client-side React/Vue rendering, AI crawlers may miss it entirely. Server-side rendering or pre-rendering is the fix.
Failure: Slow server response
Crawlers have time budgets. A page that takes 8 seconds to return HTML may be abandoned before the crawler completes parsing.
Failure: Authentication walls
Patient-portal-style content behind logins is invisible to AI crawlers.
Failure: Indexing directives
Noindex meta tags, X-Robots-Tag headers, or canonical-URL mismatches can exclude pages from AI retrieval.
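One way to check a page for these directives is sketched below in Python. This is a minimal sketch, not a complete audit: the function name find_noindex and the set of meta names checked are assumptions for illustration, and the caller is assumed to supply the page's HTML and response headers. Canonical-URL mismatches are not covered.

```python
from html.parser import HTMLParser

# Meta names inspected here are an assumption for illustration; extend as needed.
ROBOTS_META_NAMES = {"robots", "googlebot"}


class RobotsMetaParser(HTMLParser):
    """Collects the content of robots-style <meta> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []  # list of (meta name, lowercased content)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ROBOTS_META_NAMES:
            self.directives.append((name, attr.get("content", "").lower()))


def find_noindex(html, headers=None):
    """Return human-readable reasons this page may be excluded from indexing."""
    reasons = []
    parser = RobotsMetaParser()
    parser.feed(html)
    for name, content in parser.directives:
        if "noindex" in content:
            reasons.append(f'meta name="{name}" contains noindex')
    # Header names are case-insensitive, so compare in lowercase.
    for key, value in (headers or {}).items():
        if key.lower() == "x-robots-tag" and "noindex" in value.lower():
            reasons.append("X-Robots-Tag header contains noindex")
    return reasons
```

An empty list means neither directive was found in the HTML or headers you passed in; it does not prove the page is indexed.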
Failure: Crawler-specific blocks
Some sites block GPTBot or ClaudeBot specifically while allowing Googlebot. This is increasingly common in publisher contexts; less appropriate for small-business sites that want AI visibility.
The core principle: AI crawling is mostly invisible to site owners — you can have excellent content that AI crawlers can't access, and never know until you audit. The discipline is to verify crawlability explicitly rather than assume.
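One way to make crawler activity visible is to count crawler requests in your server's access log. The sketch below assumes the log uses the common "combined" format, where the user agent is the last double-quoted field; the function name count_crawler_hits is hypothetical. Google-Extended and Applebot-Extended are left out because they are robots.txt tokens and are not expected to appear as user agents. User-agent strings can be spoofed, so treat the counts as a rough signal.

```python
import re
from collections import Counter

# Substrings of user agents for crawlers named in this article.
CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Googlebot", "Bingbot",
]

# Assumption: combined log format, user agent is the last quoted field on the line.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_crawler_hits(log_lines):
    """Count requests per crawler token across an iterable of access-log lines."""
    hits = Counter()
    for line in log_lines:
        match = _LAST_QUOTED_FIELD.search(line)
        if not match:
            continue  # line does not end with a quoted field; skip it
        user_agent = match.group(1).lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break  # count each request once
    return hits
```

A crawler that never appears over several weeks of logs is worth investigating: it may be blocked, or it may simply not have discovered the site yet.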
How to Verify AI Crawler Access
Step 1: Check your robots.txt
Open yoursite.com/robots.txt in a browser. Look for any rules blocking GPTBot, ClaudeBot, PerplexityBot, Google-Extended, or Applebot-Extended. Common patterns to remove:
# Bad: blocks AI crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
Replace with explicit allow rules or omit specific bot directives so they fall under your general allow.
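The same check can be scripted with Python's standard-library robots.txt parser. This is a minimal sketch: the function name blocked_crawlers and the token list are assumptions for illustration, and the robots.txt text is passed in as a string rather than fetched. Note that Python's parser applies rules in file order, which can differ from crawlers that use longest-match precedence, so treat results for files that mix Allow and Disallow as approximate.

```python
from urllib.robotparser import RobotFileParser

# Tokens from the checklist above.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "Applebot-Extended",
]


def blocked_crawlers(robots_txt, path="/"):
    """Return the tokens in AI_CRAWLER_TOKENS that may not fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in AI_CRAWLER_TOKENS if not parser.can_fetch(token, path)]


# The "bad" pattern shown above.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""
print(blocked_crawlers(robots_txt))  # ['GPTBot', 'ClaudeBot']
```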
Step 2: Test JavaScript rendering
Disable JavaScript in your browser (Chrome DevTools → Command Palette → Disable JavaScript). Reload your site. Can you still read all the substantive content? If important sections (service descriptions, provider bios, FAQ) are missing, you have a JavaScript-rendering problem affecting AI crawlers.
Step 3: Check the page source
View the page source (Ctrl+U or Cmd+U). Search for substantive content from the visible page. If the content is in the source, it's in the HTML; AI crawlers can read it. If the content is missing from source, it's JS-rendered and crawlers may miss it.
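The view-source check can be automated for a list of phrases you expect on the page. The sketch below assumes you have saved the raw HTML response (before any JavaScript runs) as a string; missing_phrases is a hypothetical helper name. It ignores text inside script and style tags, since content that exists only inside a script is not readable page text.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text from raw HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_phrases(raw_html, phrases):
    """Return the phrases that do not appear in the HTML's readable text."""
    parser = VisibleTextParser()
    parser.feed(raw_html)
    parser.close()
    # Collapse whitespace and compare case-insensitively.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [phrase for phrase in phrases if phrase.lower() not in text]
```

Any phrase returned is visible to a customer's browser but absent from the initial HTML, which is the pattern described in Step 2.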
Step 4: Use Google Search Console URL inspection
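Google's URL Inspection tool shows you what Googlebot sees. AI crawlers differ, but Googlebot's view is the closest free reference. Because Googlebot renders JavaScript more completely than most AI crawlers, treat this as a one-way test: if Googlebot can't see your content, AI crawlers probably can't either, but content Googlebot can see may still be missing for crawlers that skip JavaScript.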
Step 5: Submit a fresh sitemap
An up-to-date sitemap.xml submitted to Google Search Console helps Googlebot discover all your pages. AI crawlers cannot see Search Console submissions, so also reference the sitemap from robots.txt with a Sitemap line, where any crawler can find it.
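To confirm the sitemap actually lists the pages you care about, a short script can compare it against your own list. This is a sketch under stated assumptions: the sitemap is a standard urlset document (not a sitemap index), it is passed in as a string, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return the <loc> URLs listed in a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def missing_from_sitemap(sitemap_xml, expected_urls):
    """Return the expected URLs that the sitemap does not list."""
    listed = set(sitemap_urls(sitemap_xml))
    return [url for url in expected_urls if url not in listed]
```

Pass in the URLs of your service pages, provider bios, and FAQ; anything returned by missing_from_sitemap is a page crawlers must find through links alone.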
Common Crawler-Visibility Issues in Healthcare Sites
Healthcare sites have several patterns that often cause crawler problems:
Issue 1: Provider directories rendered via JavaScript
Many healthcare-CMS templates render the provider list via JavaScript on page load. AI crawlers may see only the empty container, not the populated names and credentials.
Issue 2: Service descriptions in modals or accordions
Content hidden behind click-to-expand is sometimes invisible to crawlers that don't simulate clicks.
Issue 3: PDFs of patient forms or service-detail sheets
Critical content rendered only as PDF is harder for AI to parse than equivalent HTML.
Issue 4: Patient-portal content blocking the main site
Sites where the patient portal is the primary surface, with the marketing site as secondary, often have authentication blockers preventing AI access.
Issue 5: Excessive third-party scripts slowing render
Analytics, chat widgets, scheduling widgets, marketing tags — each adds load time. Cumulative slowness can push render past crawler timeouts.
Common mistake: Assuming that "the site looks fine to a customer" means the site looks fine to AI crawlers. The two perspectives diverge significantly when JavaScript rendering, modal-hidden content, or slow scripts are involved. Audit crawlability specifically — don't assume.
See What AI Crawlers Actually See On Your Site
Our free scan emulates AI crawler behavior on your site, identifies content that's invisible to crawlers, and produces a prioritized fix plan.
Run Your Free Crawl Audit
Practical Fixes for a West Columbia Dermatology Clinic
Fix 1: Allow AI crawlers in robots.txt
Default-allow unless you have specific reason to block. For a dermatology clinic wanting AI visibility, the defaults should be:
User-agent: *
Allow: /
Disallow: /patient-portal/
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap.xml
Fix 2: Server-side render or pre-render critical content
If your CMS uses heavy JavaScript, configure server-side rendering for service-page content, provider-bio content, and any FAQ. The content should be in the initial HTML response.
Fix 3: Move content out of modals and accordions
Make substantive content visible in the DOM by default. CSS can still collapse it visually if you want the accordion UX, but the content should exist in the HTML for crawlers to read.
Fix 4: Convert critical PDFs to HTML
Patient-information sheets, service-detail documents, "what to expect" guides — render as HTML pages. PDFs are second-class content for AI parsing.
Fix 5: Reduce script-load weight
Audit third-party scripts. Remove or defer those that aren't essential. Reduce render-blocking impact.
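A first pass at that audit can be scripted. The sketch below lists external scripts loaded from other hosts and flags those without defer or async, which block rendering while they load. The function name third_party_scripts and the site_host parameter are assumptions for illustration, and the script only sees tags present in the HTML you pass in, not scripts injected later by other scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ScriptAuditParser(HTMLParser):
    """Records third-party <script src> tags and whether they block rendering."""

    def __init__(self, site_host):
        super().__init__()
        self.site_host = site_host.lower()
        self.findings = []  # list of (script host, blocks_rendering)

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr = dict(attrs)
        src = attr.get("src")
        if not src:
            return  # inline script, no external request
        # Relative URLs have no hostname, so they belong to the site itself.
        host = (urlparse(src).hostname or self.site_host).lower()
        if host == self.site_host:
            return
        is_module = (attr.get("type") or "").lower() == "module"  # deferred by default
        blocking = not ("defer" in attr or "async" in attr or is_module)
        self.findings.append((host, blocking))


def third_party_scripts(html, site_host):
    """Return (host, blocks_rendering) for each script loaded from another host."""
    parser = ScriptAuditParser(site_host)
    parser.feed(html)
    return parser.findings
```

Scripts flagged as blocking are the first candidates to defer or remove.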
Fix 6: Submit a clean sitemap
Generate a current XML sitemap; submit to Google Search Console; ensure it lists every page you want crawled.
Fix 7: Use Google Search Console for monitoring
Monitor coverage and indexing status. Pages excluded from Google's index typically face AI-crawler problems too.
What Happens After AI Crawlers Index Your Content
Once content is successfully crawled and indexed:
Retrieval at query time
When a relevant user query arrives, the AI retrieves your content as a candidate. Strong indexed presence increases retrieval probability.
Recency-weighted retrieval
Recently updated content tends to be preferred. Stale content (for example, last updated 18 months ago) may be de-prioritized.
Cross-reference checking
Your content is cross-checked against other indexed sources. Consistency strengthens; inconsistency weakens.
Quote extraction
For FAQ schema content and other quote-ready blocks, the AI extracts specific quotes for use in answers.
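Before relying on FAQ schema, it is worth confirming that the JSON-LD is present in the raw HTML and parses cleanly. The sketch below pulls question and answer pairs from FAQPage blocks. It is an illustration of checking your own markup, not a description of how any AI vendor extracts quotes; the helper name faq_pairs is hypothetical, and JSON-LD nested under an @graph key is not handled.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects parsed JSON-LD blocks from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is skipped, which is itself a finding


def faq_pairs(html):
    """Return (question, answer) pairs from FAQPage JSON-LD in the HTML."""
    parser = JsonLdParser()
    parser.feed(html)
    pairs = []
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if not isinstance(item, dict) or item.get("@type") != "FAQPage":
                continue
            entities = item.get("mainEntity", [])
            if isinstance(entities, dict):
                entities = [entities]
            for question in entities:
                answer = question.get("acceptedAnswer", {})
                pairs.append((question.get("name", ""), answer.get("text", "")))
    return pairs
```

An empty result on a page that visibly shows an FAQ suggests the schema is missing, malformed, or injected by JavaScript after load.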
Trust signaling
Authorship, credentials, and citation patterns shape how confidently the AI uses your content in recommendations.
Common mistake: Confusing "crawlable" with "well-indexed." A page that crawlers can technically access but that contains thin content or weak schema gets indexed but is rarely retrieved or cited. Crawlability is the floor; content quality and structure determine actual visibility.
Why West Columbia dermatology clinics have a clean opening: The West Columbia / Cayce / Lexington-County dermatology market has roughly 4-6 practices, with most running on healthcare-CMS templates that have at least one significant crawler-visibility issue (heavy JS rendering, hidden FAQ content, PDF-locked patient info). A clinic that fixes these issues plus invests in AI-friendly content typically becomes the AI's default named recommendation for several specialty queries within 90-120 days.
The Bottom Line
Crawling and indexing remain essential infrastructure for AI search visibility. The West Columbia dermatology clinic with clean crawlability and well-indexed substantive content gets named when the parent asks ChatGPT about her teenager's mole. The clinic with JavaScript-rendered content or hidden FAQ blocks does not — and the crawler-visibility gap is often invisible to the owner until they audit specifically. Verify rather than assume.
Start today: Open your site with JavaScript disabled. Read what's visible. If important content is missing — provider bios, service descriptions, FAQ — that's your first day of crawler-visibility work. The fix usually unlocks substantial AI-visibility lift.
Get a Crawler-Visibility Audit and Fix Plan
Our free scan tests your site's accessibility to GPTBot, ClaudeBot, PerplexityBot, and other AI crawlers — and emails you a prioritized fix plan.
Run Your Free Crawlability Plan
Sources & Further Reading
- OpenAI: GPTBot and OAI-SearchBot documentation
- Anthropic: ClaudeBot documentation
- Perplexity AI: PerplexityBot documentation
- Google: Google-Extended documentation and robots.txt guidance (2024-2026)
- Apple: Applebot-Extended documentation
- Schema.org: MedicalBusiness, Dermatologist, Service, Person type documentation
- Google Search Console: URL inspection and coverage tools
- American Academy of Dermatology (AAD): Practice marketing and patient-communication guidance
- Heaston Innovations engagements: observed crawler-visibility outcomes across Midlands healthcare, dermatology, and professional-services practices (2024-2026)
Note: The 10-30% invisible-content figure reflects observed averages in Heaston Innovations engagements across small-business sites; specific CMS and category variation matters. The West Columbia dermatology examples are illustrative.