How AI agents read your docs

AI agents read your docs in a mechanical four-step loop: they crawl your pages, parse the response, structure the content for retrieval, and cite the source when answering a user's question. Every step has a way to fail. Most documentation sites fail at more than one of them.

The four-step loop

The loop is sequential. An agent that cannot crawl a page cannot parse it. An agent that cannot parse it cannot structure it. An agent that has not structured it cannot cite it. Hard caps in Obaron's rubric reflect this — blocking AI bots at Step 1 caps a site's entire AI Readiness Score regardless of how good the rest of its structure would have been, because the rest of the loop never runs.

The four steps:

  1. Crawl. The agent's named user-agent bot fetches the URL over HTTP. It honors robots.txt, follows redirects, respects rate limits, and parses the response status.
  2. Parse. The agent extracts the structured signals from the HTML response — schema markup, semantic HTML elements, agent-metadata files, code-block semantics, meta tags. The structured signals are what make the page legible as a type of content rather than just text.
  3. Structure. The agent indexes the parsed content into its retrieval system. The site's organization — its sitemap, its breadcrumbs, its llms.txt, its internal linking — informs how the agent associates this page with related queries.
  4. Cite. When a user asks the agent a question, the agent matches the query against indexed content, extracts a snippet, and emits a citation linking to the source. The citation is the loop's payoff.

AI Readiness measures how well AI systems can understand, retrieve, cite, and act on your content (see /docs/ai-readiness for the term introduction). The four steps above are the mechanical loop the rubric measures. Each step maps to one or more categories in Obaron's eight-category rubric. The mapping:

  • Step 1 (Crawl) is measured by AI Crawlability & Access.
  • Step 2 (Parse) is measured by Structured Data and the page-level half of Documentation Patterns (TechArticle / APIReference / HowTo schema, code-block markup).
  • Step 3 (Structure) is measured by Site Architecture / Navigation and the site-level half of Documentation Patterns (the agent-metadata trio: llms.txt, agents.md, .well-known/mcp.json).
  • Step 4 (Cite) is measured by AEO Readiness and Title & Identity.
  • Cross-cutting: SSR & AI Rendering and Content Structure run across Steps 2 and 3 — both steps depend on the response containing semantic structure in raw HTML.

Documentation Patterns appears at two steps because it bundles two distinct signal sets — page-level schema (Step 2) and site-level metadata files (Step 3). The category is one rubric entry; the underlying signals split across steps in the loop.

Step 1 — Crawl

Step 1 is the gate. An agent that can't reach your page can't read it. The agent's bot identifies itself with a named user-agent — typically one of GPTBot, ClaudeBot, CCBot, OAI-SearchBot, or Googlebot — and makes an HTTP request. Before requesting the page, the bot fetches /robots.txt and checks whether its user-agent is allowed.

The standard failure modes at this step:

  • robots.txt denies the AI bot's user-agent. This is the single most expensive AI Readiness mistake a documentation site can make. It is often unintentional: a site-wide Disallow: / left over from a maintenance-mode period leaks into production, or a security review adds an explicit User-agent: GPTBot / Disallow: / block that nobody revisits.
  • HTTPS misconfiguration or expired certificate. The agent's bot fails the TLS handshake and treats the site as unreachable.
  • Aggressive rate-limiting. The agent makes a polite request; the site returns 429 Too Many Requests; the agent backs off and may not retry within its discovery window.
  • noindex headers or meta tags. The agent honors these as a "do not surface in answers" signal, even if the content is otherwise excellent; both forms are shown after this list.
  • Geographic blocking. Some bot infrastructure runs in cloud regions that the site's WAF rules exclude.
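
For reference, the noindex signals named above take two standard forms: an HTTP response header and a meta tag in the page's <head>. Both are real, widely honored directives; how each AI vendor maps them onto answer-surfacing is its own policy, so treat the behavior described above as the common pattern.

X-Robots-Tag: noindex

<meta name="robots" content="noindex">

Audit both places when a page that should be cited never is; a noindex left over from a staging deploy is easy to miss because the page still renders normally for humans.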

The recovery is small, well-documented, and one-time:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: CCBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Googlebot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

A working robots.txt, a current SSL certificate, and a sitemap. For the deep treatment of bot-control mechanics — robots.txt semantics, X-Robots-Tag headers, meta name="robots", and how Obaron's own scanner respects them — see /docs/respecting-bot-controls.

Step 2 — Parse

Parse failures are the class that surprises teams most: Step 1 is a one-time configuration, but Step 2 touches every page. The agent receives the HTML response and parses it. Two extractions happen simultaneously: prose extraction (the body text the agent will eventually cite) and signal extraction (the structured metadata that tells the agent what kind of content the page is).

The standard failure modes:

  • Schema absent. A page about "Stripe webhooks" without TechArticle schema is, to the agent, just text under an H1. With TechArticle schema, it's a known-shape technical article with author, date, dependencies, and a structured place in a content type the agent recognizes. The presence of schema is the difference between citation-as-a-relevant-page and citation-as-a-canonical-source.
  • Schema malformed. Required fields missing. Dates in non-ISO formats. JSON-LD that fails to parse or fails schema validation. The agent treats malformed schema as no schema, with the additional cost of a wasted parse attempt.
  • Critical content rendered after JavaScript executes. The agent's parse runs against the raw HTML response. If the documentation content arrives via client-side fetch and React hydration, the agent sees an empty <main> element while the human sees the full docs after hydration; a sketch of what each case puts in the response follows this list. Static prerender or server-side rendering closes this gap.
  • Code blocks decorated past recognition. Syntax-highlighted code rendered as colored <span> elements without <code> and <pre> semantics breaks the agent's "this is code" handle. The agent loses the structural cue that the block is meant to be copied verbatim.
  • Heading hierarchy that skips levels. An h1 followed directly by an h3 confuses the agent's outline extraction. The agent uses heading hierarchy to identify section boundaries; skips break the section model.
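
To make the client-side rendering failure concrete, here is a simplified before-and-after sketch. The file names and copy are placeholders rather than output from any particular framework; the point is what text exists in the response before JavaScript runs.

<!-- Client-rendered page: the raw response the agent parses -->
<body>
  <div id="root"></div>
  <script src="/assets/docs-bundle.js"></script>
</body>

<!-- Prerendered or server-rendered page: the same response carries the content -->
<body>
  <main>
    <h1>Handling Stripe webhooks</h1>
    <p>Stripe sends webhook events to an HTTPS endpoint that you register for your account.</p>
  </main>
</body>

Both pages look identical in a browser after hydration; only the second one is legible at parse time.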

The recovery is a markup pass, not a rewrite. TechArticle / APIReference / HowTo schema where applicable. Real <code> and <pre> elements with language annotations. A heading hierarchy that steps down one level at a time. Critical content in the raw HTML response, not deferred to client-side rendering.
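
A minimal sketch of what that markup pass produces, using the Stripe-webhooks page as a running example. The names, dates, and the command inside the code block are placeholders; swap TechArticle for APIReference or HowTo where those types fit the page better.

<head>
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "TechArticle",
    "headline": "Handling Stripe webhooks",
    "author": { "@type": "Person", "name": "Jane Doe" },
    "datePublished": "2026-01-15",
    "dateModified": "2026-02-02"
  }
  </script>
</head>
<body>
  <main>
    <h1>Handling Stripe webhooks</h1>
    <h2>Verify the signature</h2>
    <!-- Real pre/code elements with a language annotation keep the "this is code" signal -->
    <pre><code class="language-bash">curl -X POST https://yoursite.com/stripe/webhook-test</code></pre>
  </main>
</body>

The h1-then-h2 sequence is the hierarchy that steps down one level at a time; the JSON-LD block is what turns the page from text under an H1 into a typed technical article.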

Step 3 — Structure

Step 3 is the most commonly overlooked step. After parsing, the agent indexes the page into its retrieval system. Modern AI agents combine full-text indexing, embedding-based similarity retrieval, and structured-data lookup. The agent associates each page with a topic, an author, a date, a content type, and a position in the site's hierarchy. That association is what determines whether the page surfaces when a user asks a related question.

The standard failure modes:

  • No llms.txt. The agent has no top-level summary of what the site is and what's on it. It infers from crawled pages, which is slower, less complete, and more error-prone than reading a curated index.
  • No agents.md. The agent has no contract for how it should interact with the site — preferred rate, allowed surfaces, available tools, contact for questions. Without one, the agent uses its defaults. (The agents.md specification is in flux as of 2026 — multiple proposals are circulating across Markdown, YAML, and JSON variants. Ship a structurally-valid file in whichever format you prefer; agents that read the file today are forgiving on format and strict on the absence of one.)
  • No sitemap, or a stale one. The agent's coverage is patchy because it doesn't know which pages exist. Sites that lean on internal-link discovery alone leave entire branches uncovered.
  • Breadcrumb absence. The agent cannot tell where a page lives in the site hierarchy, which weakens topical-association signals. A page about "Stripe webhooks" deep in /docs/payments/integrations/webhooks/ should associate with payments, integrations, and webhooks; without breadcrumbs, the agent has only the URL path to infer from (a BreadcrumbList sketch follows this list).
  • Internal-link patterns that orphan pages. Pages reachable only from the homepage, with no within-site cross-links to related pages, weaken the topical-association graph the agent builds.
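
A minimal BreadcrumbList sketch for that webhooks page, served in the same application/ld+json form as the TechArticle example above. The names and URLs are placeholders for whatever hierarchy your docs actually use.

{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    { "@type": "ListItem", "position": 1, "name": "Docs", "item": "https://yoursite.com/docs/" },
    { "@type": "ListItem", "position": 2, "name": "Payments", "item": "https://yoursite.com/docs/payments/" },
    { "@type": "ListItem", "position": 3, "name": "Integrations", "item": "https://yoursite.com/docs/payments/integrations/" },
    { "@type": "ListItem", "position": 4, "name": "Webhooks", "item": "https://yoursite.com/docs/payments/integrations/webhooks/" }
  ]
}

Each ListItem names a level the agent can associate the page with, which is exactly the payments, integrations, and webhooks association described above.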

The recovery is a structural pass: ship the agent-metadata trio (llms.txt, agents.md, .well-known/mcp.json), maintain a fresh sitemap, add BreadcrumbList schema on multi-level pages, and cross-link related articles within the site. Developers think about robots.txt (Step 1) and schema (Step 2) but rarely think about how their content is organized from the agent's perspective.
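
Of the trio, llms.txt is the piece you can sketch fastest. The shape below follows the commonly circulated llms.txt proposal (a Markdown file at the site root: an H1 name, a one-line blockquote summary, then sections of annotated links); the entries themselves are placeholders.

# YourSite Docs

> Developer documentation for YourSite: payments, integrations, and webhook handling.

## Docs

- [Handling Stripe webhooks](https://yoursite.com/docs/payments/integrations/webhooks/): receive, verify, and retry webhook events
- [API reference](https://yoursite.com/docs/api/): endpoints, authentication, and error codes

## Optional

- [Changelog](https://yoursite.com/changelog/): release notes by version

Keep it curated and short; the file is a reading list for agents, not a second sitemap.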

Step 4 — Cite

The payoff. When a user asks the agent a question, the agent matches the query against indexed content, extracts a passage that answers the question, and emits a citation linking to the source.

The standard failure modes:

  • No clear thesis or single-question answer. The agent extracts a paragraph that doesn't answer the user's specific question. Pages that try to be everything to everyone often end up cited for nothing in particular.
  • Citation snippet buried below preamble. The agent typically extracts the first content-bearing paragraph after the H1. Pages that open with marketing copy or "in this article we'll cover…" force the agent to extract preamble instead of the answer.
  • URLs that change on republish. A citation the agent emitted last month no longer resolves; the user clicks a 404. Some agents revalidate and update cached citations; many do not.
  • No canonical URL. The agent picks one of several URL variants (with or without query string, with or without trailing slash) and the cited URL doesn't match the page's authoritative form.
  • Page title that doesn't match the content. The agent surfaces the citation with the page's <title> as the link label. If the title is clever rather than descriptive, the citation card displays the wrong context and the user doesn't click.

The recovery: write articles with single theses. Put the answer first — the first content paragraph after the H1 is what agents extract. Keep URLs stable. Serve canonical URLs. Title each page to its primary topic, plainly.
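
A sketch of the citation-facing chrome those fixes produce. The URL and the answer copy are placeholders; the pattern is a descriptive title, one canonical URL, and an answer-bearing first paragraph directly under the H1.

<head>
  <title>Handling Stripe webhooks</title>
  <link rel="canonical" href="https://yoursite.com/docs/payments/integrations/webhooks/">
</head>
<body>
  <main>
    <h1>Handling Stripe webhooks</h1>
    <!-- First content paragraph answers the question; no preamble to extract by mistake -->
    <p>To handle a Stripe webhook, expose an HTTPS endpoint, verify each event's signature, and respond with a 2xx status promptly so the event is not retried.</p>
  </main>
</body>

If the agent extracts exactly one passage and one link label from this page, both are already the right ones.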

Why most docs fail

In the scans we've run, most public developer documentation sites fall below 50. That's not because the content is poor. It's because the structural signals were optimized for a different consumer.

The pattern is consistent across sites that look great in a browser. A beautifully designed marketing site for a developer tool, built on a no-code platform with no schema markup, JavaScript-injected content, and no llms.txt, will score poorly on AI Readiness even when it ranks well on Google. Humans love it; agents miss it. The reverse pattern is also common. An unstyled reference page in the MDN tradition — TechArticle schema everywhere, raw-HTML content, version-pinned dates — may not rank as visibly on Google search but tends to surface cleanly when AI assistants are asked the question the page answers. For the deep comparison of what AI Readiness measures versus what SEO measures, see /docs/aeo-vs-seo.

The failures cluster at Steps 2 and 3. Step 1 (Crawl) is mostly a one-time configuration; sites that pass it tend to keep passing it. Step 4 (Cite) follows from upstream investment — pages that ace Steps 1–3 generally get cited cleanly. The bulk of the score gap lives in the middle of the loop: schema absent, llms.txt absent, breadcrumb absent, content rendered post-hydration. The fixes are small individually; they compound when applied together.

Low AI Readiness scores are not an indictment of content quality. They're an indictment of packaging: the site was optimized for a different consumer than the one now evaluating it. The fix is rarely a rewrite; it is a markup pass, an llms.txt, and a schema audit.

What you can do today

Three concrete steps:

  1. Run a free Lightning Scan of your site. It returns a single-page AI Readiness Score against the canonical rubric in roughly 30 seconds, with the per-category breakdown that maps directly to the four loop steps in this article.
  2. Read /methodology. It is the canonical definition of what each rubric category measures, including the determinism property and hard caps that Step 1 surfaces.
  3. Read /docs/respecting-bot-controls for Step 1 in full. Additional per-step articles are in /docs as they ship.

The articles in /docs exist to make AI Readiness an addressable property, one decision at a time, with examples that copy-paste cleanly.