How AI agents read your docs
AI agents read your docs in a mechanical four-step loop: they crawl your pages, parse the response, structure the content for retrieval, and cite the source when answering a user's question. Every step has a way to fail. Most documentation sites fail at more than one of them.
The four-step loop
The loop is sequential. An agent that cannot crawl a page cannot parse it. An agent that cannot parse it cannot structure it. An agent that has not structured it cannot cite it. Hard caps in Obaron's rubric reflect this — blocking AI bots at Step 1 caps a site's entire AI Readiness Score regardless of how good the rest of its structure would have been, because the rest of the loop never runs.
The four steps:
- Crawl. The agent's named user-agent bot fetches the URL over HTTP. It honors robots.txt, follows redirects, respects rate limits, and parses the response status.
- Parse. The agent extracts the structured signals from the HTML response — schema markup, semantic HTML elements, agent-metadata files, code-block semantics, meta tags. The structured signals are what make the page legible as a type of content rather than just text.
- Structure. The agent indexes the parsed content into its retrieval system. The site's organization — its sitemap, its breadcrumbs, its llms.txt, its internal linking — informs how the agent associates this page with related queries.
- Cite. When a user asks the agent a question, the agent matches the query against indexed content, extracts a snippet, and emits a citation linking to the source. The citation is the loop's payoff.
AI Readiness measures how well AI systems can understand, retrieve, cite, and act on your content (see /docs/ai-readiness for the term introduction). The four steps above are the mechanical loop the rubric measures. Each step maps to one or more categories in Obaron's eight-category rubric. The mapping:
- Step 1 (Crawl) is measured by AI Crawlability & Access.
- Step 2 (Parse) is measured by Structured Data and the page-level half of Documentation Patterns (TechArticle / APIReference / HowTo schema, code-block markup).
- Step 3 (Structure) is measured by Site Architecture / Navigation and the site-level half of Documentation Patterns (the agent-metadata trio: llms.txt, agents.md, .well-known/mcp.json).
- Step 4 (Cite) is measured by AEO Readiness and Title & Identity.
- Cross-cutting: SSR & AI Rendering and Content Structure run across Steps 2 and 3 — both steps depend on the response containing semantic structure in raw HTML.
Documentation Patterns appears at two steps because it bundles two distinct signal sets — page-level schema (Step 2) and site-level metadata files (Step 3). The category is one rubric entry; the underlying signals split across steps in the loop.
Step 1 — Crawl
Step 1 is the gate. An agent that can't reach your page can't read it. The agent's bot identifies
itself with a named user-agent — typically one of GPTBot,
ClaudeBot, CCBot, OAI-SearchBot, or Googlebot —
and makes an HTTP request. Before requesting the page, the bot fetches /robots.txt and
checks whether its user-agent is allowed.
The standard failure modes at this step:
- robots.txt denies the AI bot's user-agent. This is the single most expensive mistake a documentation site can make against AI Readiness. It is often unintentional — the wildcard rule Disallow: / from a maintenance-mode period leaks into production, or a security review adds an explicit User-agent: GPTBot / Disallow: / block that nobody revisits.
- HTTPS misconfiguration or an expired certificate. The agent's bot fails the TLS handshake and treats the site as unreachable.
- Aggressive rate-limiting. The agent makes a polite request; the site returns 429 Too Many Requests; the agent backs off and may not retry within its discovery window.
- noindex headers or meta tags. The agent honors these as "do not surface in answers" — even if the content is otherwise excellent.
- Geographic blocking. Some bot infrastructure runs in cloud regions that the site's WAF rules exclude.
The recovery is small, well-documented, and one-time:
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: CCBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Googlebot
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
A working robots.txt, a current SSL certificate, and a sitemap. For the deep treatment
of bot-control mechanics — robots.txt semantics, X-Robots-Tag headers,
meta name="robots", and how Obaron's own scanner respects them — see
/docs/respecting-bot-controls.
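A candidate robots.txt can be spot-checked before it ships. The sketch below uses Python's standard-library robotparser to reproduce the "explicit GPTBot block that nobody revisits" failure mode; the file body and URL are illustrative, not any particular site's:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt with the failure mode described above:
# an explicit GPTBot block alongside a permissive wildcard group.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot", "Googlebot"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check each bot by name against a representative docs URL.
for bot in AI_BOTS:
    allowed = parser.can_fetch(bot, "https://yoursite.com/docs/webhooks")
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

Running this against your real file (fetch it, then feed the lines to `parse`) turns the most expensive Step 1 mistake into a one-line CI assertion.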
Step 2 — Parse
Parse failures are the failure class that surprises teams most — Step 1 is a one-time configuration; Step 2 touches every page. The agent receives the HTML response and parses it. Two extractions happen simultaneously: prose extraction (the body text the agent will eventually cite) and signal extraction (the structured metadata that tells the agent what kind of content the page is).
The standard failure modes:
- Schema absent. A page about "Stripe webhooks" without TechArticle schema is, to the agent, just text under an H1. With TechArticle schema, it's a known-shape technical article with author, date, dependencies, and a structured place in a content type the agent recognizes. The presence of schema is the difference between citation-as-a-relevant-page and citation-as-a-canonical-source.
- Schema malformed. Required fields missing. Dates in non-ISO formats. JSON-LD that fails to parse or fails schema validation. The agent treats malformed schema as no schema, with the additional cost of a wasted parse attempt.
- Critical content rendered after JavaScript executes. The agent's parse runs against the raw HTML response. If the documentation content arrives via client-side fetch and React hydration, the agent sees an empty <main> element while the human sees the full docs after hydration. Static prerender or server-side rendering closes this gap.
- Code blocks decorated past recognition. Syntax-highlighted code rendered as colored <span> elements without <code> and <pre> semantics breaks the agent's "this is code" handle. The agent loses the structural cue that the block is meant to be copied verbatim.
- Heading hierarchy that skips levels. An h1 followed directly by an h3 confuses the agent's outline extraction. The agent uses heading hierarchy to identify section boundaries; skips break the section model.
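The last failure mode is mechanically checkable. Here is a sketch of an outline audit using Python's standard-library HTML parser; the flagging rule (any downward jump of more than one heading level) is our simplification of how agents segment sections, not a quote of any agent's actual algorithm:

```python
from html.parser import HTMLParser

class HeadingAudit(HTMLParser):
    """Collect h1-h6 levels in document order."""
    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.levels.append(int(tag[1]))

def skipped_levels(html: str) -> list:
    """Return (from_level, to_level) pairs where the outline skips, e.g. h1 -> h3."""
    audit = HeadingAudit()
    audit.feed(html)
    return [(a, b) for a, b in zip(audit.levels, audit.levels[1:]) if b - a > 1]

page = "<h1>Webhooks</h1><h3>Retries</h3><h2>Signatures</h2>"
print(skipped_levels(page))  # the h1 -> h3 jump is flagged: [(1, 3)]
```

Run it over rendered pages in CI and a skipped level becomes a build warning instead of a silent parse degradation.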
The recovery is a markup pass, not a rewrite. TechArticle / APIReference
/ HowTo schema where applicable. Real <code> and
<pre> elements with language annotations (conventionally a
class="language-…" attribute). Heading hierarchy that flows.
Critical content in the raw HTML response, not deferred to client-side rendering.
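To make the schema half of that pass concrete, here is a minimal TechArticle JSON-LD sketch, built in Python so it round-trips through a JSON validator. The @context and @type are real schema.org values; every field value below is a placeholder for your page, and the exact field set your validator requires may differ:

```python
import json

# Minimal TechArticle JSON-LD. Field values are placeholders.
tech_article = {
    "@context": "https://schema.org",
    "@type": "TechArticle",
    "headline": "Handling Stripe webhooks",
    "author": {"@type": "Organization", "name": "Yoursite"},
    "datePublished": "2026-01-15",   # ISO 8601, per the malformed-schema note above
    "dateModified": "2026-02-01",
    "proficiencyLevel": "Beginner",
}

# Emit the body of a <script type="application/ld+json"> tag for the page head.
print(json.dumps(tech_article, indent=2))
```

Because the object is plain data, the same dictionary can feed both your page template and a CI check that asserts required fields are present and dates parse as ISO 8601.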
Step 3 — Structure
Step 3 is the most commonly-overlooked step. After parsing, the agent indexes the page into its retrieval system. Modern AI agents combine full-text indexing, embedding-based similarity retrieval, and structured-data lookup. The agent associates each page with a topic, an author, a date, a content type, and a position in the site's hierarchy. That association is what determines whether the page surfaces when a user asks a related question.
The standard failure modes:
- No llms.txt. The agent has no top-level summary of what the site is and what's on it. It infers from crawled pages, which is slower, less complete, and more error-prone than reading a curated index.
- No agents.md. The agent has no contract for how it should interact with the site — preferred rate, allowed surfaces, available tools, contact for questions. Without one, the agent uses its defaults. (The agents.md specification is in flux as of 2026 — multiple proposals are circulating across Markdown, YAML, and JSON variants. Ship a structurally-valid file in whichever format you prefer; agents that read the file today are forgiving on format and strict on the absence of one.)
- No sitemap, or a stale one. The agent's coverage is patchy because it doesn't know which pages exist. Sites that lean on internal-link discovery alone leave entire branches uncovered.
- Breadcrumb absence. The agent cannot tell where a page lives in the site hierarchy, which weakens topical-association signals. A page about "Stripe webhooks" deep in /docs/payments/integrations/webhooks/ should associate with payments, integrations, and webhooks; without breadcrumbs, the agent has only the URL path to infer from.
- Internal-link patterns that orphan pages. Pages reachable only from the homepage, with no within-site cross-links to related pages, weaken the topical-association graph the agent builds.
The recovery is a structural pass: ship the agent-metadata trio (llms.txt,
agents.md, .well-known/mcp.json), maintain a fresh sitemap, add
BreadcrumbList schema on multi-level pages, and cross-link related articles within
the site. Developers think about robots.txt (Step 1) and schema (Step 2) but rarely
think about how their content is organized from the agent's perspective.
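The llms.txt piece of that trio is small enough to sketch in full. The shape below follows the commonly-circulated draft proposal (an H1 site name, a blockquote summary, H2 sections of annotated links); the URLs and descriptions are placeholders, and since the proposal is still settling, treat the exact section names as a convention rather than a requirement:

```markdown
# Yoursite

> Developer documentation for Yoursite's payments API: webhooks,
> integrations, and endpoint reference material.

## Docs

- [Webhooks](https://yoursite.com/docs/payments/integrations/webhooks/): verifying, retrying, and debugging webhook events
- [API reference](https://yoursite.com/docs/api/): endpoint-by-endpoint reference with request and response schemas

## Optional

- [Changelog](https://yoursite.com/changelog/): dated release notes
```

A curated index this size already beats inference from crawled pages: the agent gets the site's purpose and its most important entry points in one fetch.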
Step 4 — Cite
The payoff. When a user asks the agent a question, the agent matches the query against indexed content, extracts a passage that answers the question, and emits a citation linking to the source.
The standard failure modes:
- No clear thesis or single-question answer. The agent extracts a paragraph that doesn't answer the user's specific question. Pages that try to be everything to everyone often end up cited for nothing in particular.
- Citation snippet buried below preamble. The agent typically extracts the first content-bearing paragraph after the H1. Pages that open with marketing copy or "in this article we'll cover…" force the agent to extract preamble instead of the answer.
- URLs that change on republish. A citation the agent emitted last month no longer resolves; the user clicks a 404. Some agents revalidate and update cached citations; many do not.
- No canonical URL. The agent picks one of several URL variants (with or without query string, with or without trailing slash) and the cited URL doesn't match the page's authoritative form.
- Page title that doesn't match the content. The agent surfaces the citation with the page's <title> as the link label. If the title is clever rather than descriptive, the citation card displays the wrong context and the user doesn't click.
The recovery: write articles with single theses. Put the answer first — the first content paragraph after the H1 is what agents extract. Keep URLs stable. Serve canonical URLs. Title each page to its primary topic, plainly.
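URL stability is also checkable. The sketch below implements one canonicalization policy (ours for illustration, not a standard): https scheme, lowercase host, no query string, no fragment, single trailing slash. Whatever policy you pick, the point is that your canonical link element and your sitemap agree on it:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Collapse common URL variants to one authoritative form.

    Illustrative policy: https, lowercase host, no query, no fragment,
    single trailing slash. Adjust to your own canonical rules.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") + "/" if parts.path not in ("", "/") else "/"
    return urlunsplit(("https", parts.netloc.lower(), path, "", ""))

variants = [
    "https://Yoursite.com/docs/webhooks",
    "https://yoursite.com/docs/webhooks/?utm_source=chat",
    "https://yoursite.com/docs/webhooks/#retries",
]
print({canonicalize(u) for u in variants})  # all three collapse to one URL
```

Asserting that every URL in the sitemap is already in canonical form catches the "agent cites the wrong variant" failure before an agent ever does.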
Why most docs fail
In the scans we've run, most public developer documentation sites score below 50. That's not because the content is poor. It's because the structural signals were optimized for a different consumer.
The pattern is consistent across sites that look great in a browser. A beautifully-designed
marketing site for a developer tool, built on a no-code platform with no schema markup,
JavaScript-injected content, and no llms.txt, will score poorly on AI Readiness even
when it ranks well on Google. Humans love it; agents miss it. The reverse pattern is also common.
An unstyled reference page in the MDN tradition — TechArticle schema everywhere,
raw-HTML content, version-pinned dates — may not rank as visibly on Google search but tends to
surface cleanly when AI assistants are asked the question the page answers. For the deep
comparison of what AI Readiness measures versus what SEO measures, see
/docs/aeo-vs-seo.
The failures cluster at Steps 2 and 3. Step 1 (Crawl) is mostly a one-time configuration; sites
that pass it tend to keep passing it. Step 4 (Cite) follows from upstream investment — pages that
ace Steps 1–3 generally get cited cleanly. The bulk of the score gap lives in the middle of the
loop: schema absent, llms.txt absent, breadcrumb absent, content rendered
post-hydration. The fixes are small individually; they compound when applied together.
Low AI Readiness scores are not an indictment of content quality. They're an indictment of
evaluation. The site is being evaluated for the wrong consumer. The fix is rarely a
rewrite; it is a markup pass, an llms.txt, and a schema audit.
What you can do today
Three concrete steps:
- Run a free Lightning Scan of your site. It returns a single-page AI Readiness Score against the canonical rubric in roughly 30 seconds, with the per-category breakdown that maps directly to the four loop steps in this article.
- Read /methodology. It is the canonical definition of what each rubric category measures, including the determinism property and hard caps that Step 1 surfaces.
- Read /docs/respecting-bot-controls for Step 1 in full. Additional per-step articles are in /docs as they ship.
The articles in /docs exist to make AI Readiness an addressable property, one decision at a time, with examples that copy-paste cleanly.