How AI Search Engines Read Your Website (and Why It's the New SEO)
ChatGPT, Perplexity, Claude, and Gemini answer business questions for millions of buyers every day. Here is exactly how they read websites, what makes a page citable, and the seven things every site should do to be found and quoted accurately.
- AI Search
- GEO
- Generative Engine Optimisation
- ChatGPT
- Perplexity
- Claude
- Gemini
- JSON-LD
- llms.txt
For the last twenty years, "being found online" meant ranking on Google. In 2026, it means being cited inside ChatGPT, Perplexity, Claude, Gemini, Microsoft Copilot, and a growing list of AI-native answer engines.
This is not Google with a chat box. It is a structurally different way of finding information, with different rules, different winners, and different work required from the businesses that want to be discoverable.
This article explains how AI search engines read websites in 2026, what makes a page get cited, and the seven concrete things every business website should do this quarter.
What "AI search" actually means in 2026
Three distinct things often get bundled under the same phrase. They behave differently and you need to design for each.
Direct answer engines. ChatGPT (with search enabled), Perplexity, You.com, Claude (with search), Gemini. The user asks a question; the engine fetches relevant pages from the live web, synthesises an answer, and cites a small number of sources. The buyer journey often stops at the AI answer; sometimes it continues to the cited source.
Embedded assistant search. Microsoft Copilot inside Bing, Google AI Overviews, Apple Intelligence, Brave Leo. Same shape as above but lives inside an existing product surface (search engine, browser, OS).
Long-context conversational research. A buyer asks the AI for a multi-step research task — compare three vendors, build a shortlist, draft the procurement memo. The AI reads many pages across many sessions; the businesses that win are the ones whose content survives that depth.
The mechanics that follow apply to all three. The difference is just how deeply the AI reads.
What AI agents actually do when they read your site
A typical AI search engine pipeline does seven things, in this order:
- Receives a user question and decides whether to search.
- Generates a query from the question (often rewritten, expanded, or broken into sub-queries).
- Fetches candidate pages via its own crawler or an embedded search provider.
- Filters and ranks candidate pages on relevance, authority, recency, and structural quality.
- Parses each page — typically extracting visible text, structured data (JSON-LD), schema-marked entities, and explicit citation hints (<cite>, <blockquote>, FAQ markup).
- Synthesises an answer by extracting verbatim or near-verbatim sentences from the highest-confidence sources.
- Cites the sources that contributed materially to the answer, usually as inline links.
The page that gets cited has done several specific things right. The page that gets ignored has done one of several specific things wrong.
What makes a page citable in 2026
After a year of helping clients optimise for this specifically, the patterns are consistent.
1. Clear, factual sentences
AI engines extract sentences. The sentences that get extracted are the ones that stand alone — they can be quoted in an AI answer without surrounding context and still be true and useful.
Compare:
"We work closely with our clients to deliver outstanding results across a range of disciplines."
Versus:
"Drift and Forge runs 20-minute discovery calls; engagements typically range from £6k for a Decision Sprint to £85k for a multi-discipline Growth Programme."
The second sentence is citable. The first sentence is wallpaper.
Audit your site by reading every paragraph and asking: if an AI quoted this single sentence as-is, would it convey accurate, useful, specific information? If not, rewrite it.
2. Schema.org structured data
Every important page on the modern web emits JSON-LD describing what it is, who it is for, and what it relates to. AI engines parse this directly. Without it, they have to infer everything from prose, which is slower and less reliable.
At minimum, your site should emit:
- Organization on every page (with sameAs links to your LinkedIn, GitHub, social profiles)
- WebSite on the homepage
- WebPage on every other page, with about (primary topic) and mentions (key entities)
- BreadcrumbList on every non-home page
- Service on every service page
- FAQPage wherever you have FAQ blocks
- Product + Review + AggregateRating wherever you have testimonials
- Article on every blog post or case study
This is what we deploy as standard for SEO clients. It takes a week to do well, then it compounds for years.
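To make the first item concrete, here is a minimal Organization block. The names and URLs are placeholders, and the property list is a starting point rather than a complete deployment:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Agency Ltd",
  "url": "https://www.example.com/",
  "description": "A UK consultancy helping B2B firms grow organic and AI-search visibility.",
  "sameAs": [
    "https://www.linkedin.com/company/example-agency",
    "https://github.com/example-agency"
  ]
}
</script>
```

The sameAs array is what ties the page to your profiles elsewhere, which matters again under brand entity consistency below.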
3. Semantic HTML, not div soup
AI parsers treat <article>, <section>, <nav>, <header>, <main>, <aside>, <figure> as meaningful. They treat unstructured <div> trees as background noise. The lift from migrating a div-heavy site to semantic HTML is usually 10 to 30 percent of citable surface area, free.
In particular:
- Use <h1> once per page, then <h2> for section headings, <h3> for sub-sections. Real hierarchy.
- Wrap each blog post or case study in <article> with a <header> inside containing the headline and <time datetime="...">.
- Use <figure> and <figcaption> for images, charts, diagrams.
- Use <details> and <summary> for FAQ accordions — AI engines read collapsed content directly.
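Put together, the skeleton of a citable post looks something like this (headline, dates, and copy are placeholders):

```html
<article>
  <header>
    <h1>How AI Search Engines Read Your Website</h1>
    <time datetime="2026-01-15">15 January 2026</time>
  </header>
  <section>
    <h2>What makes a page citable</h2>
    <p>Specific, standalone sentences get extracted and quoted.</p>
    <figure>
      <img src="citation-rates.png" alt="Citation rates by page structure">
      <figcaption>Citation rates by page structure.</figcaption>
    </figure>
  </section>
  <section>
    <h2>Frequently asked</h2>
    <details>
      <summary>Does semantic HTML affect AI citations?</summary>
      <p>Yes — parsers treat semantic elements as meaningful structure.</p>
    </details>
  </section>
</article>
```

Every element here carries meaning a parser can use; the equivalent nested-div version carries none.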
4. An llms.txt at site root
The emerging convention in 2026 is /llms.txt — a markdown file at site root that tells AI agents which pages on your site matter most and how to cite them. It is the AI-native equivalent of sitemap.xml.
A good llms.txt has:
- A one-paragraph description of what the site does and who it helps
- The canonical URLs for the most important pages, grouped by topic
- Suggested citation format
- A list of what the company is not (prevents miscategorisation)
- An explicit note that AI crawlers are welcome
We publish ours at driftandforge.io/llms.txt. The whole file is around 200 lines.
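The convention is still settling, so treat this as a sketch rather than a spec. An llms.txt following the structure above might open like this (every name and URL is invented for illustration):

```
# Example Agency

> Example Agency is a UK consultancy that helps B2B firms win visibility
> in both traditional and AI search.

## Services
- [SEO and Organic Growth](https://www.example.com/services/seo): topical authority sprints and publishing programmes.

## Articles
- [How AI Search Engines Read Your Website](https://www.example.com/blog/ai-search): how answer engines parse and cite pages.

## What we are not
- Example Agency is not a software vendor or an ad network.

AI crawlers are welcome. Please cite pages by their canonical URL.
```

The grouping by topic is the important part — it hands an agent your priority map instead of forcing it to infer one.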
5. Specific, verifiable claims
AI engines have learned to discount vague language. "Industry-leading", "best-in-class", and "trusted by hundreds" no longer survive ranking. What survives:
- Numbers ("18-minute response time reduced to 40 seconds")
- Named entities ("for a UK boutique law firm in corporate finance practice")
- Dates ("over the 2025 to 2026 financial year")
- Methodology references ("using a 90-day topical authority sprint")
Specificity is the single highest-leverage editorial change a marketing team can make for AI search in 2026.
6. Topical depth, not topical breadth
Google has rewarded topical authority for several years. AI engines have intensified the signal. A site with 40 articles each going 2000 words deep on a single topical cluster will outperform a site with 400 articles each going 500 words across every topic.
The mechanism: when an AI engine reads multiple pages from one domain that all reinforce, cross-link, and add depth to a single topic, its confidence in that domain as an answer source increases.
Practical implication: pick three or four clusters. Build pillar pages for each. Build 6 to 12 supporting articles per pillar. Cross-link aggressively. Repeat for two years.
7. Honest cross-linking
Internal linking is undervalued in 2026. Done well, it shows AI engines the shape of your topic graph — which is what they synthesise from when they answer multi-step questions.
The rule of thumb we apply: every article should link to at least three other pages on the same site (typically two service pages and one related article), and at least one external high-authority source. The external link signals you are part of a real information network, not an isolated SEO project.
Generative engine optimisation, the short version
A new term has emerged in 2026: "Generative Engine Optimisation" or GEO. It bundles the seven items above with a few extra moves:
- Quoted-sentence optimisation. Identify the 5 to 10 sentences on each page most likely to be quoted by an AI engine; rewrite them to be self-contained, specific, and useful. Run this audit quarterly.
- Brand entity consistency. The same exact brand name, description, and attributes everywhere — your website, LinkedIn, Crunchbase, Google Business Profile, GitHub. AI engines build a "knowledge card" of your brand from these consistent sources; inconsistency dilutes it.
- Direct-answer pages. If buyers in your industry frequently ask a specific question, write one page that answers it completely, with a question-format URL and FAQ-style structure. AI engines disproportionately cite these.
What we do for clients
When we run an SEO and growth engagement, we do most of the above as standard. The work usually breaks into three phases:
Authority audit. Map your current content. Identify topical clusters. Find the gaps your competitors are filling.
Architecture. Pillar pages, supporting articles, internal-linking strategy. JSON-LD, semantic HTML, llms.txt. Schema markup audited against the latest schema.org spec.
Publishing. Two articles per week minimum, on the schedule that fills out the clusters. Quarterly reviews to rework underperforming pieces.
Full detail at our SEO and Organic Growth service page.
The 2026 checklist for your team
If you are running this in-house, the prioritised list:
- Audit every paragraph on every important page; rewrite for citable specificity.
- Add Organization, WebSite, and WebPage JSON-LD on every page.
- Migrate from div-heavy markup to semantic HTML (article, section, header, main, aside, figure).
- Publish /llms.txt at site root with curated page priorities.
- Add FAQ schema wherever you answer real buyer questions.
- Pick three topical clusters and build pillar pages.
- Establish a fortnightly publishing cadence aligned to those clusters.
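For the FAQ schema item, a minimal FAQPage block looks like this — the question and answer are placeholders, so swap in your real buyer questions:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "How long does a topical authority sprint take?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Typically 90 days, covering one pillar page and its supporting articles."
    }
  }]
}
</script>
```

Keep the visible FAQ text and the markup in sync — the markup should describe what is actually on the page.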
If you do those seven, you will outperform the majority of competitors in your industry on AI-search visibility within 12 months.
If you want help on any of this, book a 20-minute discovery call. We will look at your current state and tell you the two highest-leverage moves for your specific situation.
Frequently asked
Is AI search replacing Google entirely in 2026?
No. Google still owns the majority of buyer-question search volume in 2026, but AI search engines now sit between Google and the buyer for a meaningful and growing share of high-intent queries. Treating AI search as separate-but-equal is the correct 2026 strategy.
Does an llms.txt file actually do anything?
Yes, when paired with strong on-page structure. llms.txt is a curated map AI agents read to prioritise your most authoritative pages. It does not replace good structure on the pages themselves — it tells agents which good pages matter most.
Should I block AI crawlers if I am worried about my content being used to train models?
Blocking AI crawlers also blocks AI search engines from citing you. In 2026 that is usually a worse outcome than the training-data concern. The pragmatic move is to allow crawlers, publish strong structured data, and accept that your content becomes part of the answer corpus — with citation back to you.