Playbook · AEO · v1.0

The AEO Audit Playbook

A scored AEO audit of your site across 7 weighted dimensions, with a ranked list of fixes ordered by impact-to-effort.

Time to run

60–90 minutes

Formats

Web · Notion · Skill

Version

v1.0

Who this is for

  • Founders, CMOs, or Heads of Demand at B2B companies $1M–$50M
  • Teams watching organic traffic flatten while ChatGPT and Perplexity get cited more often
  • Anyone who suspects AI engines are now their leakiest channel but can't prove it

Who this is NOT for

  • Pure e-commerce SKU pages (different game)
  • Local-services businesses (geo-AEO is its own beast)
  • Pre-revenue companies — fix product and positioning first

The problem

In 2026, AI engines are the new top of funnel for B2B buying research. ChatGPT, Perplexity, Gemini, and Claude don't return ten blue links — they return one synthesized answer. If your brand isn't in that answer, you don't exist for the buyer doing the research. Most teams discover the problem six months late, after pipeline has already softened. The cost of inaction is not 'we'll fall behind in SEO.' The cost is buyers stop seeing your name in the comparison phase, and your sales team starts hearing 'we hadn't heard of you' on every first call.

What you'll have

A single-page AEO Health Score (0–100), a per-dimension breakdown showing where you're strong and where you're invisible, and a ranked fix list — the top three things to attack first.

Before you start

  • Your own site URL
  • A ChatGPT or Claude account (free tier works)
  • A Perplexity account (free tier works)
  • 60–90 minutes of focused time
  • Optional: Ahrefs or Semrush for traffic baselines (~$99/mo)

The mental model

The Catalyst AEO Scoring Matrix

The mistake most teams make when thinking about AEO is treating it as 'SEO for AI.' It isn't. SEO ranks pages. AEO ranks entities. The AI engine has to (a) know you exist, (b) know what category you belong in, (c) trust your authority in that category, and (d) be able to cite you in a synthesized answer.

The Catalyst AEO Scoring Matrix breaks that into seven weighted dimensions. Each is scored 0–5, multiplied by its weight, and rolled into a 0–100 AEO Health Score.

The seven dimensions and their weights

1. Entity clarity (15%) — does the model know who you are and what you do?
2. Topical authority depth (20%) — do you have meaningful coverage of your category?
3. Citation surface (20%) — are you being cited in answers, not just listed?
4. Schema & structured data (10%) — is your structured data clean?
5. llms.txt & crawlability (10%) — can AI crawlers access and parse your content?
6. Comparison query coverage (15%) — do you show up in 'X vs Y' queries?
7. Freshness signal (10%) — are your pages dated and currency-signaling?

The bands

0–40: Invisible. The model doesn't know you exist or knows you wrongly.
41–60: Emerging. You appear in some answers but are rarely cited.
61–80: Contender. You show up in most relevant answers, sometimes as the cited source.
81+: Category leader. You're the first or second name in synthesized answers for your category.
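
The roll-up can be sketched in a few lines of Python. This is a minimal sketch, not an official calculator: the dictionary keys and the function names `aeo_health_score` and `band` are hypothetical, and the weights are stored as whole percentages so the arithmetic stays exact. Multiplying a 0–5 score by a fractional weight and by 20 is the same as multiplying by the percentage and dividing by 5.

```python
# Weights from the matrix above, as whole percentages that sum to 100.
WEIGHTS = {
    "entity_clarity": 15,
    "topical_authority_depth": 20,
    "citation_surface": 20,
    "schema_structured_data": 10,
    "llms_txt_crawlability": 10,
    "comparison_query_coverage": 15,
    "freshness_signal": 10,
}


def aeo_health_score(scores: dict[str, int]) -> float:
    """Roll seven 0-5 dimension scores into a 0-100 AEO Health Score."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the seven dimensions")
    for name, value in scores.items():
        if not 0 <= value <= 5:
            raise ValueError(f"{name} must be between 0 and 5, got {value}")
    # score x (percent / 100) x 20 simplifies to score x percent / 5.
    return sum(scores[name] * WEIGHTS[name] for name in WEIGHTS) / 5


def band(score: float) -> str:
    """Map a health score to the band names used in this playbook."""
    if score <= 40:
        return "Invisible"
    if score <= 60:
        return "Emerging"
    if score <= 80:
        return "Contender"
    return "Category leader"


# Made-up example scores, for illustration only.
example = {
    "entity_clarity": 3,
    "topical_authority_depth": 2,
    "citation_surface": 1,
    "schema_structured_data": 4,
    "llms_txt_crawlability": 0,
    "comparison_query_coverage": 2,
    "freshness_signal": 3,
}
print(aeo_health_score(example), band(aeo_health_score(example)))
```

With whole-number dimension scores the total is always a whole number, so the gaps between bands (for example between 40 and 41) never come into play.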

The proprietary takeaway: the two heaviest weights — Topical Authority Depth and Citation Surface — are the levers most teams under-invest in and most SEO agencies under-deliver on. If you have 90 hours to spend, spend 65 of them on those two dimensions.

The procedure

8 steps. Run in order.

  1. Set up your baseline

    Action

    Open a fresh ChatGPT, Claude, and Perplexity tab. In each, run this prompt verbatim: "I'm researching [your category, e.g. 'B2B AI-powered sales engagement platforms']. Who are the top 5 companies I should evaluate, and why?" Copy each response into a spreadsheet with one column per engine. Don't comment yet — collect raw data.

    Why

    AEO is engine-specific. ChatGPT, Claude, and Perplexity each weigh signals differently — ChatGPT leans on its training corpus, Perplexity leans on real-time web, Claude blends both. Auditing only one engine gives you a misleading score.

    What good looks like

    Three responses, each naming 5 companies, with explicit reasoning. You can see whether your company is named in 0, 1, 2, or 3 of them.

    Common failure mode

    Running a softball brand prompt like 'is [my company] good?' and getting a flattering answer. The model is sycophantic. You must ask a category question, not a brand question.
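
Once the three raw responses are in the spreadsheet, the "named in 0, 1, 2, or 3" count can be checked mechanically. The sketch below assumes you paste each response into a dictionary keyed by engine. The function name `engines_naming`, the company "Acme Signal", and the sample responses are all hypothetical.

```python
import re


def engines_naming(company: str, responses: dict[str, str]) -> list[str]:
    """Return the engines whose raw response names the company."""
    # Case-insensitive, whole-word match so "Acme" does not match "Acmeister".
    pattern = re.compile(rf"\b{re.escape(company)}\b", re.IGNORECASE)
    return [engine for engine, text in responses.items() if pattern.search(text)]


# Hypothetical responses, shortened for illustration.
responses = {
    "chatgpt": "1. Acme Signal: strong for mid-market teams. 2. Northwind ...",
    "claude": "Consider Northwind, Contoso, and Fabrikam.",
    "perplexity": "acme signal is often shortlisted alongside Contoso.",
}
named_in = engines_naming("Acme Signal", responses)
print(f"Named in {len(named_in)} of {len(responses)} engines: {named_in}")
```

A plain text match misses misspellings and abbreviations, so treat the count as a first pass and still read each response.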

  2. Score Entity Clarity (Dimension 1, 15% weight)

    Action

    In each engine, run: "What is [your company]? What does it do, who is it for, and what's distinctive about it?" Score 0–5:

      • 5: All three engines describe you correctly, distinctively, with the right ICP.
      • 3: Engines describe you generically but miss your distinctive angle.
      • 1: At least one engine confuses you with another company.
      • 0: Engines have no idea who you are.

    Why

    Entity clarity is foundational. If the model doesn't know what you are, no amount of citation work will fix the upstream problem.

    What good looks like

    Three responses that each get your one-line positioning right and mention your differentiator.

    Common failure mode

    Founders score themselves a 5 because they understand the description. Score for what a stranger reading it would conclude — not for what you'd write yourself.

  3. Score Topical Authority Depth (Dimension 2, 20% weight)

    Action

    Run in each engine: "What does [your company] specifically say about [your most important sub-topic]? Cite the best resources from them." Score 0–5:

      • 5: Engine quotes a specific framework, statistic, or POV, with a citation to your page.
      • 3: Engine paraphrases generic ideas, attributes loosely.
      • 1: Engine says 'they appear to focus on [X]' with no specifics.
      • 0: Engine has no specific content from you.

    Why

    Topical authority depth shares the heaviest weight (20%) with Citation Surface. If the model can't find you on the sub-topics that matter to your buyers, your category-level visibility is decorative.

    What good looks like

    Engine returns a specific framework name, a named statistic, or a direct quote — with a citation linking to your site.

    Common failure mode

    Confusing breadth for depth. 200 thin posts beat zero — but five pieces of definitive deep coverage beat 200 lightweight ones every time.

  4. Score Citation Surface (Dimension 3, 20% weight)

    Action

    In Perplexity (it shows sources), run three buyer-intent queries:

      1. "What's the best [your category] for [your ICP]?"
      2. "How should I evaluate [your category] vendors?"
      3. "What questions should I ask a [your category] vendor before buying?"

    Count: in how many is your domain in the cited sources?

      • 5: Cited in all 3.
      • 3: Cited in 1–2.
      • 1: Not cited, but mentioned in body text.
      • 0: Neither cited nor mentioned.

    Why

    Citation surface converts mentions into traffic. A mention with no citation drives zero clicks. A citation in a buyer-intent query is the AI-era equivalent of ranking on page one.

    What good looks like

    Your domain appears in source citations for buyer-intent queries — not just brand queries.

    Common failure mode

    Optimizing for being mentioned instead of being cited. They are different problems. Mentions come from being well-known. Citations come from having the most useful single page on the specific question.
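
The cited-versus-mentioned rubric in this step can be expressed as a small function. This is a sketch under stated assumptions: you copy each query's source URLs out of Perplexity by hand, and you note yourself whether the body text mentions you. The names `is_cited` and `citation_surface_score` and the domain `acme-signal.example` are hypothetical.

```python
from urllib.parse import urlparse


def is_cited(domain: str, source_urls: list[str]) -> bool:
    """True if any source URL is on the domain or one of its subdomains."""
    for url in source_urls:
        host = (urlparse(url).hostname or "").lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


def citation_surface_score(domain: str, results: list[tuple[list[str], bool]]) -> int:
    """Score Dimension 3 from (source_urls, mentioned_in_body) per query."""
    if len(results) != 3:
        raise ValueError("expected results for exactly three queries")
    cited = sum(1 for urls, _ in results if is_cited(domain, urls))
    mentioned = any(was_mentioned for _, was_mentioned in results)
    if cited == 3:
        return 5
    if cited >= 1:
        return 3
    return 1 if mentioned else 0


# Hypothetical audit data: cited in one of three queries.
results = [
    (["https://www.acme-signal.example/guide", "https://other.example/x"], True),
    (["https://other.example/y"], True),
    ([], False),
]
print(citation_surface_score("acme-signal.example", results))
```

Matching on the hostname rather than the raw URL string keeps a competitor's page that merely links to you from counting as your citation.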

  5. Score Schema & Structured Data (Dimension 4, 10% weight)

    Action

    Run your homepage and top 3 product/solution pages through Google's Rich Results Test. Note which schema types are present and which throw errors.

      • 5: Organization, Product/Service, FAQPage, and Article schema all present and valid.
      • 3: Some schema present, with gaps or errors.
      • 1: Minimal schema.
      • 0: None.

    Why

    AI engines preferentially cite content with valid structured data because it gives them confidence about facts. Highest return-per-hour-spent dimension — usually fixable in one eng sprint.

    What good looks like

    No errors in the Rich Results Test, all four core schema types present on relevant pages.

    Common failure mode

    Trusting that 'Webflow handles it' or that the developer 'set it up.' Verify with the actual test — most stacks have gaps.
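
If the test shows Organization schema is missing, a JSON-LD snippet is one common way to add it. The sketch below generates such a snippet in Python. The function name `organization_jsonld` and every value passed to it are placeholders, and the properties used (`name`, `url`, `description`, `sameAs`) are a small subset of the schema.org Organization type, not a complete or recommended set.

```python
import json


def organization_jsonld(name: str, url: str, description: str, same_as: list[str]) -> str:
    """Build a JSON-LD <script> tag for a schema.org Organization."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "description": description,
        "sameAs": same_as,  # profile URLs that identify the same entity
    }
    payload = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values; replace with your real company details.
snippet = organization_jsonld(
    name="Acme Signal",
    url="https://acme-signal.example",
    description="Sales engagement platform for mid-market B2B teams.",
    same_as=["https://www.linkedin.com/company/acme-signal-example"],
)
print(snippet)
```

Paste the output into the page head, then re-run the Rich Results Test to confirm it validates on your stack.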

  6. Score llms.txt & Crawlability (Dimension 5, 10% weight)

    Action

    Visit https://yourdomain.com/llms.txt. If it 404s, you're at 0.

      • 5: Comprehensive llms.txt with key pages, company description, key topics.
      • 3: Basic llms.txt with some content.
      • 1: Exists but is incomplete or generic.
      • 0: No llms.txt, or robots.txt blocks AI crawlers.

    Why

    llms.txt is to AI engines what sitemap.xml was to Google in 2005. Early signal, cheap to fix, disproportionate impact in 2026.

    What good looks like

    Purpose-written llms.txt that tells the model what you are, your most important pages, and what you want to be cited for.

    Common failure mode

    Copying a template llms.txt verbatim. The model gives no value to generic ones — substance matters.
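
Both halves of this step can be sketched with the Python standard library. Two assumptions are worth flagging. First, the crawler tokens in `AI_CRAWLERS` are ones commonly published by vendors, so confirm the current names in each vendor's documentation. Second, `build_llms_txt` follows the layout of the public llms.txt proposal as commonly described (title, one-line summary, then linked pages). It only produces a starting skeleton that you still need to fill with real substance. Both function names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed crawler tokens; confirm the current names in each vendor's docs.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def blocked_ai_crawlers(robots_txt: str, page_url: str) -> list[str]:
    """Return the listed crawlers that robots.txt forbids from page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, page_url)]


def build_llms_txt(name: str, summary: str, pages: list[tuple[str, str, str]]) -> str:
    """Draft an llms.txt body: title, one-line summary, then key pages."""
    lines = [f"# {name}", "", f"> {summary}", "", "## Key pages", ""]
    lines += [f"- [{title}]({url}): {note}" for title, url, note in pages]
    return "\n".join(lines) + "\n"


# Example robots.txt content that blocks one AI crawler.
robots = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_crawlers(robots, "https://example.com/pricing"))

print(build_llms_txt(
    "Acme Signal",
    "Sales engagement platform for mid-market B2B teams.",
    [("Pricing", "https://example.com/pricing", "Plans and limits")],
))
```

The check works on pasted robots.txt text, so it runs offline. Any non-empty result from `blocked_ai_crawlers` puts this dimension at 0 under the rubric above.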

  7. Score Comparison Query Coverage (Dimension 6, 15% weight)

    Action

    In ChatGPT and Perplexity, run: "Compare [your company] vs [each top competitor]. What are the trade-offs?" Run it once per competitor (3 prompts per engine if you track three top competitors).

      • 5: Engines name accurate, distinctive trade-offs that match your positioning. You come off favorably on your strengths.
      • 3: Engines name generic trade-offs without specifics.
      • 1: Engines position you as inferior or get the comparison wrong.
      • 0: Engines refuse to compare or have no information.

    Why

    Comparison queries are the highest-intent moment in the buyer journey. They're where deals are won and lost. Most companies have zero content addressing comparisons head-on and lose by default.

    What good looks like

    Engines cite a specific comparison asset on your site — a 'vs' page, comparison post, or teardown.

    Common failure mode

    Refusing to publish comparisons 'because we don't want to engage with competitors.' The market is comparing you regardless. You either control the narrative or your competitors do.

  8. Score Freshness Signal (Dimension 7, 10% weight) + final scoring

    Action

    Check three things:

      1. Are your top pages dated with a visible 'Last updated' line?
      2. Ask ChatGPT: 'What's the most recent thing [your company] has published?' Does the answer surface anything from the last 90 days?
      3. Is your sitemap.xml current?

    Score 0–5:

      • 5: Dated pages, recent freshness in AI answers, current sitemap.
      • 3: Some dating but stale answers.
      • 1: Minimal dating, mostly stale.
      • 0: No freshness signal.

    Final scoring formula:

      AEO Health Score = sum over all 7 dimensions of (score × weight × 20), with each weight written as a fraction (15% = 0.15)

    A perfect score (5 on every dimension) is 100. A 3 on every dimension is 60. Bands: 0–40 invisible, 41–60 emerging, 61–80 contender, 81+ category leader.

    Why

    Freshness is the cheapest dimension to fix and the most overlooked. Models actively de-prioritize content they think is stale.

    What good looks like

    Every important page has a visible 'Last updated 2026-Q2' line, and the model surfaces content from the last 90 days. Final score is calibrated honestly — at least one ≤3 and at least one ≥4.

    Common failure mode

    Updating the date without updating the content. Models are getting better at detecting this. Update the substance, then update the date.
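
The sitemap check in this step can be partly automated by reading each entry's `<lastmod>` value. The sketch below is one way to do that, with assumptions labeled: the 90-day cutoff borrows the window this step uses for AI answers (the playbook does not define "current" for sitemaps), and entries with a missing or partial date are treated as stale. The function name `stale_urls` and the sample URLs are hypothetical.

```python
from datetime import date, timedelta
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def stale_urls(sitemap_xml: str, today: date, max_age_days: int = 90) -> list[str]:
    """List sitemap URLs whose <lastmod> is missing or older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for entry in ET.fromstring(sitemap_xml).findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        try:
            # Full dates start with YYYY-MM-DD; any time suffix is ignored.
            is_stale = date.fromisoformat(lastmod[:10]) < cutoff
        except ValueError:
            is_stale = True  # missing or partial date: treat as unverified
        if is_stale:
            stale.append(loc)
    return stale


# Hypothetical sitemap content for illustration.
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/pricing</loc><lastmod>2026-05-01</lastmod></url>
  <url><loc>https://example.com/blog/old</loc><lastmod>2025-11-15</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""
print(stale_urls(sitemap, today=date(2026, 5, 25)))
```

A recent `<lastmod>` only shows that the date changed. It does not show that the substance changed, which is the failure mode described above.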

Decision points

Where this branches.

Audit-first or fix-first?

  • Score below 40 → audit-first. Risk of fixing the wrong thing first is too high. Spend an extra week getting baseline-correct data.
  • Score 40–69 → fix-first. You have enough signal. Start fixing the two highest-weighted dimensions (Topical Authority Depth + Citation Surface) and re-audit in 60 days.
  • Score 70+ → defensive operations. You're winning. The work is keeping you there. Monthly re-scoring becomes the rhythm.

In-house or commission outside help?

Technical dimensions (Schema, llms.txt, Freshness) are almost always in-house — one eng sprint. Content dimensions (Topical Authority, Comparison Coverage) are usually where outside help compounds. Commission when: you don't have a content team that can write with strong POVs, or your team can write but isn't producing on a weekly cadence. Keep in-house when: you have a strong opinion-haver on the team and only need production/distribution help.

Content investment or technical investment first?

If Schema and llms.txt both score below 3, fix them first. One-week eng sprint, and they unlock more value from any subsequent content work. If they're already 3+, content is the higher lever — and the only path to moving Topical Authority Depth and Citation Surface, which are 40% of your total score.
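
The two score-driven branches above can be written down as a small routing function. This is a sketch only, and it reads the thresholds a particular way: a score of exactly 70 goes to defensive operations, and the mixed case (one of Schema or llms.txt below 3, the other at 3 or above) is returned as a judgment call because the playbook does not specify it. The function name `next_move` is hypothetical.

```python
def next_move(health_score: float, schema: int, llms_txt: int) -> tuple[str, str]:
    """Return (track, first_investment) for a finished audit."""
    if health_score < 40:
        track = "audit-first"
    elif health_score < 70:
        track = "fix-first"
    else:
        track = "defensive operations"

    if schema < 3 and llms_txt < 3:
        first_investment = "technical"
    elif schema >= 3 and llms_txt >= 3:
        first_investment = "content"
    else:
        # Not covered by the playbook; decide case by case.
        first_investment = "judgment call"
    return track, first_investment


print(next_move(41, schema=4, llms_txt=0))
```

The in-house versus outside-help decision depends on team makeup rather than scores, so it is left out of the function.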

QA pass

Before you ship, check:

  • You ran each prompt verbatim, not paraphrased — wording matters
  • You ran at least two engines (ideally three) for each dimension
  • You scored for what a stranger would conclude, not what you'd write yourself
  • Your scores have at least one ≤3 and at least one ≥4 — same number everywhere means you're not really scoring
  • You named the top three fixes specifically (not 'improve our content') with an owner and a due date
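
Two of these checks lend themselves to a quick script: the calibration rule, and a first cut at where the top three fixes might sit. The sketch below ranks dimensions by how many Health Score points each one still leaves unclaimed. That covers the impact half of impact-to-effort only. Effort is not modeled, so the ranking is a starting point rather than the final fix list. The names `is_calibrated` and `largest_gaps` and the example scores are hypothetical.

```python
# Dimension weights as whole percentages, as listed in the scoring matrix.
WEIGHTS = {
    "entity_clarity": 15,
    "topical_authority_depth": 20,
    "citation_surface": 20,
    "schema_structured_data": 10,
    "llms_txt_crawlability": 10,
    "comparison_query_coverage": 15,
    "freshness_signal": 10,
}


def is_calibrated(scores: dict[str, int]) -> bool:
    """QA rule: at least one score of 3 or lower and one of 4 or higher."""
    return min(scores.values()) <= 3 and max(scores.values()) >= 4


def largest_gaps(scores: dict[str, int], top_n: int = 3) -> list[tuple[str, float]]:
    """Rank dimensions by Health Score points still available."""
    # A dimension at score s can still add (5 - s) x percent / 5 points.
    gaps = {name: (5 - scores[name]) * WEIGHTS[name] / 5 for name in WEIGHTS}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Made-up example scores, for illustration only.
scores = {
    "entity_clarity": 3,
    "topical_authority_depth": 2,
    "citation_surface": 1,
    "schema_structured_data": 4,
    "llms_txt_crawlability": 0,
    "comparison_query_coverage": 2,
    "freshness_signal": 3,
}
print(is_calibrated(scores))
print(largest_gaps(scores))
```

Each fix still needs to be named specifically, with an owner and a due date. The script only points at the dimensions where they are likely to live.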

Playbook v1.0 · Published May 25, 2026