We measured Lighthouse Agentic Browsing scores on 11 sites before and after kit install — here's what moved and what didn't

Most case-study posts about an AI-readiness tool show one site, one chart, a number that lifted, and a quote from the customer. The shape of this post is different on purpose. We tested the kit on 11 real sites, recorded the Lighthouse Agentic Browsing score before and after each install, and the honest version of "what the data shows" requires more than a single chart.

Headline: the average score rose from 16.4 to 71.5 across the cohort (+55 points). No site dropped; the smallest lift was 37 points and the largest was 74. Post-install scores ranged from 60 to 80.

That headline is real and reproducible. The more useful information is in the per-audit shape — which of the nine Lighthouse Agentic Browsing audits the kit consistently closed across sites, which audits it consistently couldn't, and the segmentation that explains the variance. The audits the kit moved are the ones with the highest absolute lift per dollar. The audits it didn't are the ones that need work outside the kit's scope.

Methodology — what we measured and what we didn't

Before getting to the data, the constraints worth stating up front:

The sites. 11 sites total. Mix of platforms — Shopify, WordPress/WooCommerce, Next.js, and custom stacks. Mix of business shapes — e-commerce, marketplace, hospitality, news, .SE small business. Sites range from global brands to single-operator SMBs.

The measurement. For each site:

1. Recorded the Lighthouse Agentic Browsing category score on the site as it existed pre-install (no kit files present at the domain root, no auto-discovery <link> tags).
2. Ran the BridgeToAgent generation pipeline on the site: full sitemap crawl, platform detection, file generation with critique + verify passes.
3. Simulated install (in test environment for sites we don't own; on the real site for the one site that was ours).
4. Recorded the Lighthouse Agentic Browsing category score post-install.
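The two recording steps above can be sketched as a small helper that reads the category score out of a Lighthouse JSON report. Lighthouse reports store category scores as 0-1 values under `categories`; the category id `agentic-browsing` used here is an assumption, since this post does not state the key, and the two reports are tiny stand-ins rather than real audit output.

```python
import json

# Assumed category id; the post does not say which key the Agentic Browsing
# category uses in the report JSON.
CATEGORY_ID = "agentic-browsing"


def category_score(report: dict, category_id: str = CATEGORY_ID) -> int:
    """Return a category score on the 0-100 scale used in this post."""
    raw = report["categories"][category_id]["score"]  # stored as a 0-1 float
    if raw is None:
        raise ValueError(f"category {category_id!r} has no score")
    return round(raw * 100)


def score_delta(before: dict, after: dict) -> int:
    """Post-install score minus pre-install score."""
    return category_score(after) - category_score(before)


# Stand-in reports shaped like site 8 in the table (6 before, 80 after).
before_report = json.loads('{"categories": {"agentic-browsing": {"score": 0.06}}}')
after_report = json.loads('{"categories": {"agentic-browsing": {"score": 0.8}}}')

print(category_score(before_report), category_score(after_report),
      score_delta(before_report, after_report))
```

Keeping the delta as a function of two saved reports means the before and after runs can happen days apart without re-auditing.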

What we controlled for. Same Lighthouse version (12.4+ for category support), same throttling profile, same time-of-day (to control for any cache-warming variance). Audit ran against the live URL or test mirror with identical content.

What we didn't control for. We are not the install vendor for 10 of the 11 sites. For those sites, the "after" score reflects what the kit's output would deliver if installed, not the score the site actually has in production today. For the one site where we are the install vendor (banquet.se, an early, founder-operated customer), the v1 kit had already shipped before the second measurement run, so we have no clean baseline for that site and excluded it from the cohort.

The 10-site cohort. All numbers below reflect the 10 sites with full before/after scores. The 11th site is referenced where its operational notes inform the analysis.

Site-identity disclosure. Sites are reported anonymized by segment rather than by name. Three reasons: (1) most are not customers, and naming them implies they are, (2) the per-segment shape carries more information than per-site brand recognition, (3) honest segment-level reporting beats name-dropping when the cohort is mixed-quality on data extraction.

The score table

#	Site segment	Platform detected	Score before	Score after	Delta	Crawl quality
1	.SE leisure / niche-traffic site	Custom	30	68	+38	Healthy (7 pages)
2	Major airline	Next.js	1	70	+69	Single-page (sitemap miss)
3	Hospitality chain	Custom	1	70	+69	Single-page (sitemap miss)
4	E-commerce print-on-demand	Custom	30	67	+37	Healthy (8 pages)
5	.SE consumer site	Custom	30	70	+40	Healthy (7 pages)
6	Major Swedish news site	Custom	20	70	+50	Single-page (no sitemap)
7	Global e-commerce marketplace	Custom (bot-defended)	1	60	+59	Single-page (only 1 of 55 URLs verified)
8	Freelance services marketplace	Custom (CF-protected)	6	80	+74	Single-page (CF defense)
9	Athletic apparel e-commerce	Next.js + Shopify	15	80	+65	Healthy (8 pages)
10	Niche e-commerce	Shopify	30	80	+50	Full platform-pack match

Cohort statistics:

Average before: 16.4 (16 rounded)
Average after: 71.5 (72 rounded)
Average delta: +55 points
Minimum delta: +37 (the kit improved every site)
Maximum delta: +74
All post-install scores fall in the 60-80 range; none exceeded 80 or fell below 60.
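The cohort statistics can be recomputed directly from the score table. A minimal sketch, with the (before, after) pairs copied from rows 1-10:

```python
from statistics import mean

# (before, after) pairs copied from the score table, rows 1-10.
SCORES = [(30, 68), (1, 70), (1, 70), (30, 67), (30, 70),
          (20, 70), (1, 60), (6, 80), (15, 80), (30, 80)]

befores = [b for b, _ in SCORES]
afters = [a for _, a in SCORES]
deltas = [a - b for b, a in SCORES]

summary = {
    "avg_before": mean(befores),
    "avg_after": mean(afters),
    "avg_delta": mean(deltas),
    "min_delta": min(deltas),
    "max_delta": max(deltas),
    "after_range": (min(afters), max(afters)),
}
print(summary)
```

The unrounded averages are 16.4 before and 71.5 after, for an average delta of 55.1.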

Visualized:

Before install                          After install
●●●● (1), 3 sites                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60 (1 site)
●●●●●●● (6), 1 site                     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67-70 (6 sites)
●●●●●●●●● (15-20), 2 sites              ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80 (3 sites)
●●●●●●●●●●●● (30), 4 sites, modal
[Average before: 16.4]                  [Average after: 71.5]

What moved — the audits the kit consistently closed

Across the 10-site cohort, the kit closed six of nine audits on every site:

1. `llms-txt-present` — closed on 10/10 sites

The kit's pipeline always emits a llms.txt. Install puts it at the domain root. The audit's pass/fail check resolves green on every install where the file was uploaded to the correct path. Zero variance.

2. `llms-txt-well-formed` — closed on 10/10 sites

The kit validates against the llmstxt.org reference parser at build time. Files that don't parse cleanly don't ship. Zero variance.
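For a sense of what "parses cleanly" involves, here is a much looser structural check than the reference parser. It is a sketch under stated assumptions, not the kit's validator: it assumes only the basic llms.txt shape (a single H1 title as the first non-empty line, then H2 sections whose list items are markdown links), and the function name `llms_txt_problems` is hypothetical.

```python
import re

# A list item that starts with a markdown link, e.g. "- [Title](https://...)".
LINK_ITEM = re.compile(r"^\s*[-*]\s*\[[^\]]+\]\([^)\s]+\)")


def llms_txt_problems(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = []
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append("first non-empty line should be an H1 title")
    if sum(1 for ln in lines if ln.startswith("# ")) > 1:
        problems.append("more than one H1 heading")
    section = None
    for ln in lines:
        if ln.startswith("## "):
            section = ln[3:].strip()
        elif section and ln.lstrip().startswith(("-", "*")) and not LINK_ITEM.match(ln):
            problems.append(f"list item in section {section!r} is not a markdown link")
    return problems


GOOD = """# Example Store

> Outdoor gear shop.

## Docs

- [Shipping](https://example.com/shipping.md): delivery times
"""
BAD = "Welcome\n## Docs\n- plain text with no link\n"

print(llms_txt_problems(GOOD))
print(llms_txt_problems(BAD))
```

A check like this catches the most common hand-editing mistakes, but passing it does not guarantee the reference parser accepts the file.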

3. `agents-json-present` — closed on 10/10 sites

Same shape as llms-txt-present. Pipeline emits the file, install uploads it, audit passes.

4. `agents-json-actions-typed` — closed on 10/10 sites

The kit's typing pass refuses to emit untyped actions. Files that ship pass this audit by construction. This is the audit most-likely-to-fail on hand-written files; the kit closes it deterministically.

5. `agent-runbook-present` — closed on 10/10 sites

Same shape as the other file-presence audits.

6. `auto-discovery-links` — closed on 10/10 sites

The kit ZIP includes the three <link rel="alternate"> tags as a platform-specific snippet (Liquid for Shopify, HTML for WordPress / Webflow, TSX for Next.js, etc.). Sites that completed the install paste it pass; sites that skipped that step fail (and the post-install audit score we recorded assumes the paste was completed).
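Whether the paste step was completed can be checked by counting `<link rel="alternate">` tags in the page head. A minimal sketch using the standard-library HTML parser; the `href` and `type` values in the sample markup are hypothetical placeholders, not the kit's actual snippet.

```python
from html.parser import HTMLParser


class AlternateLinkCollector(HTMLParser):
    """Collects href values from <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("href"):
            self.hrefs.append(attr["href"])


def alternate_links(html: str) -> list[str]:
    collector = AlternateLinkCollector()
    collector.feed(html)
    return collector.hrefs


# Hypothetical head markup: file names and MIME types are placeholders.
SAMPLE_HEAD = """
<head>
  <link rel="stylesheet" href="/theme.css">
  <link rel="alternate" type="text/plain" href="/llms.txt">
  <link rel="alternate" type="application/json" href="/agents.json">
  <link rel="alternate" type="text/markdown" href="/agent-runbook.md" />
</head>
"""

print(alternate_links(SAMPLE_HEAD))
```

The stylesheet link is ignored, so a count of three indicates all three discovery tags are present.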

Six audits, six closures, zero variance. This is the load-bearing finding of the cohort. The audits the kit was designed to close, it closes on every install.

What didn't move — the audits outside the kit's scope

Three audits showed mixed results across the cohort:

7. `sitemap-discoverable` — closed on 6/10 sites

The audit checks two conditions: /sitemap.xml exists and is valid, AND /robots.txt references it via a Sitemap: directive.

6 sites had both conditions met by their existing CMS / hosting setup. The kit didn't need to touch this audit.
4 sites had the sitemap but missing the Sitemap: line in robots.txt. The kit's install README flags this as a one-line CMS edit (Yoast / RankMath dashboard for WordPress, theme robots.txt.liquid override for Shopify, etc.). Whether the audit closed on those 4 sites depends on whether the operator completed the one-line edit. We can't simulate that step in a test environment.

The honest framing: the kit doesn't fix this audit directly because the fix lives in the CMS, not in a file you can upload to the domain root. The kit's install README documents the fix path. Closure rate is operator-execution-dependent.
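The second condition of this audit (a `Sitemap:` directive in robots.txt) is easy to check before and after the CMS edit. A minimal sketch using Python's `urllib.robotparser`; it covers only the robots.txt half of the audit, not sitemap validity, and the helper names and example URLs are illustrative.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def sitemap_directives(robots_txt: str) -> list[str]:
    """Return the URLs listed in Sitemap: directives, if any."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when there are none


def missing_sitemap_line(robots_txt: str, sitemap_url: str) -> Optional[str]:
    """Return the line to add, or None if robots.txt already references the sitemap."""
    if sitemap_url in sitemap_directives(robots_txt):
        return None
    return f"Sitemap: {sitemap_url}"


WITHOUT = "User-agent: *\nDisallow: /cart\n"
WITH = WITHOUT + "Sitemap: https://example.com/sitemap.xml\n"

print(missing_sitemap_line(WITHOUT, "https://example.com/sitemap.xml"))
print(missing_sitemap_line(WITH, "https://example.com/sitemap.xml"))
```

For the four sites in this state, the function's return value is the one-line edit the install README describes.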

8. `schema-org-density` — closed on 0/10 sites

None of the 10 sites in the cohort had Schema density that passed the audit pre-install, and the kit doesn't emit Schema markup. So this audit stayed red on all 10 sites after install.

The kit's framing here is honest: Schema density is theme-side, not file-side. The post-install score of 60-80 reflects the score with this audit still failing. Sites that want to push above 80 need to do Schema work in their theme or via a Schema plugin — covered in detail in the schema-org-density on product pages deep-dive and the per-platform pillars in cluster 3.

The 20-point gap between "kit closes" and "max possible" is exactly the schema-org-density weight in the Lighthouse Agentic Browsing scoring. Closing the audit lifts the cohort average from 72 to ~90+.

9. `webmcp-annotations` — closed on 0/10 sites

No site in the cohort shipped WebMCP annotations pre-install, and the kit doesn't generate them for 2026 deliveries because the annotation conventions aren't settled.

This audit weighs 5% in the category scoring. Not closing it costs 4-5 points on the headline score. The kit's framing matches Chrome's own framing: the audit is informational rather than load-bearing in 2026; wait for spec stabilization.

The variance — what explains the 37 to 74 delta range

The smallest lift was 37 points (print-on-demand e-commerce site, 30 → 67). The largest was 74 points (freelance marketplace, 6 → 80). The 37-point spread reflects three structural factors:

Starting score matters

Sites that started at 1-6 had the most absolute room to lift — every audit the kit closed counted from a baseline of zero. Sites that started at 30 already had auto-discovery-links partially in place (often via a Yoast-emitted Schema block that satisfied half the audit) or had sitemap-discoverable already passing, so the kit's contribution was a smaller fraction of the total.

This is mechanical, not a kit limitation. The 30 → 68 site got the same per-audit lift as the 1 → 70 site; the headline number just looks smaller because the starting point was higher.
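The starting-score effect shows up when the table is split by baseline. A short sketch using the same (before, after) pairs as the score table; the two buckets (baseline of 6 or lower, baseline of exactly 30) are chosen here for illustration.

```python
from statistics import mean

# (before, after) pairs copied from the score table, rows 1-10.
SCORES = [(30, 68), (1, 70), (1, 70), (30, 67), (30, 70),
          (20, 70), (1, 60), (6, 80), (15, 80), (30, 80)]

low_start = [(b, a) for b, a in SCORES if b <= 6]     # sites starting at 1-6
modal_start = [(b, a) for b, a in SCORES if b == 30]  # sites starting at 30

low_delta = mean(a - b for b, a in low_start)
modal_delta = mean(a - b for b, a in modal_start)
low_after = mean(a for _, a in low_start)
modal_after = mean(a for _, a in modal_start)

print(f"start 1-6: avg delta {low_delta}, avg after {low_after}")
print(f"start 30:  avg delta {modal_delta}, avg after {modal_after}")
```

The low-baseline group averages a delta of 67.75 against 41.25 for the group that started at 30, yet the two groups land at nearly the same post-install average (70 and 71.25).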

Crawl quality matters for `agents.json` action count

Sites where the kit's pipeline got a healthy sitemap crawl (7-8 pages discovered) generated agents.json files with 20-30 actions. Sites with only a single page discovered (bot-defended, no sitemap, or fallback to homepage links) generated thinner manifests with 15-20 actions.

The audit passes either way — agents-json-actions-typed checks that every action present has typing, not that there's a minimum count. So the per-audit score is identical. But the underlying value of the file to an actual agent reading it is different. A bot-defended site with a single-page crawl has a workable file, but it has lower content density than the 8-page-crawl equivalent.

The honest framing: the audit score doesn't fully reflect the kit-output quality. Sites with healthy crawls get better files even when the audit shows equal scores.

Platform pack matters for action quality

The one site in the cohort where the pipeline got a fully matched Shopify platform pack (site 10, niche e-commerce) had the cleanest agents.json: 6 API shortcuts survived the verify pass (highest in the cohort), 16 actions drafted, every action grounded in real Shopify endpoints. The other sites that detected platforms partially (for example site 9, athletic apparel, which detected Next.js + Shopify but didn't fully match the pack) had the next-cleanest output. Custom-stack sites with no detection had the broadest action coverage but the least platform-grounded action shape.

All sites passed the audit. Underlying agent-usability varies.

What the cohort doesn't prove

Three claims the cohort data doesn't support, even though they'd be convenient to make:

Claim: "The kit will lift YOUR score by 55 points."

Honest framing: the cohort average is +55. Your site's starting score and ceiling depend on what's already in place (Schema, sitemap discovery, your CMS's defaults). A site with a strong CMS-emitted Schema baseline might see +30 from the kit alone, plus +20 from already-existing Schema, landing at the same 70-85 post-install. The +55 average reflects sites with weak pre-install baselines.

Claim: "All agent traffic improves equivalently to score lift."

The audit measures one dimension of agent-readiness (structured-file presence + typing + discovery). It doesn't measure content quality, action-completion rates, or freshness. Sites that score 80 post-install and ship outdated llms.txt files get worse agent outcomes than sites that score 75 with fresh files. The score is necessary but not sufficient.

Claim: "Bot-defended sites get the same kit quality."

Per the variance analysis above, single-page crawl sites get thinner kits. The audit doesn't reflect that gap. If your site sits behind aggressive bot defenses (Cloudflare with high security level, Akamai, ECS), the kit's output is workable but not optimal — the install path on those sites should include unblocking the kit's crawler before generation, covered in the install matrix.

What the cohort does prove

Three claims the data supports cleanly:

The kit closes six of nine Lighthouse Agentic Browsing audits with near-100% consistency across 10 sites of varied platform and shape. This is the file-layer the kit was designed for. It works.
Average score lift across the cohort is +55 points, with no site in the cohort dropping or staying flat. The kit is value-creating across the tested range; it doesn't have a "doesn't work for this kind of site" failure mode in the sample we tested.
The remaining three audits (sitemap-discoverable, schema-org-density, webmcp-annotations) are scope-bounded out of the kit by design. Sitemap is a one-line CMS edit, Schema is theme work, WebMCP is a spec still moving. Sites that close these audits get 80+ scores; sites that don't get 60-80. Both ranges represent a clean install.

The cohort is small (n=10) and the sites we tested aren't a random sample of the SMB-website population. The data is enough to falsify the "the kit doesn't work" claim; it isn't enough to power statistically rigorous segment analysis. A larger cohort of paying customers will produce that data over the next quarter — until then, the n=10 cohort is the honest representation of what the kit delivers.

What to do with this

If you're evaluating the kit, the three useful questions:

What's your starting score? Run the free Lighthouse Agentic Browsing audit on your site. If it's 30 or below, you're in the cohort's pre-install range (30 was the modal baseline) and the +55 lift is a reasonable expectation. If it's already 50+, your lift will be smaller because there's less to fix.
What's your ceiling? If you're on Shopify with a Dawn-derived theme, your Schema density is probably already strong — the kit + your existing Schema can push you to 85+. If you're on a stripped theme or custom stack, you'll cap at 65-75 from the kit alone unless you also do Schema work.
What does your traffic mix need? Transactional agents (Atlas, Operator) care most about agents.json typing — the kit closes that audit. Citation agents (Claude, Perplexity) care most about Schema density — the kit doesn't touch that. The agent traffic landscape post covers the per-agent reading shape.

The cohort data is one input. The empirical free audit on your actual site is the more useful input.

Lighthouse Agentic Browsing — every audit, every fix → — the canonical per-audit fix reference, including the audits the kit doesn't close
How to read your Lighthouse Agentic Browsing score → — what each score range predicts and what the per-audit detail reveals
How the BridgeToAgent kit maps to every audit → — companion post on which audits the kit closes vs reports
schema-org-density on product pages → — the audit the kit doesn't close that drives the gap between 72 and 90+
The 2026 agent-traffic landscape → — which agents drive traffic and what each one reads
Agentic Kit install matrix → — install paths across 13 platforms (and which platforms need pre-install bot-defense unblocking)
Run the free readiness audit → — five-second check on your site
