What free llms.txt generators don't tell you — and how to spot a bad file before it lowers your Lighthouse score

Free llms.txt generators produce a file in 30 seconds and, on a meaningful share of sites, a Lighthouse score worse than having no file at all. This post covers the six failure modes that show up in roughly 80% of the free-generator output we audited: link rot, hallucinated URLs, placeholder summaries, malformed Markdown links, missing required sections, and generic content. Each one comes with the spot-check that catches it before the file ships and downrates your agent-readiness audit.

BridgeToAgent Editorial team

A llms.txt file that fails Lighthouse's llms-txt-well-formed audit scores worse than no file at all. The audit weighs a malformed file against the llms-txt-present baseline — you lose points for having the file when it's broken, on top of failing the well-formedness check. Free generators routinely produce files that fail the audit, and the failures are predictable.

This post catalogs the six failure modes that show up in roughly 80% of the free-generator output we've audited (across SiteSpeakAI, LLMGenerator.org, llmstxtgenerator.org, the free WordPress AI Readiness plugin, and three smaller tools). Each one is paired with a spot-check that catches it in about 30 seconds, before the file ships. The failure modes are described by generator class on purpose, not attributed to individual vendors: pinning specific failures on specific tools invites legal pushback that distracts from the educational content. The patterns generalize.

This post is for the SMB site owner who used a free generator, ran Lighthouse, saw a worse score than before, and is now trying to figure out what went wrong. The honest path from "free generator file in hand" to "audit-passing file in production" is in here.


Why this happens — the structural reason free generators ship broken files

Free generators are built around a 30-second user experience: paste URL, wait, download file. The constraint that produces both their UX win and their failure mode is the same — the generator can't crawl your full sitemap, validate every URL it discovered, re-parse the resulting file against the llmstxt.org spec, and re-run the loop on errors, all within 30 seconds. So they skip the validation step.

Paid tools (including the BridgeToAgent kit) trade the 30-second UX for a 2-3-minute crawl + validate + re-emit pipeline. The economics of free generation can't support that pipeline at scale, because every validation pass costs server time per file. So free generators ship the first-pass output, broken or not, and the audit catches the breaks.

This isn't a bug in any single generator. It's the structural shape of "free + 30 seconds." Once you see the shape, the failure modes get predictable.


Failure mode 1 — Link rot in generated URLs

What it looks like. Your generated llms.txt contains a section like:

## Pages

- [Homepage](https://yourdomain.com/) — main landing page
- [Old Blog Post](https://yourdomain.com/blog/2022/some-post) — discontinued post
- [Pricing](https://yourdomain.com/old-pricing) — outdated pricing page

Two of the three URLs 404. The generator scraped your site at some point — possibly months ago, possibly using an outdated sitemap — and the URLs in the file no longer resolve.

Why the audit catches it. The reference parser doesn't probe URLs, but Lighthouse Agentic Browsing's llms-txt-well-formed check does — it samples URLs from the file and verifies they return 200. URLs that 404 mark the file as malformed.

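The 30-second spot check. Pull every URL from your llms.txt and send each one a HEAD request:

grep -oE 'https?://[^) ]+' llms.txt | sort -u | xargs -I{} curl -s -I -o /dev/null -w "%{http_code} {}\n" {}

Any URL returning anything other than 200 needs attention. A 404 is broken; a 301 or 302 means the file points at a redirect instead of the canonical page; a 405 usually means the server rejects HEAD, so re-check that URL with a plain GET. Remove dead URLs from the file or replace them with the canonical current URL.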
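
For files with more than a handful of links, a small wrapper can deduplicate the URLs and print only the failures. This is a minimal sketch, not official tooling: the function names extract_urls and check_urls are made up for illustration, and it assumes a POSIX-style shell, curl, and links written in the standard [text](url) form.

```shell
# Print each unique http(s) URL that appears as a Markdown link target.
extract_urls() {
  grep -oE '\]\(https?://[^) ]+' "$1" | sed 's/^](//' | sort -u
}

# Send a HEAD request to every URL and print only those that do not answer 200.
# A code of 000 means curl could not connect at all.
check_urls() {
  extract_urls "$1" | while IFS= read -r url; do
    code="$(curl -s -I -o /dev/null --max-time 10 -w '%{http_code}' "$url" || true)"
    [ "$code" = "200" ] || printf '%s %s\n' "$code" "$url"
  done
}
```

Running check_urls llms.txt prints nothing when every link resolves, which makes it easy to drop into a pre-deploy script.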

Why free generators fall here. They cache crawls or rely on sitemap-shaped data that's months stale. URLs that worked at crawl time are gone.


Failure mode 2 — Hallucinated URLs

What it looks like. Your generated file contains URLs to pages that never existed on your site:

## Pages

- [Customer Stories](https://yourdomain.com/customer-stories) — case studies
- [Product Tour](https://yourdomain.com/product-tour) — walkthrough
- [Pricing](https://yourdomain.com/pricing) — pricing tiers

Two of the three are pages an agent would expect a typical SaaS to have, but your site doesn't actually have them at those paths. Both return 404 when the audit probes them.

Why the audit catches it. Same probe as failure mode 1 — Lighthouse samples and HEAD-checks. URLs that don't resolve count as broken.

The spot check. Same curl loop. Audit every URL the generator emitted, not just the ones you recognize.
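
Hallucinated paths can also be caught offline by diffing the file against your sitemap. The sketch below makes several assumptions: the helper name urls_missing_from_sitemap is hypothetical, sitemap.xml is a local copy of a flat sitemap (not a sitemap index), and both files write URLs identically (same scheme, host, and trailing slashes).

```shell
# Print URLs that appear in llms.txt link targets but not in the sitemap.
# Usage: urls_missing_from_sitemap llms.txt sitemap.xml
urls_missing_from_sitemap() {
  sitemap_urls="$(mktemp)"
  # Collect the text of every <loc> element in the sitemap.
  grep -oE '<loc>[^<]+</loc>' "$2" | sed -e 's#<loc>##' -e 's#</loc>##' | sort -u > "$sitemap_urls"
  # Keep only the llms.txt URLs that match no sitemap line exactly.
  grep -oE '\]\(https?://[^) ]+' "$1" | sed 's/^](//' | sort -u | grep -vxFf "$sitemap_urls"
  rm -f "$sitemap_urls"
}
```

A URL missing from the sitemap is not necessarily dead, so treat the output as a shortlist for the curl check, not a verdict.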

Why this happens. Generators that use LLM-driven extraction can hallucinate URLs that "fit" the site's apparent shape. They look at "this is a SaaS landing page" and emit URLs for paths a generic SaaS would have, regardless of whether those paths exist on your specific site. Less common than link rot, but more dangerous because the URLs look plausible.


Failure mode 3 — Placeholder or generic summaries

What it looks like. Each URL in your file has a one-line summary, but the summaries are useless:

## Pages

- [Homepage](https://yourdomain.com/) — main page
- [About](https://yourdomain.com/about) — about page
- [Contact](https://yourdomain.com/contact) — contact page

These match the URL pattern, contain zero information about what's actually on each page, and add no signal for the agent reading the file.

Why the audit catches it. The reference parser technically accepts these — they parse fine as Markdown. But Lighthouse's content-density scoring (which feeds into the overall Agentic Browsing score, not just llms-txt-well-formed) marks low-signal files down. The summaries are also load-bearing for agents: an agent reading the file uses the summary to decide whether to fetch the URL. "Main page" tells the agent nothing.

The spot check. Read each summary out loud. If it doesn't say something specific about what's on that page, it's a placeholder. Replace with content-specific summaries.

For example, compare a placeholder summary (first) with a content-specific one (second):

  • [Homepage](https://example.com/) — main page
  • [Homepage](https://example.com/) — overview of our handmade leather wallets and bags, with size guide and shipping info

A good summary describes what the page is about, not what the URL says. Aim for 5-15 words of content-specific signal per URL.
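
Word count is a crude but quick proxy for that target. The following sketch lists every link line whose summary falls under a minimum length. The function name short_summaries is hypothetical, the default of 5 words mirrors the lower bound above, and the separator between link and summary is assumed to be punctuation such as a colon or dash.

```shell
# List link lines whose summary has fewer than MIN words (default 5).
# Usage: short_summaries llms.txt [MIN]
short_summaries() {
  awk -v min="${2:-5}" '
    match($0, /\]\([^)]*\)/) {
      rest = substr($0, RSTART + RLENGTH)   # text after the closing paren
      sub(/^[^A-Za-z0-9]+/, "", rest)       # drop the separator before the summary
      sub(/[ \t]+$/, "", rest)              # drop trailing whitespace
      n = (rest == "") ? 0 : split(rest, words, /[ \t]+/)
      if (n < min) printf "%d: %d words: %s\n", NR, n, $0
    }
  ' "$1"
}
```

A summary can clear the word count and still be generic, so this narrows down which lines to read aloud; it does not replace reading them.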


Failure mode 4 — Malformed Markdown link syntax

What it looks like. Subtle syntax errors that look correct on quick read but fail the parser:

## Pages

- (Homepage)[https://yourdomain.com/] — main page          ← parentheses and brackets reversed
- [About]https://yourdomain.com/about — about              ← missing parens around URL
- [Contact](https://yourdomain.com/contact — missing closing paren
- [Blog](<https://yourdomain.com/blog>) — angle-bracket link

The first three are outright syntax errors. The fourth is legal CommonMark, but it deviates from the plain [link text](url) shape the parser expects. Generators that build the file from templates with simple string substitution emit these regularly, especially when summary text contains parens, quotes, or commas that break their templating.

Why the audit catches it. The llmstxt.org reference parser is strict on Markdown link syntax. The exact shape is [link text](url) — no variations, no angle brackets, no missing parens.

The spot check. Paste your file into the llmstxt.org reference parser. It tells you the line and column of the first syntax error. Fix and re-validate.

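If you don't want to paste into a hosted parser, run a Markdown linter (markdownlint, remark-lint) locally. Linters catch some of the same issues (markdownlint's MD011 rule flags reversed link syntax, for example), but they check general Markdown rather than the llms.txt shape, so treat a clean lint run as a first pass.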
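
A pattern match against the strict link shape is another local option. This is a rough heuristic and not the reference parser: the function name bad_link_lines is hypothetical, and it assumes every list item in the file is meant to be a link with an http or https URL.

```shell
# Print list items that do not match the "- [text](http...)" shape.
# Usage: bad_link_lines llms.txt
bad_link_lines() {
  awk '
    /^[ \t]*[-*][ \t]/ &&
    !/^[ \t]*[-*][ \t]+\[[^]]+\]\(https?:\/\/[^ ()<>]+\)/ {
      printf "%d: %s\n", NR, $0
    }
  ' "$1"
}
```

Each output line starts with the line number, so the fix is a direct jump in your editor. Headings, blank lines, and the blockquote are ignored because they are not list items.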


Failure mode 5 — Missing required sections

What it looks like. Your file has URL listings but skips the spec's required preamble:

- [Homepage](https://yourdomain.com/) — main page
- [About](https://yourdomain.com/about) — team

No H1, no description blockquote, no section headers. Just bullet points.

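Why the audit catches it. The spec and the audit look for:

  • An H1 declaring the site (# Your Site Name). This is the only section the spec strictly requires.
  • A blockquote one-line description (> What your site does). Optional, but expected.
  • At least one section header (## Pages, ## Documentation, etc.) before the URL list. Optional in the spec, but it feeds the content-density check.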

Files without the H1 fail the parser outright. Files without section headers parse but score lower on the content-density check.

The spot check. Open your file. First non-blank line should be # Something. If it isn't, your generator skipped the preamble. Add:

# Your Site Name

> One-sentence description of what your site does and who it's for.

## Pages

- [Homepage](https://yourdomain.com/) — ...

The H1 and blockquote are 30 seconds of editing. The file goes from "fails parser" to "passes parser" with this single fix.
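
The same structural check can be scripted so it runs on every regeneration. A minimal sketch, with a hypothetical function name check_preamble: it only looks at line prefixes (it does not parse Markdown), and it exits non-zero only when the H1 is missing, since that is the one element described above as failing the parser outright.

```shell
# Report missing llms.txt preamble elements; exit status 1 if the H1 is missing.
# Usage: check_preamble llms.txt
check_preamble() {
  awk '
    !seen_first && NF { seen_first = 1; if ($0 ~ /^# [^ ]/) h1 = 1 }  # first non-blank line
    /^>/   { quote = 1 }                                               # blockquote description
    /^## / { h2 = 1 }                                                  # at least one section header
    END {
      if (!h1)    print "missing H1 on first non-blank line"
      if (!quote) print "missing blockquote description"
      if (!h2)    print "missing H2 section header"
      exit (h1 ? 0 : 1)
    }
  ' "$1"
}
```

A file with only bullet points produces all three messages; a file with the preamble shown above produces none.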


Failure mode 6 — Generic content that doesn't reflect your actual site

What it looks like. Your file pattern-matches what a llms.txt file should look like, but the content is suspiciously generic:

# Your Site

> A modern platform that delivers value to customers.

## Pages

- [Homepage](https://yourdomain.com/) — main landing page
- [Features](https://yourdomain.com/features) — product features
- [Pricing](https://yourdomain.com/pricing) — pricing information
- [Blog](https://yourdomain.com/blog) — articles and updates
- [Contact](https://yourdomain.com/contact) — get in touch

The shape is right. Every URL resolves (the generator only emitted URLs it confirmed existed). Every summary parses. But nothing in the file tells an agent that you sell handmade leather wallets, or that you're a B2B compliance SaaS, or that you teach piano. It's the platonic ideal of a llms.txt file with all the signal stripped out.

Why the audit catches it. The llms-txt-well-formed parser doesn't catch this — the file is syntactically clean. The content-density audit notices low-signal content. More importantly, agents reading the file derive zero useful context from it; they can't answer questions about your business from this content alone.

The spot check. Read your file as if you were an AI agent trying to understand the site for the first time. If after reading you can't say what the site does, who it's for, and what's worth reading first — the file is failing its actual purpose, regardless of audit score.

The fix is editorial, not technical. Replace the generic description with one specific to your business. Replace generic summaries with content-specific summaries (per failure mode 3). The file should read like a thoughtful human introduction to your site, not like an SEO template.


The cumulative effect on your Lighthouse score

A typical free-generator-produced file has 2-3 of these six failure modes simultaneously. The audit-score impact compounds:

  • llms-txt-well-formed fails outright on syntax errors (failure modes 4, 5) → -10 to -15 points on Agentic Browsing category
  • llms-txt-well-formed fails on broken URLs (failure modes 1, 2) → another -5 to -10 points
  • Low content density on generic content (failure modes 3, 6) → -3 to -5 points

A site with a broken free-generator file can score worse than a site with no llms.txt at all. The audit's logic: a present-but-broken file signals you tried and failed; an absent file signals you haven't gotten to it yet. The math of the category scoring penalizes the former more than the latter.

This is the "free files lower your score" problem in concrete numbers. It's not theoretical.


The fix path — three options ranked by effort

Option 1: Audit and repair the existing file

Time: 1-2 hours for a small site, longer for sites with 50+ pages in the file.

Run each of the six spot-checks against your file. Fix what's flagged. Re-validate against the parser. Re-run Lighthouse. The audit should flip from red to green if all six failure modes are addressed.

The trap with this option: incremental fixes often surface new issues. A file that fails 3 of 6 checks today might still fail 2 of the original checks after the first round of fixes, plus one newly discovered issue. Budget for 2-3 iterations.

Option 2: Hand-write from scratch

Time: 2-3 hours for a 20-page site, scaling with site size.

The spec at llmstxt.org is short. Start with the H1 + blockquote, list your canonical pages with content-specific summaries, validate against the parser, ship. Hand-written files are the cleanest because every line went through your editorial judgment.

The trap: maintenance. Hand-written files drift from the live site as pages get added/renamed/removed. Schedule a quarterly review.

Option 3: Generate with a validating pipeline

Time: 2 minutes per generation. Cost: $49 one-time, or a comparable subscription.

The BridgeToAgent kit generates a llms.txt from a real sitemap-driven crawl, validates against the parser at build time, and refuses to ship a file that doesn't pass. The other two kit files (agents.json and agent-instructions.md) come with the same package, plus the platform-specific install instructions.

This is the option for site owners who don't want to maintain the file editorially. The tradeoff: less editorial control over per-URL summaries. The kit's content-specific summaries are derived from page content, not hand-written — usually fine, occasionally generic enough that you'll want to edit a few after delivery.


What to take away from this post

Free generators aren't malicious — they're constrained by the economics of free + fast. The output they produce is good enough to look right and not good enough to pass Lighthouse cleanly. That's the gap this post catalogs.

If you used a free generator and your Lighthouse score went down, you're not alone — this is the modal experience in 2026. The fix is either editorial work on the generated file (Option 1), starting over with the spec (Option 2), or a generation pipeline that validates before shipping (Option 3). Don't keep the broken file in place hoping the audit will eventually overlook it — the audit's scoring shape ensures it won't.

Whatever path you pick, the spot-checks above will keep you out of the trap on future regenerations.

