Sitemap and Canonical Fixes: Stopping a 371-URL PageRank Leak

Two small infrastructure mistakes can quietly bleed a marketing site for months. We found two of them this week, both on the same axis — the URLs we tell Google about versus the URLs that actually exist — and the combined leak was meaningful enough to warrant writing down.

The 88KB static sitemap

The first finding came from a routine audit of Search Console crawl errors. We had been seeing a slow drip of 404 reports against URLs ending in .html (restoration-services.html, case-studies-hotel-flood.html, dozens like it) that nobody on the team recognized. Those routes do not exist on the live site. The live site has used extensionless routing since the Astro migration in March 2026.

Where were they coming from?

A find against the repo answered the question in one shot. There was an apps/website/sitemap.xml — an 88KB static file, checked in to the repo at some point in 2023, never touched since. Three hundred and seventy-one <url> entries, every single one of them ending in .html. We had been serving this file at the root of the marketing site for a year and change, and Google had been faithfully crawling it the entire time.

Three hundred and seventy-one URLs we told Google to crawl. Zero of them resolved. Every fetch returned a 404. Every 404 was a small signal of poor site quality, a small drag on the perceived health of the host, and a small leak of crawl budget away from the URLs that actually mattered.

The canonical-vs-host mismatch

The second finding was subtler. Our <link rel="canonical"> tags pointed at https://proofco.ai/... — the apex host, no www. The server, meanwhile, was configured to 301 from proofco.ai to www.proofco.ai. We had been doing this for almost as long as the site has existed.

The mechanics: a Googlebot resolves the canonical, follows the redirect, lands on the www host. Google then has to decide which URL to consolidate — the canonical it was told to honor, or the resolved final URL. In practice, Google is forgiving here; it usually picks the resolved host. But “usually” is doing a lot of work. Every redirect is an opportunity for signal dilution, every mismatch is a question the crawler has to answer, and every question is friction at scale.
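The mismatch is easy to state in code. A minimal sketch, assuming nothing about the stack beyond standard URL parsing; `canonicalMatchesFinalHost` is a hypothetical helper for illustration, not something the site ships:

```typescript
// Hypothetical check: does a declared canonical match the URL the server
// actually hands back after all redirects? In our setup the apex 301s to
// www, so a canonical on the apex never matches.
function canonicalMatchesFinalHost(canonical: string, finalUrl: string): boolean {
  const a = new URL(canonical);
  const b = new URL(finalUrl);
  return a.protocol === b.protocol && a.host === b.host;
}

// The mismatch we shipped for months:
console.log(
  canonicalMatchesFinalHost(
    "https://proofco.ai/restoration/",
    "https://www.proofco.ai/restoration/",
  ),
); // false — Googlebot has to reconcile the two hosts itself
```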

For a small site this is noise. For a site with hundreds of vertical and locale pages — restoration city pages, comparison pages, learn guides — it is a steady, accumulating drag on consolidated PageRank.

The fix, in five layers

We treated the sitemap and canonical issue as the same problem viewed from two angles: the URLs we declare must be the URLs that resolve. The fix went in five coordinated changes.

1. Tombstone the static sitemap

The old apps/website/sitemap.xml is now a single-entry <sitemapindex> that points at the live, generated sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.proofco.ai/sitemap-index.xml</loc>
  </sitemap>
</sitemapindex>

We did not delete the file. Deleting it would have returned 404 for any cached Google reference to it. Tombstoning it preserves the URL, returns a valid sitemap-index response, and redirects all subsequent attention to the generated index. The 371 dead .html URLs vanish from Google’s view on the next crawl.

2. Expand sitemap-core.xml.ts to cover every real route

The generated sitemap had been missing entire route families. state-of-restoration was absent. The free tools index and each tool’s deep page were absent. Case studies, learn guides, comparison pages, and the dynamic blog enumeration were absent. We expanded the generator to enumerate every public route — pulling vertical and locale pages from getVerticals(), blog entries from getCollection("blog"), tools and case studies from their local data arrays — and emit each with a real <lastmod> from the source content.
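The core of the generator can be sketched like this. The `SitemapEntry` shape and `renderSitemap` helper are illustrative simplifications; the real sitemap-core.xml.ts pulls its entries from getVerticals(), getCollection("blog"), and the local data arrays:

```typescript
// Simplified sketch of the sitemap generator's core. The entry shape is
// assumed; the real generator derives entries from the same sources the
// routes are built from.
interface SitemapEntry {
  path: string;  // extensionless route, e.g. "/learn/drying-basics"
  lastmod: Date; // taken from the source content, not the build time
}

const BASE = "https://www.proofco.ai"; // same constant the canonicals use

function renderSitemap(entries: SitemapEntry[]): string {
  const urls = entries
    .map(
      (e) =>
        `  <url>\n` +
        `    <loc>${BASE}${e.path}</loc>\n` +
        `    <lastmod>${e.lastmod.toISOString().slice(0, 10)}</lastmod>\n` +
        `  </url>`,
    )
    .join("\n");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${urls}\n` +
    `</urlset>`
  );
}
```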

A sitemap is data, not a build artifact. Treating it as data — generated fresh on every deploy from the same sources the routes are built from — means it cannot drift. If a route exists, it is in the sitemap. If it is in the sitemap, it resolves.

3. Create the missing IA index pages

A second-order problem surfaced as we wrote the enumeration: several index pages we wanted to include did not exist. /tools/, /case-studies/, /learn/, and /compare/ were all conceptually present — the underlying detail pages existed and were linked from elsewhere — but the directory landing pages themselves had never been built. So we built them.

Each new index page is data-driven from the same source the sitemap pulls from. Each emits ItemList JSON-LD so Google can read the structure directly. Each carries the canonical for its own URL. None of them are vanity pages; they exist to give crawlers (and humans) a single entry point per content category.
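A minimal sketch of the ItemList JSON-LD each index page emits, with an assumed `IndexItem` shape; the real pages derive their items from the same shared data source as the sitemap:

```typescript
// Illustrative ItemList JSON-LD builder for a directory index page.
interface IndexItem {
  name: string;
  path: string; // extensionless, e.g. "/case-studies/hotel-flood"
}

function itemListJsonLd(items: IndexItem[]): string {
  return JSON.stringify({
    "@context": "https://schema.org",
    "@type": "ItemList",
    itemListElement: items.map((item, i) => ({
      "@type": "ListItem",
      position: i + 1, // ItemList positions are 1-based
      name: item.name,
      url: `https://www.proofco.ai${item.path}`,
    })),
  });
}
```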

4. Align canonicals to www.proofco.ai

Every canonical tag the site emits — from SEOPage.astro, from blog posts, from city pages — now points at the https://www.proofco.ai host. We updated the same constant in five generator files (sitemap-index.xml.ts, sitemap-core.xml.ts, sitemap-hubs.xml.ts, sitemap-tools.xml.ts, sitemap-verticals.xml.ts) and in robots.txt.ts. One BASE constant, propagated.

The redirect itself stays in place; the apex still 301s to www. But Google never has to follow it anymore, because we never declare the apex as the canonical. The redirect is a courtesy for users who type the apex directly, not a signal-routing exercise for crawlers.
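In sketch form, assuming a normalization policy the post doesn't spell out (the helper name is hypothetical):

```typescript
// One BASE constant, one canonical builder. The .html-stripping rule is an
// assumed defensive normalization, not something the post describes.
const BASE = "https://www.proofco.ai";

function canonicalFor(path: string): string {
  // Normalize to a leading slash and strip any stray .html suffix so the
  // declared URL is always one the server resolves without a redirect.
  let p = path.startsWith("/") ? path : `/${path}`;
  p = p.replace(/\.html$/, "");
  return `${BASE}${p}`;
}
```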

5. Ship a layered vercel.json cache policy

While we had the infrastructure files open, we fixed the cache policy. The old configuration was a single blanket header; in effect we leaned on Vercel's edge defaults for everything. The new one is layered:

  • /_astro/*: public, max-age=31536000, immutable. Astro fingerprints these.
  • Static images and fonts: public, max-age=31536000, no immutable (they get refreshed occasionally).
  • /api/*, sitemaps, robots.txt: no-store or s-maxage=60, stale-while-revalidate=300. Fresh.
  • HTML routes: s-maxage=60, stale-while-revalidate=300. Edge-cached briefly, served stale during revalidation.

Plus the security headers we should have had on day one: X-Content-Type-Options: nosniff, Referrer-Policy: strict-origin-when-cross-origin, Permissions-Policy denying camera/microphone/geolocation by default, and X-Frame-Options: SAMEORIGIN.
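A sketch of what the layered vercel.json looks like, with illustrative source matchers; the image/font and HTML layers follow the same pattern and are elided here:

```json
{
  "headers": [
    {
      "source": "/_astro/(.*)",
      "headers": [
        { "key": "Cache-Control", "value": "public, max-age=31536000, immutable" }
      ]
    },
    {
      "source": "/(sitemap.*\\.xml|robots\\.txt)",
      "headers": [
        { "key": "Cache-Control", "value": "s-maxage=60, stale-while-revalidate=300" }
      ]
    },
    {
      "source": "/(.*)",
      "headers": [
        { "key": "X-Content-Type-Options", "value": "nosniff" },
        { "key": "Referrer-Policy", "value": "strict-origin-when-cross-origin" },
        { "key": "Permissions-Policy", "value": "camera=(), microphone=(), geolocation=()" },
        { "key": "X-Frame-Options", "value": "SAMEORIGIN" }
      ]
    }
  ]
}
```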

Two lessons worth keeping

There are two principles in here that are easy to forget and expensive to relearn.

A sitemap is data, not a build artifact. The instant a sitemap is checked in as a static file, it begins decaying. The route layer keeps moving — extensions get dropped, slugs get renamed, the IA gets reorganized — and the file does not. Always generate sitemaps from the same source of truth the routes are built from, on every deploy, with no human in the loop. If a human has to remember to update the sitemap, the sitemap will be wrong.

Canonical URLs must match the resolved host. This one is mechanical. The <link rel="canonical"> you ship must be the URL the server will hand back at the end of all redirects. If your apex 301s to www, your canonical is www. If you serve over https only, your canonical is https. If you drop the trailing slash, your canonical drops the trailing slash. Every redirect between the canonical and the final URL is a small bill, and it compounds.
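That lockstep can also be enforced mechanically. A hypothetical deploy-time check, assuming the build exposes the set of routes it actually produced; the function name is illustrative:

```typescript
// Hypothetical CI check: every <loc> in the generated sitemap must be a
// route the build actually emitted. Returns the paths that would 404.
function deadSitemapPaths(sitemapXml: string, builtRoutes: Set<string>): string[] {
  const locs = [...sitemapXml.matchAll(/<loc>([^<]+)<\/loc>/g)].map(
    (m) => new URL(m[1]).pathname,
  );
  return locs.filter((path) => !builtRoutes.has(path));
}
```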

We will not catch every leak the first time. Two weeks from now Search Console will show us the next one, and we will write down that fix too. SEO infrastructure is a maintenance discipline, not a project. The point is to keep the URLs you declare and the URLs that resolve in lockstep, forever, on every deploy.
