How to Generate sitemap.xml for a Static Site

June 19, 2026 · 11 min read · How-to

Search engines discover pages by crawling links, but a sitemap.xml file accelerates indexing, especially for new domains without backlinks. For static sites, the sitemap is just another file in your repo root alongside index.html. Google and Bing both support the sitemaps.org protocol, which specifies URL entries, optional last-modified dates, and changefreq hints. This how-to shows how to generate an accurate sitemap for a hand-maintained HTML project and keep it updated as you add guides and landing pages.

Understand the required format

A valid sitemap is an XML file whose root <urlset> element declares the sitemaps.org namespace. Each public canonical URL gets one <url> block with at least a <loc>. Use absolute HTTPS URLs that match your link rel="canonical" tags, and never mix www and non-www hosts. Exclude pages you do not want indexed: staging hosts, thank-you pages with no unique value, or admin paths if any exist.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-06-19</lastmod>
  </url>
  <url>
    <loc>https://example.com/templates/</loc>
    <lastmod>2026-06-15</lastmod>
  </url>
</urlset>

Inventory every public URL

Walk your project tree and list HTML files served to anonymous visitors. Include hub pages (/guides/, /how-to/), detail articles, legal pages, and template catalog entries. Exclude 404.html, demo iframes you noindex, and duplicate language paths if hreflang handles alternates separately. Cross-check against header/footer navigation so orphan pages are either linked or deliberately omitted from the sitemap.
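The inventory step can be sketched in Python. This is a minimal example, not a complete crawler: it assumes every public page is a .html file under one root folder, and the names RobotsMetaParser, is_noindex, and inventory are hypothetical helpers introduced here for illustration. It separates pages that look indexable from pages that are skipped by name or carry a robots noindex meta tag.

```python
from html.parser import HTMLParser
from pathlib import Path


class RobotsMetaParser(HTMLParser):
    """Records whether a page has <meta name="robots" content="...noindex...">."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        is_robots = attr.get("name", "").lower() == "robots"
        if is_robots and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def is_noindex(html_text: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex


def inventory(root: Path, skip_names=frozenset({"404.html"})):
    """Return (indexable, excluded) lists of paths relative to root."""
    indexable, excluded = [], []
    for page in sorted(root.rglob("*.html")):
        rel = page.relative_to(root).as_posix()
        text = page.read_text(encoding="utf-8", errors="replace")
        if page.name in skip_names or is_noindex(text):
            excluded.append(rel)
        else:
            indexable.append(rel)
    return indexable, excluded
```

Comparing the indexable list against the links in your header and footer is still a manual step; the sketch only answers which files exist and which ones opt out of indexing.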

Generate with a small Python script

Manual XML works for ten URLs; automation prevents drift at fifty. A script scans for .html files, maps paths to URLs, and writes sitemap.xml at the repo root.

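# scripts/build_sitemap.py
from datetime import date
from pathlib import Path
from xml.sax.saxutils import escape

BASE = "https://example.com"
ROOT = Path(__file__).resolve().parent.parent
SKIP = {"404.html"}      # file names to leave out
SKIP_DIRS = {"demos"}    # folder names to leave out at any depth


def to_url(path: Path) -> str:
    """Map a file under ROOT to its canonical URL."""
    rel = path.relative_to(ROOT).as_posix()
    # index.html at any depth maps to the directory URL with a trailing slash.
    if rel == "index.html" or rel.endswith("/index.html"):
        rel = rel[: -len("index.html")]
    return f"{BASE}/{rel}"


def lastmod(path: Path) -> str:
    """File modification date; only meaningful if mtimes reflect real edits."""
    return date.fromtimestamp(path.stat().st_mtime).isoformat()


entries = []
for html in sorted(ROOT.rglob("*.html")):
    # Match folder names on the path relative to ROOT, not the absolute path.
    parent_dirs = html.relative_to(ROOT).parts[:-1]
    if html.name in SKIP or SKIP_DIRS.intersection(parent_dirs):
        continue
    entries.append((to_url(html), lastmod(html)))

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for url, modified in entries:
    lines.append("  <url>")
    lines.append(f"    <loc>{escape(url)}</loc>")  # escape &, < and > for XML
    lines.append(f"    <lastmod>{modified}</lastmod>")
    lines.append("  </url>")
lines.append("</urlset>")
(ROOT / "sitemap.xml").write_text("\n".join(lines) + "\n", encoding="utf-8")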

Customize SKIP and folder exclusions for your structure. Run the script before deploy or add it to a Makefile. Commit the generated sitemap.xml so Cloudflare Pages serves it without a dynamic build.

Reference the sitemap in robots.txt

robots.txt lives at the site root. Besides setting crawl rules, it can tell crawlers where to find your sitemap. Google's documentation recommends announcing the sitemap URL explicitly.

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Keep robots.txt permissive for marketing sites unless you block specific paths (for example, unfinished /drafts/). A disallow typo can hide your entire launch from search.
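One way to catch a disallow typo before it ships is to run the sitemap URLs through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser; the function names blocked_urls and declared_sitemaps are hypothetical, and the standard-library parser does not necessarily match every rule exactly as Googlebot does, so treat it as a smoke test rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser


def _parse(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def blocked_urls(robots_txt: str, urls, user_agent: str = "*"):
    """Return the URLs that robots.txt would stop user_agent from fetching."""
    parser = _parse(robots_txt)
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


def declared_sitemaps(robots_txt: str):
    """Return the Sitemap: lines found in robots.txt (requires Python 3.8+)."""
    return _parse(robots_txt).site_maps() or []
```

If blocked_urls returns anything for the URLs you list in sitemap.xml, either the rule or the sitemap entry is wrong.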

Submit to search consoles

After deploy, open Google Search Console and add your site as a property. A Domain property is verified with a DNS record; a URL-prefix property can also be verified with an HTML tag. Then submit the sitemap URL under Sitemaps. Repeat in Bing Webmaster Tools. The initial status may show "Couldn't fetch" until DNS propagates; retry later the same day. Monitor the Page indexing report (formerly Coverage) for excluded URLs that should be indexed; often they lack internal links or carry an accidental noindex meta tag.

Maintain lastmod honestly

Google's guidance is that lastmod is used only when it consistently reflects real content changes. Bumping every date daily without edits teaches crawlers to distrust the field. Tie lastmod to per-file git commit timestamps if you want precision, or update it manually when you materially revise an article. Do not bother with changefreq and priority tags; Google states that it ignores both.
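Reading lastmod from git could look like the sketch below. It assumes git is installed and the site lives in a repository with full history; git_lastmod and parse_commit_date are hypothetical helper names. The %cI format asks git log for the committer date in strict ISO 8601 form, and the first ten characters of that are the YYYY-MM-DD date.

```python
import subprocess
from pathlib import Path
from typing import Optional


def parse_commit_date(git_output: str) -> Optional[str]:
    """Reduce `git log --format=%cI` output to YYYY-MM-DD, or None if empty."""
    stamp = git_output.strip()
    return stamp[:10] if stamp else None


def git_lastmod(path: Path, repo_root: Path) -> Optional[str]:
    """Date of the last commit that touched path, or None if git has no record."""
    try:
        result = subprocess.run(
            ["git", "log", "-1", "--format=%cI", "--", str(path)],
            cwd=repo_root, capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return None  # git is not installed
    if result.returncode != 0:
        return None  # not a repository, or another git error
    return parse_commit_date(result.stdout)
```

One caveat: in a shallow clone (for example a CI checkout that fetches only the latest commit) every file reports the same date, so this works best when run locally or with full history fetched.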

All canonical HTTPS URLs included
No staging or noindex pages listed
robots.txt points to sitemap.xml
Submitted in Search Console after domain cutover
Regenerate sitemap when adding new HTML files

Scale beyond 50,000 URLs

The sitemap protocol limits each file to 50,000 URLs and 50 MB uncompressed. Large programmatic SEO projects split their URLs into multiple sitemaps and list those in a sitemap index file. Typical LaunchStatic marketing sites stay far below that threshold, so one file is enough.
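For completeness, here is a rough sketch of the split. It assumes child files named sitemap-1.xml, sitemap-2.xml, and so on, which is a hypothetical naming scheme rather than anything the protocol requires, and it enforces only the URL count, not the 50 MB size limit.

```python
from xml.sax.saxutils import escape

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit from the sitemap protocol


def chunk(urls, size=MAX_URLS):
    """Split the URL list into consecutive slices of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def render_urlset(urls):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', f'<urlset xmlns="{NS}">']
    lines += [f"  <url><loc>{escape(u)}</loc></url>" for u in urls]
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def render_index(base, count):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', f'<sitemapindex xmlns="{NS}">']
    for n in range(1, count + 1):
        child = f"{base}/sitemap-{n}.xml"  # hypothetical naming scheme
        lines.append(f"  <sitemap><loc>{escape(child)}</loc></sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


def build(urls, base, size=MAX_URLS):
    """Return {filename: xml_text}; a single file when everything fits."""
    parts = chunk(urls, size)
    if len(parts) <= 1:
        return {"sitemap.xml": render_urlset(urls)}
    files = {f"sitemap-{n}.xml": render_urlset(p) for n, p in enumerate(parts, start=1)}
    files["sitemap.xml"] = render_index(base, len(parts))
    return files
```

Keeping the index at /sitemap.xml means the URL in robots.txt and Search Console does not have to change when a site grows past one file.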

Handle trailing slashes and duplicates

List only canonical URLs. If /guides/ and /guides/index.html both resolve, pick the URL your canonical tags use, which is usually the directory form with a trailing slash on Cloudflare Pages. Do not include both in the sitemap; duplicate loc entries waste crawl budget and confuse reporting. When you migrate hosts, regenerate the sitemap in the same deploy in which you update redirects.
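A quick duplicate check on an existing sitemap.xml might look like this. It assumes the directory form with a trailing slash is your canonical choice, as described above; canonical_form and find_duplicates are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def canonical_form(url: str) -> str:
    """Collapse .../index.html to the directory URL with a trailing slash."""
    if url.endswith("/index.html"):
        return url[: -len("index.html")]
    return url


def find_duplicates(sitemap_xml: str):
    """Map each canonical URL to the <loc> values that collapse onto it.

    Only canonical URLs with more than one <loc> are returned.
    """
    groups = defaultdict(list)
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    for loc in root.findall("sm:url/sm:loc", NS):
        raw = (loc.text or "").strip()
        groups[canonical_form(raw)].append(raw)
    return {canon: raws for canon, raws in groups.items() if len(raws) > 1}
```

An empty result means no two entries point at the same page under this one rule; it does not detect www versus non-www or HTTP versus HTTPS variants.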

Automate in CI

Add python scripts/build_sitemap.py to your release checklist or GitHub Action before deploy. Fail the workflow if the script errors—missing BASE constant or empty urlset should block production. Commit the generated artifact so preview environments without Python still serve a correct sitemap copied from main.
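A CI guard for the failure cases above could be as small as the following sketch. The function names and the idea of passing the expected production host as an argument are assumptions made for illustration; adapt them to however your workflow is configured.

```python
import sys
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def check_sitemap(xml_bytes: bytes, expected_host: str):
    """Return a list of problems; an empty list means the file passes."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    locs = [(el.text or "").strip() for el in root.findall("sm:url/sm:loc", NS)]
    problems = []
    if not locs:
        problems.append("urlset is empty")
    for loc in locs:
        parts = urlsplit(loc)
        # Catches staging hostnames and plain-HTTP URLs left in the output.
        if parts.scheme != "https" or parts.hostname != expected_host:
            problems.append(f"unexpected URL: {loc}")
    return problems


def main(path: str, expected_host: str) -> int:
    """Print problems to stderr and return a process exit code."""
    with open(path, "rb") as fh:
        problems = check_sitemap(fh.read(), expected_host)
    for problem in problems:
        print(f"sitemap check failed: {problem}", file=sys.stderr)
    return 1 if problems else 0
```

Calling sys.exit(main("sitemap.xml", "example.com")) from a small wrapper script gives the workflow a non-zero exit code whenever a check fails, which is what blocks the deploy.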

Validate XML output

Malformed XML prevents parsing entirely. Escape ampersands as &amp; if query strings appear in URLs. Validate with xmllint --noout sitemap.xml on macOS/Linux or an online schema checker. A single unescaped character blocks the whole file, which is worse than omitting a minor page. After the first production deploy, fetch https://example.com/sitemap.xml in a browser and confirm that the response is XML with an XML Content-Type, not an HTML error page returned with a 200 status by a misconfigured SPA fallback (not an issue on pure static hosts).
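To see why escaping matters, the sketch below renders a loc element with and without escaping and checks each result with Python's standard XML parser. The helper names loc_element and is_well_formed are hypothetical, and the check covers well-formedness only, not conformance to the sitemap schema.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def loc_element(url: str) -> str:
    """Render one <loc> element with &, < and > escaped as XML entities."""
    return f"<loc>{escape(url)}</loc>"


def is_well_formed(xml_text: str) -> bool:
    """True if the text parses as XML; says nothing about schema validity."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError:
        return False
    return True


url = "https://example.com/search?q=a&page=2"
raw = f"<loc>{url}</loc>"   # unescaped ampersand: the parser rejects this
safe = loc_element(url)     # ampersand written as &amp;
```

Here is_well_formed(raw) is False and is_well_formed(safe) is True, which mirrors what a crawler's parser would do with the whole file.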

Resubmit the sitemap in Search Console after major URL structure changes. Adding an entire /how-to/ hub warrants a fresh submission even if the file path stayed /sitemap.xml.

Keep the base URL in a single constant (BASE in the script above) at the top of your generator script. Leaving a staging hostname in place of the production one is the most common sitemap bug we see in indie launches.

Add the sitemap URL to your README so collaborators know where discovery files live. Onboarding docs prevent duplicate manual sitemaps drifting out of sync with generated ones.

When you noindex a page with a robots meta tag, remove it from the sitemap in the same commit. Mixed signals slow crawlers down.

Archive old sitemap copies in git history if you need to audit indexing issues months after a redesign.

Do I need a sitemap for a one-page site?

Helpful but not mandatory. A single URL can still be indexed via links. A one-entry sitemap is cheap insurance at launch.

Should PDFs go in sitemap.xml?

Yes, if they are public marketing assets you want indexed. Use the same urlset format with the PDF URL.

Can I use a sitemap generator website?

You can, but a repo script stays in sync with git and avoids uploading your structure to third parties.

Why does Search Console show sitemap errors?

Common causes: HTTP instead of HTTPS, 404 on sitemap.xml, or URLs blocked by robots.txt/noindex.

Launch with SEO basics covered

Templates include sensible meta tags—add your sitemap and submit once you deploy.
