Catch a silent selector rename in scraped HTML
Scrape an HTML page into a structured object with transform, then drift the structured shape so a markup or selector rename becomes a loud contract error instead of a silently missing field.
Task
You scrape a page that has no API — a catalog provider returns HTML, and you
parse it into a structured object with a selector. The danger is silent: the
provider renames a CSS class or moves a cell, your selector quietly stops
matching, and the field comes back undefined. Nothing throws; downstream code
keeps running on corrupted data until someone notices a ranking is wrong. You
want that markup rename to surface the moment it happens, as a loud error on the
structured shape — not the raw HTML.
Example
The response pipeline is transform → unwrap → validation, so transform
reshapes the HTML into a structured object before drift ever runs — drift
then compares the parsed value, not the raw markup. List the path that must
not move under critical.
import { , } from 'stitchapi';
import { z } from 'zod';
// A trivial hand-rolled scraper: the `score` selector keys off `td.score`,
// exactly the class a markup rename silently breaks.
function scrape(body: unknown): { items: Record<string, unknown>[] } {
  const html = String(body);
  // Split on each row's opening tag and drop the markup before the first row.
  const rows = html.split(/<tr[^>]*class="row1"[^>]*>/i).slice(1);
  const items = rows.map((row) => {
    const item: Record<string, unknown> = {};
    const title = /<td[^>]*class="title"[^>]*>([^<]+)<\/td>/i.exec(row);
    const score = /<td[^>]*class="score"[^>]*>\s*(\d+)\s*<\/td>/i.exec(
      row,
    )?.[1];
    if (title) item['title'] = title[1]!.trim();
    if (score !== undefined) item['score'] = Number(score); // omitted when the selector misses
    return item;
  });
  return { items };
}
const = ({
: 'https://api.example.com',
: '/catalog',
transform: scrape, // HTML string -> { items: [...] }
unwrap: 'items',
output: drift(
.(.({ : .(), : .().() })),
{
// `[].score` is a path on the STRUCTURED shape, not the raw HTML.
critical: ['[].score'],
: 'catalog.contract.json',
},
),
});
const = await ();
How it works
The first call records the baseline into catalog.contract.json (check it into
the repo) from the transformed shape — [{ title, score }], not the HTML
body. Every call after parses the page and compares that structured value
against the snapshot.
When the provider renames the score cell's class (score → rank), the
selector stops matching and scrape drops score from each item. Because
[].score is critical, drift emits an error-level finding —
{ level: 'error', change: 'missing', path: '[].score' } — which breaks the
contract and throws STITCH_DRIFT. The path is the
proof: [].score exists only on the parsed object, so a finding on it can only
come from drift inspecting the transform output, not the raw markup. The
silent undefined is now a loud, immediate failure.
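The rename scenario above can be reproduced without any library. The following is a minimal sketch, not stitchapi's implementation: `missingCritical`, the `Finding` type, and the sample markup are hypothetical stand-ins that only mimic the shape of the finding quoted above.

```typescript
// Hypothetical stand-ins for illustration; none of these names come from stitchapi.
type Item = Record<string, unknown>;
type Finding = { level: 'error'; change: 'missing'; path: string };

function scrape(html: string): Item[] {
  return html
    .split(/<tr[^>]*class="row1"[^>]*>/i)
    .slice(1) // drop the markup before the first row
    .map((row) => {
      const item: Item = {};
      const title = /<td[^>]*class="title"[^>]*>([^<]+)<\/td>/i.exec(row);
      const score = /<td[^>]*class="score"[^>]*>\s*(\d+)\s*<\/td>/i.exec(row)?.[1];
      if (title) item['title'] = title[1]!.trim();
      if (score !== undefined) item['score'] = Number(score); // omitted on a miss
      return item;
    });
}

// Emit one finding when any parsed item lacks the given field.
function missingCritical(items: Item[], field: string): Finding[] {
  return items.some((item) => !(field in item))
    ? [{ level: 'error', change: 'missing', path: `[].${field}` }]
    : [];
}

const before =
  '<table><tr class="row1"><td class="title">Alpha</td><td class="score">7</td></tr></table>';
// Simulate the provider renaming the cell's class from `score` to `rank`.
const after = before.replace('class="score"', 'class="rank"');

const okFindings = missingCritical(scrape(before), 'score');
const renameFindings = missingCritical(scrape(after), 'score');

console.log(okFindings);
console.log(renameFindings);
```

Before the rename the check reports nothing. After it, the scraper still returns an item with a `title`, and only the check on the parsed shape reveals that `score` is gone.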
warn and info findings (a non-critical field changing, a new field
appearing) don't throw; they ride the event stream as drift events, so you
can watch a field erode before it breaks. Reserve critical for the few paths
whose loss corrupts your data.
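One way to picture the severity split is the sketch below. It is an assumption-laden illustration, not the library's logic: the `classify` helper, the `'changed'` and `'added'` change names, and the mapping of a changed field to warn and a new field to info are all guesses read off the paragraph above. Only `'missing'`, the three level names, and the `critical` list come from the text.

```typescript
// Hypothetical model of the severity rules; not stitchapi's code.
type Change = 'missing' | 'changed' | 'added'; // 'changed' and 'added' are assumed names
type Level = 'error' | 'warn' | 'info';

function classify(path: string, change: Change, critical: string[]): Level {
  if (critical.includes(path)) return 'error'; // a critical path moved
  return change === 'added' ? 'info' : 'warn'; // assumed split for everything else
}

const critical = ['[].score'];
const levels: Level[] = [
  classify('[].score', 'missing', critical),
  classify('[].title', 'changed', critical),
  classify('[].badge', 'added', critical),
];

// Only an error-level finding would break the contract in this model.
const shouldThrow = levels.includes('error');
console.log(levels, shouldThrow);
```

Keeping the `critical` list short keeps `shouldThrow` meaningful: the fewer paths that can raise an error, the less likely a real breakage is lost among false alarms.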
Ship it read-only in production
The first call with no committed snapshot writes one and reports nothing — a
convenience while you develop the contract, but a footgun in a deployed run,
where the first request would silently persist a baseline and "pass" instead of
detecting drift. Generate the baseline once, commit catalog.contract.json,
then ship with readonly: true so drift detects but never writes:
const = ({
: 'https://api.example.com',
: '/catalog',
transform: scrape,
unwrap: 'items',
output: drift(
.(.({ : .(), : .().() })),
{
critical: ['[].score'],
readonly: true, // detect-but-never-write
: 'error', // a missing baseline fails CI instead of slipping past
: 'catalog.contract.json',
},
),
});
Anti-pattern: don't drift the raw HTML body and skip transform.
Snapshotting the markup makes every cosmetic edit — a reordered attribute, a
whitespace change — look like drift, burying the one rename that matters.
Parse to the structured shape first, then drift the few critical fields
you actually depend on.
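The difference between the two approaches shows up in a small, library-free sketch. `scrapeScores` and the three markup samples are hypothetical; the aim is only to show that a cosmetic edit changes the raw string but not the parsed value, while a real rename changes the parsed value.

```typescript
// Hypothetical helper and sample markup, for illustration only.
function scrapeScores(html: string): number[] {
  const re = /<td[^>]*class="score"[^>]*>\s*(\d+)\s*<\/td>/gi;
  const scores: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) scores.push(Number(m[1]));
  return scores;
}

const v1 = '<tr class="row1"><td class="score" id="s1">7</td></tr>';
// Cosmetic edit: attributes reordered, whitespace added around the value.
const v2 = '<tr class="row1"><td id="s1" class="score"> 7 </td></tr>';
// Real rename: the class the selector depends on is gone.
const v3 = '<tr class="row1"><td id="s1" class="rank">7</td></tr>';

const rawChanged = v1 !== v2; // a raw-HTML comparison flags the cosmetic edit
const structuredSame =
  JSON.stringify(scrapeScores(v1)) === JSON.stringify(scrapeScores(v2));
const renameVisible = scrapeScores(v3).length === 0; // the parsed shape exposes the rename

console.log({ rawChanged, structuredSame, renameVisible });
```

Comparing the raw strings would flag `v2` as drift even though nothing your code depends on changed. Comparing the parsed values stays quiet for `v2` and changes only for `v3`.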
See also
Catch a breaking API change before your users do
Wrap output with drift to compare every response against a committed snapshot and flag dropped fields, type flips, and new keys by severity.
Watch retries and throttling as they happen
Iterate a stitch's event stream instead of awaiting it, and observe every retry, pause, and drift in real time.