local-first web evidence capture

Web evidence agents can cite.

BitterCrawl is the web intake valve for Bitter: fetch a page, extract clean markdown, preserve the source, and emit a crawl receipt every future run can audit. No account, API key, credit meter, or hosted toll booth in the default path.

01 / why

No toll booth in front of an agent's web sense.

Agents need to open docs, inspect competitors, read changelogs, verify pricing pages, compare examples, and preserve evidence. That should be as ordinary as reading a local file or checking Git.

Hosted scrape APIs are useful escalation paths. They should not be the default economics of Bitter's research loop. BitterCrawl starts with local fetch and extraction, then records what happened so a later run can reuse and audit the source.

for builders

Read primary sources without slowing down the loop.

Docs, READMEs, changelogs, pricing pages, and competitor copy become local artifacts agents can reuse.

for operators

Turn web claims into inspectable evidence.

Every important fetch leaves behind the URL, timestamp, hashes, and files needed to reconstruct the claim.

01

Local by default

Normal public pages should not require an account, API key, credit balance, or hosted dependency.

02

Receipt-bearing

The durable output is not just markdown. It is a proof path: URL, fetch, artifacts, policy, and hashes.

03

Provider-pluggable

Static fetch first. Crawl4AI, browser rendering, and hosted providers become explicit adapters.

02 / cli

A small command surface with a stable envelope.

The first wedge is deliberately narrow: one URL in, markdown and a receipt out. Human commands and agent method calls should converge on the same `crawl.read` primitive.
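The convergence idea can be sketched as one function both callers share; the `crawlRead` name, option names, and return shape below are assumptions for illustration, not the published primitive.

```typescript
// Hypothetical shape of the shared crawl.read primitive.
// Both the human CLI and agent method calls would normalize
// their arguments into this single call.
type ReadOptions = {
  json?: boolean;     // emit the agent envelope instead of markdown
  out?: string;       // where run artifacts are written
  provider?: string;  // adapter to use; local by default
};

async function crawlRead(url: string, opts: ReadOptions = {}) {
  return {
    url,
    provider: opts.provider ?? "local-static",
    out: opts.out ?? ".bitter/crawl/runs",
  };
}

// `bitter crawl example.com --json` and an agent call reduce to:
crawlRead("https://example.com", { json: true })
  .then((r) => console.log(r.provider)); // → "local-static"
```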

default read

bitter crawl example.com

Fetch a page, extract main content, write local artifacts, and print useful markdown.

agent envelope

bitter crawl example.com --json

Return `bitter.crawl.result.v0` for agents, scripts, and future method graph calls.
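A consumer can gate on the envelope before trusting it. The field names below (beyond the `schema` tag the text names) are assumptions about what a `bitter.crawl.result.v0` payload might carry, not the published spec.

```typescript
// Hypothetical envelope shape; only the schema tag comes from the doc.
interface CrawlResult {
  schema: "bitter.crawl.result.v0";
  url: string;         // requested URL
  finalUrl: string;    // after redirects
  status: number;      // HTTP status
  markdown: string;    // extracted main content
  receiptPath: string; // where receipt.json was written
}

// Narrowing guard so scripts fail fast on an unexpected payload.
function isCrawlResult(v: unknown): v is CrawlResult {
  const o = v as Record<string, unknown> | null;
  return !!o
    && o["schema"] === "bitter.crawl.result.v0"
    && typeof o["url"] === "string"
    && typeof o["markdown"] === "string";
}

const sample = {
  schema: "bitter.crawl.result.v0",
  url: "https://example.com",
  finalUrl: "https://example.com/",
  status: 200,
  markdown: "# Example Domain",
  receiptPath: ".bitter/crawl/runs/run-0001/receipt.json",
};
console.log(isCrawlResult(sample)); // → true
```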

explicit custody

bitter crawl docs.example.com --out .bitter/crawl/runs

Control where run directories, raw HTML, headers, markdown, and receipts are written.

adapter choice

bitter crawl app.example.com --provider browser

Escalate only when a page needs rendering, interaction, or a configured hosted provider.
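One way `--provider` could resolve to an adapter is a small registry where hosted and browser adapters are only ever registered explicitly; the interface and names below are illustrative, not the BitterCrawl API.

```typescript
// Hypothetical adapter contract behind --provider.
interface CrawlAdapter {
  name: string;
  fetch(url: string): Promise<{ html: string; status: number }>;
}

const adapters = new Map<string, CrawlAdapter>();

function register(a: CrawlAdapter): void {
  adapters.set(a.name, a);
}

// Static fetch registers first and stays the default path:
// no account, key, or hosted dependency.
register({
  name: "local-static",
  async fetch(url) {
    const res = await fetch(url); // native fetch
    return { html: await res.text(), status: res.status };
  },
});

// --provider browser (or a hosted name) resolves only if it was
// explicitly registered; escalation is never silent.
function resolve(provider = "local-static"): CrawlAdapter {
  const a = adapters.get(provider);
  if (!a) throw new Error(`unknown provider: ${provider}`);
  return a;
}

console.log(resolve().name); // → "local-static"
```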

03 / receipt

The receipt is the product charter.

A useful crawl should answer the questions a future agent will ask: what URL was read, when, by which adapter, what changed on redirect, what was saved, and which hashes identify the evidence.

schema

bitter.crawl.receipt.v0

fetch

requested URL, final URL, status, content type, elapsed time

artifacts

page.md, raw.html, headers.json, result.json, receipt.json

hashes

markdown_sha256 and html_sha256

policy

user agent, cache status, robots posture
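Assembled from those fields, a receipt might look like the fragment below. Key names and values are illustrative, not the published `bitter.crawl.receipt.v0` schema; the hashes are truncated placeholders.

```json
{
  "schema": "bitter.crawl.receipt.v0",
  "fetch": {
    "requested_url": "https://example.com",
    "final_url": "https://example.com/",
    "status": 200,
    "content_type": "text/html; charset=utf-8",
    "elapsed_ms": 412
  },
  "artifacts": ["page.md", "raw.html", "headers.json", "result.json", "receipt.json"],
  "hashes": {
    "markdown_sha256": "9f86d081...",
    "html_sha256": "2cf24dba..."
  },
  "policy": {
    "user_agent": "bitter-crawl/0.x",
    "cache": "miss",
    "robots": "allowed"
  }
}
```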

why this matters

Summaries are cheap to produce and easy to misremember. Receipts make a web claim reconstructable without trusting the agent's prose.

04 / path

Start boring. Add power where evidence demands it.

now

local-static

Native fetch, Readability, Turndown, markdown, raw artifacts, receipt.
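The hash step of that pipeline needs nothing beyond Node's standard library. A minimal sketch, assuming the receipt hashes are plain hex SHA-256 over the saved artifact bytes:

```typescript
import { createHash } from "node:crypto";

// Hash an artifact's content for the receipt. Identical bytes always
// produce identical hashes, so a later run can recompute the hash over
// page.md or raw.html and detect drift against receipt.json.
function sha256Hex(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Example: the value a receipt would record for this markdown.
const markdown = "# Example Domain\n";
console.log(sha256Hex(markdown));
```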

next

crawl4ai

Optional local adapter for richer extraction when users install it.

later

browser

Crawlee or Playwright for rendered pages, screenshots, and sessions.

explicit

hosted

Firecrawl, Jina, or Bitter-hosted workers as configured fallback paths.

05 / faq

The default is local. The escape hatches stay explicit.

Is BitterCrawl a hosted crawler?

Not first. The first version is a local-first CLI primitive that fetches a page, extracts markdown, writes artifacts, and emits a receipt. Hosted crawling belongs behind explicit provider flags.

Is this a Firecrawl replacement?

It is a different default. Firecrawl is a useful compatibility adapter and benchmark. BitterCrawl is the custody layer: evidence, cache discipline, source artifacts, and receipts for Bitter's agent loop.

What happens on JavaScript-heavy pages?

The local-static adapter covers normal documents first. Browser rendering, Crawl4AI, and hosted providers become explicit escalation paths when static fetch is not enough.

06 / first wedge

Coming first as a local command.

BitterCrawl Cloud can wait. The immediate product promise is simple: every agent can read a public page and leave a receipt.

$ bitter crawl example.com --json