Changelog
Release history for feedstock
v0.1.2
Change tracking and dogfood validation
- Change tracking — `ChangeTracker` detects new/changed/unchanged/removed pages between crawl runs using SHA-256 content hashing and LCS-based text diffing
- Snapshot management: `listSnapshots()`, `deleteSnapshot()`, `pruneOlderThan()`
- Text diffs with addition/deletion counts and grouped chunks
- Configurable: diff markdown vs HTML, max diff chunks, custom DB path
- 115 dogfood checks against real websites (example.com, Hacker News, Wikipedia)
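The two mechanisms above can be sketched in a few lines. This is an illustrative sketch, not `ChangeTracker`'s actual implementation: pages are compared by SHA-256 hash of their content, and changed pages get an LCS-based line diff (lines absent from the longest common subsequence count as additions or deletions). All names here are hypothetical.

```typescript
import { createHash } from "node:crypto";

type ChangeStatus = "new" | "changed" | "unchanged" | "removed";

function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Compare url -> hash maps from two crawl runs.
function classify(
  previous: Map<string, string>,
  current: Map<string, string>,
): Map<string, ChangeStatus> {
  const out = new Map<string, ChangeStatus>();
  for (const [url, hash] of current) {
    const old = previous.get(url);
    out.set(url, old === undefined ? "new" : old === hash ? "unchanged" : "changed");
  }
  for (const url of previous.keys()) {
    if (!current.has(url)) out.set(url, "removed");
  }
  return out;
}

// LCS-based line diff: lines outside the LCS are additions/deletions.
function lineDiff(a: string, b: string): { additions: number; deletions: number } {
  const A = a.split("\n");
  const B = b.split("\n");
  const dp = Array.from({ length: A.length + 1 }, () => new Array<number>(B.length + 1).fill(0));
  for (let i = 1; i <= A.length; i++)
    for (let j = 1; j <= B.length; j++)
      dp[i][j] = A[i - 1] === B[j - 1] ? dp[i - 1][j - 1] + 1 : Math.max(dp[i - 1][j], dp[i][j - 1]);
  const lcs = dp[A.length][B.length];
  return { additions: B.length - lcs, deletions: A.length - lcs };
}
```

Hash comparison answers "did it change?" cheaply; the O(n·m) line diff only needs to run on the pages classified as changed.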
v0.1.1
Agent-browser features and fetch-first engine system
Engine System
- Fetch-first architecture — `FetchEngine` (HTTP) tried before `PlaywrightEngine` (browser)
- Auto-escalation: detects SPA shells (React, Next.js, Nuxt) and switches to browser
- `EngineManager` with quality-scored engine selection and fallback chain
- `likelyNeedsJavaScript()` heuristic for SPA detection
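A heuristic in the spirit of `likelyNeedsJavaScript()` might look like this — a sketch, not the library's actual logic, with assumed marker patterns and thresholds: an empty SPA mount point, a framework bootstrap payload, or near-zero visible text relative to script markup all suggest the page needs a real browser.

```typescript
// Illustrative SPA-shell heuristic (patterns and thresholds are assumptions).
function likelyNeedsJavaScriptSketch(html: string): boolean {
  const markers = [
    /<div[^>]+id=["'](root|app|__next|__nuxt)["'][^>]*>\s*<\/div>/i, // empty SPA mount point
    /window\.__NEXT_DATA__/,                                         // Next.js bootstrap payload
    /<noscript>[^<]*enable JavaScript/i,                             // explicit JS requirement
  ];
  if (markers.some((m) => m.test(html))) return true;
  // Very little visible text alongside script tags also suggests a shell.
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<[^>]+>/g, "")
    .trim();
  return text.length < 200 && /<script/i.test(html);
}
```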
Accessibility Snapshots
- `buildStaticSnapshot()` — Cheerio-based semantic tree extraction (works with any engine)
- `takeSnapshot()` — CDP-based `Accessibility.getFullAXTree` for browser-precise trees
- Node categorization: interactive (button, link, input), content (heading, paragraph, img), structural (filtered)
- `@eref` system for deterministic element identification
- New `snapshot` field on `CrawlResult`, enabled via `config.snapshot = true`
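The categorization and `@eref` ideas can be sketched as follows. The tag sets and identifier format here are illustrative, not the library's exact lists: interactive and content nodes keep a deterministic `@eN` identifier, while structural wrappers are filtered out of the compact tree.

```typescript
type NodeCategory = "interactive" | "content" | "structural";

// Illustrative tag buckets (not the library's exact role lists).
const INTERACTIVE = new Set(["button", "a", "input", "select", "textarea"]);
const CONTENT = new Set(["h1", "h2", "h3", "h4", "h5", "h6", "p", "img", "li", "td"]);

function categorize(tag: string): NodeCategory {
  const t = tag.toLowerCase();
  if (INTERACTIVE.has(t)) return "interactive";
  if (CONTENT.has(t)) return "content";
  return "structural"; // divs, spans, wrappers — dropped from the snapshot
}

// Deterministic @eref-style identifiers: a single counter over kept nodes,
// in document order, so the same page always yields the same identifiers.
function assignErefs(tags: string[]): Array<{ tag: string; eref?: string }> {
  let n = 0;
  return tags.map((tag) =>
    categorize(tag) === "structural" ? { tag } : { tag, eref: `@e${++n}` },
  );
}
```

Determinism is the point: an agent can reference `@e2` across turns and get the same element, as long as the page hasn't changed.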
Rich Metadata (50+ fields)
- Full Open Graph (12 fields), Twitter Card (7), Dublin Core (7), Article (5)
- JSON-LD parsing, favicons, RSS/Atom feeds, alternate hreflang links
- Charset, viewport, theme-color, robots, referrer, generator
- Null values auto-stripped for cleaner output
Filter Denial Reasons
- `applyWithReason()` on every filter returns `{ allowed, reason, filter }`
- Specific reasons: "Domain X is blocked", "Matched exclude pattern: Y", "File extension .pdf is blocked"
- `FilterChain.getDenials()` and `getDenialsByFilter()` for aggregate tracking
- Fully backward compatible — `apply()` still returns a boolean
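A minimal sketch of this pattern, mirroring the `{ allowed, reason, filter }` shape from the changelog (the classes and method bodies are illustrative, not feedstock's actual code):

```typescript
interface FilterResult {
  allowed: boolean;
  reason?: string;
  filter?: string;
}

interface UrlFilter {
  name: string;
  applyWithReason(url: string): FilterResult;
}

// Example filter: denies URLs on blocked domains, with a specific reason.
class DomainBlockFilter implements UrlFilter {
  name = "DomainBlockFilter";
  constructor(private blocked: Set<string>) {}
  applyWithReason(url: string): FilterResult {
    const host = new URL(url).hostname;
    return this.blocked.has(host)
      ? { allowed: false, reason: `Domain ${host} is blocked`, filter: this.name }
      : { allowed: true };
  }
}

class FilterChainSketch {
  private denials: FilterResult[] = [];
  constructor(private filters: UrlFilter[]) {}
  // Backward-compatible boolean form; short-circuits on first denial.
  apply(url: string): boolean {
    for (const f of this.filters) {
      const r = f.applyWithReason(url);
      if (!r.allowed) {
        this.denials.push(r);
        return false;
      }
    }
    return true;
  }
  getDenials(): FilterResult[] {
    return this.denials;
  }
}
```

The value of the richer return shape is debuggability: after a deep crawl, aggregated denials tell you exactly which filter pruned which URLs.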
Browser Utilities
- Interactive element detection — `detectInteractiveElements()` finds all clickable elements via a single JS evaluation (cursor: pointer, onclick, tabindex, ARIA roles)
- Iframe content inlining — `extractIframeContent()` + `inlineIframeContent()`
- Storage state persistence — `saveStorageState()` / `loadStorageState()` for cookies + localStorage
AI-Friendly Errors
- `toFriendlyError()` converts 20+ error patterns (DNS, timeout, SSL, element interaction, browser crashes) into actionable messages
- `withFriendlyErrors()` wrapper for any async operation
- Auto-applied in the `crawler.crawl()` error handler
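The pattern-to-advice translation can be sketched like this — an illustrative subset, not the library's full 20+ mappings, and the function names carry a `Sketch` suffix to flag that they are assumptions:

```typescript
// Each entry maps a raw error pattern to an actionable message.
const ERROR_PATTERNS: Array<[RegExp, string]> = [
  [/ENOTFOUND|EAI_AGAIN/, "DNS lookup failed — check the hostname and your network."],
  [/ETIMEDOUT|Timeout \d+ms exceeded/, "The page took too long — raise the timeout or retry."],
  [/CERT_|SSL|certificate/i, "TLS/SSL problem — the site's certificate may be invalid."],
];

function toFriendlyErrorSketch(err: Error): Error {
  for (const [pattern, advice] of ERROR_PATTERNS) {
    if (pattern.test(err.message)) {
      return new Error(`${advice} (original: ${err.message})`);
    }
  }
  return err; // unknown patterns pass through unchanged
}

// Wrapper form, analogous to withFriendlyErrors():
async function withFriendlyErrorsSketch<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (e) {
    throw toFriendlyErrorSketch(e instanceof Error ? e : new Error(String(e)));
  }
}
```

Keeping the original message in parentheses matters for agents: the friendly text drives the next action, while the raw error stays available for logging.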
Testing
- 277 unit/integration tests (up from 191)
- 115 dogfood checks against real websites
- Battle tests: engine fallback, redirects, timeouts, 404s, cache modes, screenshots, custom JS, network capture
v0.1.0
Initial release — full-featured web crawler for TypeScript/Bun
Core Crawling
- `WebCrawler` with `crawl()`, `crawlMany()`, and `processHtml()` methods
- `BrowserConfig` and `CrawlerRunConfig` with typed defaults
- Concurrent crawling with a configurable concurrency limit
- Context manager pattern with `start()` / `close()` lifecycle
Engine System
- Fetch-first architecture: tries a lightweight HTTP fetch before launching a browser
- `FetchEngine` (quality 5) — simple HTTP, no browser overhead
- `PlaywrightEngine` (quality 50) — full Chromium/Firefox/WebKit
- Auto-escalation: detects SPA shells (React, Next.js, Nuxt) and switches to browser
- `EngineManager` with quality-scored engine selection
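The selection-with-escalation loop can be sketched as follows. The `EngineSketch` interface and function are assumptions for illustration, not `EngineManager`'s real API: engines are tried cheapest-first, and a success that still looks like an SPA shell triggers escalation to the next engine.

```typescript
interface EngineSketch {
  name: string;
  quality: number; // higher = more capable, more expensive
  fetch(url: string): Promise<{ html: string; ok: boolean }>;
}

async function crawlWithFallback(
  engines: EngineSketch[],
  url: string,
  needsEscalation: (html: string) => boolean,
): Promise<{ engine: string; html: string }> {
  const ordered = [...engines].sort((a, b) => a.quality - b.quality); // cheapest first
  let last: { engine: string; html: string } | undefined;
  let lastError: Error | undefined;
  for (const engine of ordered) {
    try {
      const result = await engine.fetch(url);
      if (result.ok) {
        last = { engine: engine.name, html: result.html };
        if (!needsEscalation(result.html)) return last; // good enough — stop here
      }
    } catch (e) {
      lastError = e instanceof Error ? e : new Error(String(e));
    }
  }
  if (last) return last; // best effort: most capable engine's result
  throw lastError ?? new Error(`All engines failed for ${url}`);
}
```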
Browser Backends
- Playwright — Chromium, Firefox, WebKit
- Lightpanda — local mode via `@lightpanda/browser`, cloud mode via CDP WebSocket
Content Processing
- HTML cleaning via Cheerio (strips scripts, styles, noise tags)
- Link extraction with internal/external classification
- Media extraction (images, videos, audio) with quality scoring
- Rich metadata extraction — 50+ fields: Open Graph, Twitter Cards, Dublin Core, JSON-LD, article tags, favicons, feeds, alternates
- Markdown generation via Turndown with citation support
- Accessibility tree snapshots — compact semantic page representation with `@eref`s
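The internal/external link classification above can be illustrated with the standard `URL` API; this function name is hypothetical, not the library's:

```typescript
function classifyLinks(
  baseUrl: string,
  hrefs: string[],
): { internal: string[]; external: string[] } {
  const base = new URL(baseUrl);
  const internal: string[] = [];
  const external: string[] = [];
  for (const href of hrefs) {
    try {
      const resolved = new URL(href, base); // resolves relative links against the page URL
      if (resolved.protocol !== "http:" && resolved.protocol !== "https:") continue; // skip mailto:, javascript:, etc.
      (resolved.hostname === base.hostname ? internal : external).push(resolved.href);
    } catch {
      // skip malformed hrefs
    }
  }
  return { internal, external };
}
```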
Extraction Strategies
- CSS selector extraction — map selectors to JSON fields
- Regex extraction — pattern matching with named capture groups
- XPath extraction — XPath-to-CSS conversion
- Table extraction — structured headers, rows, captions
Deep Crawling
- BFS (Breadth-First Search) — level-by-level with concurrent batching
- DFS (Depth-First Search) — single-path depth exploration
- BestFirst — score-based priority queue using composite scorers
- Streaming mode via `deepCrawlStream()` async generator
- `maxDepth`, `maxPages`, `concurrency` controls
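The BFS strategy above ("level-by-level with concurrent batching", bounded by `maxDepth`/`maxPages`) can be sketched with an injected page fetcher so the traversal logic stays self-contained; this is an illustrative sketch, not the library's implementation:

```typescript
async function bfsCrawl(
  start: string,
  fetchPage: (url: string) => Promise<string[]>, // returns a page's outgoing links
  opts: { maxDepth: number; maxPages: number },
): Promise<string[]> {
  const visited = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < opts.maxDepth && frontier.length > 0; depth++) {
    // Level-by-level: fetch the whole frontier concurrently.
    const results = await Promise.all(frontier.map(fetchPage));
    const next: string[] = [];
    for (const links of results) {
      for (const link of links) {
        if (visited.size >= opts.maxPages) break; // hard page cap
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return [...visited];
}
```

DFS swaps the frontier queue for a stack and loses the batch concurrency; BestFirst replaces it with a priority queue ordered by the composite scorers described under URL Scoring.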
URL Filtering
- `URLPatternFilter` — glob/regex include/exclude patterns
- `DomainFilter` — whitelist/blacklist domains
- `ContentTypeFilter` — extension-based filtering
- `MaxDepthFilter` — depth limit per URL
- `FilterChain` — composable, short-circuit evaluation
- Denial reasons — track why each URL was rejected with `getDenials()` / `getDenialsByFilter()`
URL Scoring
- `KeywordRelevanceScorer` — matches keywords in URL and anchor text
- `PathDepthScorer` — shallower paths score higher
- `FreshnessScorer` — URLs with recent dates score higher
- `DomainAuthorityScorer` — preferred domains score highest
- `CompositeScorer` — weighted averaging of multiple scorers
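Weighted composition is the interesting piece here. A sketch under assumed shapes (each scorer returns a value in [0, 1]; the leaf scorers below are simplified stand-ins for the ones listed above, not the library's code):

```typescript
type Scorer = (url: string) => number; // assumed contract: score in [0, 1]

// Shallower paths score higher (PathDepthScorer idea).
const pathDepthScorer: Scorer = (url) => {
  const depth = new URL(url).pathname.split("/").filter(Boolean).length;
  return 1 / (1 + depth);
};

// Fraction of keywords appearing in the URL (KeywordRelevanceScorer idea).
const keywordScorer = (keywords: string[]): Scorer => (url) => {
  if (keywords.length === 0) return 0;
  const hits = keywords.filter((k) => url.toLowerCase().includes(k.toLowerCase())).length;
  return hits / keywords.length;
};

// CompositeScorer idea: normalized weighted average of the parts.
function compositeScorer(parts: Array<{ scorer: Scorer; weight: number }>): Scorer {
  const total = parts.reduce((sum, p) => sum + p.weight, 0);
  return (url) => parts.reduce((sum, p) => sum + p.weight * p.scorer(url), 0) / total;
}
```

Normalizing by the weight sum keeps the composite score in [0, 1] regardless of how many scorers are combined, which makes scores comparable across configurations.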
Caching
- SQLite-based cache via `bun:sqlite` with WAL mode
- 5 cache modes: Enabled, Disabled, ReadOnly, WriteOnly, Bypass
- `CacheValidator` — HTTP HEAD requests with ETag/Last-Modified
- Batch insert via `setMany()` (atomic transactions)
Rate Limiting & Compliance
- Per-domain rate limiter with exponential backoff on 429/503
- Gradual recovery on success
- Configurable jitter, max delay, backoff/recovery factors
- Robots.txt parser — User-agent matching, Allow/Disallow, Crawl-delay, Sitemap discovery, wildcard patterns
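The backoff/recovery bookkeeping can be sketched as a small per-domain state machine; the class, its defaults, and the factor values are illustrative assumptions (and jitter is omitted for brevity):

```typescript
class DomainBackoffSketch {
  private delays = new Map<string, number>();
  constructor(
    private baseDelayMs = 1000,
    private maxDelayMs = 60_000,
    private backoffFactor = 2,   // multiply delay on 429/503
    private recoveryFactor = 0.5, // shrink delay back toward base on success
  ) {}

  currentDelay(domain: string): number {
    return this.delays.get(domain) ?? this.baseDelayMs;
  }

  record(domain: string, status: number): void {
    const current = this.currentDelay(domain);
    if (status === 429 || status === 503) {
      // Exponential backoff, capped at maxDelayMs.
      this.delays.set(domain, Math.min(current * this.backoffFactor, this.maxDelayMs));
    } else {
      // Gradual recovery: ease back toward the base delay, never below it.
      this.delays.set(domain, Math.max(current * this.recoveryFactor, this.baseDelayMs));
    }
  }
}
```

Gradual recovery (rather than an instant reset) avoids oscillating: one success after a run of 429s shouldn't immediately restore full crawl speed.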
Anti-Bot
- `isBlocked()` — detects Cloudflare challenges, CAPTCHAs, and 403/429/503 blocks
- `applyStealthMode()` — overrides navigator.webdriver, plugins, languages
- `simulateUser()` — random mouse movements and scrolling
- `withRetry()` — automatic retry with escalating delays
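Block detection along these lines combines status codes with well-known challenge markers in the body. The marker strings below are illustrative examples, not the library's actual list:

```typescript
function isBlockedSketch(status: number, html: string): boolean {
  if (status === 403 || status === 429 || status === 503) return true;
  const markers = [
    "cf-challenge",           // Cloudflare challenge page
    "Checking your browser",  // Cloudflare interstitial text
    "g-recaptcha",            // reCAPTCHA widget
    "h-captcha",              // hCaptcha widget
  ];
  return markers.some((m) => html.includes(m));
}
```

Checking the body as well as the status matters: challenge pages are often served with a 200.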
Content Filtering
- `PruningContentFilter` — rule-based boilerplate removal
- `BM25ContentFilter` — relevance-based filtering by query
Chunking
- `RegexChunking` — split by patterns (default: paragraphs)
- `SlidingWindowChunking` — word-count windows with overlap
- `FixedSizeChunking` — character-count chunks with overlap
- `IdentityChunking` — no splitting
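The sliding-window variant is the least obvious of the four, so here is a sketch of the idea (not the library's implementation): fixed word-count windows that advance by `windowSize - overlap`, so adjacent chunks share context.

```typescript
function slidingWindowChunks(text: string, windowSize: number, overlap: number): string[] {
  if (overlap >= windowSize) throw new Error("overlap must be smaller than windowSize");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = windowSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + windowSize).join(" "));
    if (start + windowSize >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

Overlap trades storage for recall: a sentence that straddles a chunk boundary still appears whole in at least one chunk.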
Browser Utilities
- Interactive element detection — finds all clickable elements including cursor:pointer, onclick, tabindex
- Iframe content inlining — extracts and merges iframe content into parent HTML
- Storage state persistence — save/load cookies and localStorage between sessions
- Hooks — onPageCreated, beforeGoto, afterGoto, onExecutionStarted, beforeReturnHtml
Infrastructure
- Proxy rotation — round-robin with health tracking and auto-recovery
- URL seeder — sitemap discovery via robots.txt chain
- Crawler monitor — real-time stats (pages/sec, success rates, data volume)
- AI-friendly errors — converts 20+ error patterns into actionable messages
- Logging — ConsoleLogger with level filtering, SilentLogger, pluggable Logger interface
Developer Experience
- Native TypeScript execution via Bun (no build step)
- 260 tests via `bun:test`
- Biome for linting and formatting
- GitHub Actions CI
- Apache-2.0 license