Crawling

Core crawling methods and the WebCrawler class.

The WebCrawler class is the main entry point for feedstock. It manages the browser lifecycle, caching, scraping, and extraction pipeline.

Creating a Crawler

import { WebCrawler } from "feedstock";

// Minimal — uses all defaults
const crawler = new WebCrawler();

// With options
const crawler = new WebCrawler({
  verbose: true,
  config: {
    browserType: "chromium",
    headless: true,
    viewport: { width: 1280, height: 720 },
  },
});

Lifecycle

The crawler should be started before use and closed when done:

await crawler.start();   // launches browser, opens cache
// ... crawl pages ...
await crawler.close();   // closes browser, closes cache

If you call crawl() without calling start() first, the crawler will auto-start.
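The auto-start behavior follows a common lazy-initialization pattern. A minimal sketch of that pattern (the `LazyCrawler` class and its `started` flag are illustrative, not feedstock internals):

```typescript
// Illustrative sketch of the auto-start guard pattern (not feedstock's actual code).
class LazyCrawler {
  private started = false;

  async start(): Promise<void> {
    if (this.started) return; // idempotent: calling start() twice is safe
    // ... launch browser, open cache ...
    this.started = true;
  }

  async crawl(url: string): Promise<string> {
    await this.start(); // auto-start on first crawl() call
    return `crawled ${url}`;
  }

  async close(): Promise<void> {
    // ... close browser, close cache ...
    this.started = false;
  }
}
```

Because `start()` is idempotent in this sketch, explicit and implicit startup compose safely.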

Single Page Crawl

const result = await crawler.crawl("https://example.com", {
  cacheMode: CacheMode.Bypass,
  waitFor: { kind: "selector", value: "#content" },
  screenshot: true,
});

The returned CrawlResult contains everything extracted from the page:

Field             Type                               Description
url               string                             The crawled URL
html              string                             Raw page HTML
cleanedHtml       string | null                      HTML with scripts and styles removed
markdown          MarkdownGenerationResult | null    Converted markdown
links             Links                              Internal and external links
media             Media                              Images, videos, and audio
metadata          Record<string, unknown> | null     Page metadata
extractedContent  string | null                      Structured extraction results
statusCode        number | null                      HTTP status code
screenshot        string | null                      Base64-encoded screenshot
pdf               Buffer | null                      PDF capture
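For illustration, a small consumer of a result shaped like the table above. The local `Result` interface is a trimmed stand-in for CrawlResult, and the `raw` field on the markdown object is an assumed shape, not confirmed from feedstock's type definitions:

```typescript
// Trimmed stand-in for CrawlResult, covering only the fields used below.
interface Result {
  url: string;
  statusCode: number | null;
  markdown: { raw: string } | null; // assumed shape of MarkdownGenerationResult
}

// Produce a one-line summary of a crawl result, tolerating null fields.
function summarize(r: Result): string {
  const status = r.statusCode ?? "unknown";
  const words = r.markdown ? r.markdown.raw.split(/\s+/).length : 0;
  return `${r.url} [${status}] ~${words} words`;
}
```

Note that most fields are nullable: screenshots, PDFs, and extraction output only appear when the corresponding options were requested.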

Multiple URLs

const results = await crawler.crawlMany(
  urls,
  { cacheMode: CacheMode.Bypass },
  { concurrency: 5 },
);

URLs are crawled concurrently with the specified concurrency limit (default 5).
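The concurrency-limited behavior can be sketched independently of feedstock with a small worker pool. `mapWithConcurrency` here is a hypothetical helper, not library API; it shows one way such a limit is typically enforced:

```typescript
// Hypothetical helper: run `fn` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  fn: (item: T) => Promise<R>,
  limit = 5,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index (safe: no await between read and increment)
      results[i] = await fn(items[i]);
    }
  }

  // Spawn up to `limit` workers that drain the shared index.
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

In this sketch, results come back in input order regardless of completion order, which is the behavior you would generally want from a batch crawl.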

Process Raw HTML

Skip the browser entirely and process HTML directly:

const result = await crawler.processHtml(htmlString, {
  generateMarkdown: true,
  extractionStrategy: { type: "css", params: schema },
});

WebCrawlerOptions

interface WebCrawlerOptions {
  config?: Partial<BrowserConfig>;
  crawlerStrategy?: CrawlerStrategy;
  scrapingStrategy?: ContentScrapingStrategy;
  markdownGenerator?: MarkdownGenerationStrategy;
  logger?: Logger;
  verbose?: boolean;
}

Every component is swappable via the constructor. Pass a custom CrawlerStrategy to change how pages are fetched, a custom ContentScrapingStrategy to change how HTML is cleaned, or a custom MarkdownGenerationStrategy to change markdown output.
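For example, a custom markdown generator might look like the sketch below. The shape of the MarkdownGenerationStrategy interface shown here is assumed for illustration; check feedstock's exported types for the real signature:

```typescript
// Assumed shape of the strategy interface (verify against feedstock's exports).
interface MarkdownGenerationStrategy {
  generate(html: string): Promise<string>;
}

// A toy strategy that emits only the page's first <h1> as a markdown heading.
class HeadingOnlyMarkdown implements MarkdownGenerationStrategy {
  async generate(html: string): Promise<string> {
    const match = html.match(/<h1[^>]*>(.*?)<\/h1>/i);
    return match ? `# ${match[1]}` : "";
  }
}

// Plugged into the constructor like:
// const crawler = new WebCrawler({ markdownGenerator: new HeadingOnlyMarkdown() });
```

Keeping each strategy behind a small interface is what makes the pipeline swappable: the crawler only depends on the interface, not on any particular implementation.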
