Build a Site Crawler

Deep crawl an entire site with filters, rate limiting, and robots.txt compliance.

This guide builds a production-ready site crawler that recursively discovers and crawls pages while respecting site policies.

The Setup

import {
  WebCrawler,
  CacheMode,
  FilterChain,
  DomainFilter,
  ContentTypeFilter,
  URLPatternFilter,
  RateLimiter,
  RobotsParser,
  CompositeScorer,
  KeywordRelevanceScorer,
  PathDepthScorer,
} from "feedstock";

Basic Deep Crawl

const crawler = new WebCrawler();

const results = await crawler.deepCrawl(
  "https://docs.example.com",
  { cacheMode: CacheMode.Enabled },
  { maxDepth: 3, maxPages: 200 },
);

console.log(`Crawled ${results.length} pages`);
for (const r of results) {
  console.log(`${r.statusCode} ${r.url}`);
}

await crawler.close();
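deepCrawl takes the start URL, per-page crawl options, and deep-crawl options (depth and page limits here). Each result carries the page URL and status code, plus a success flag (used again in the later examples), which makes it easy to report pages that failed:

const failed = results.filter((r) => !r.success);
if (failed.length > 0) {
  console.warn(`${failed.length} pages failed to crawl:`);
  for (const f of failed) {
    console.warn(`  ${f.statusCode} ${f.url}`);
  }
}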

Adding Filters

Restrict crawling to relevant pages:

const filterChain = new FilterChain()
  .add(new DomainFilter({ allowed: ["docs.example.com"] }))
  .add(new ContentTypeFilter())  // skip images, PDFs, etc.
  .add(new URLPatternFilter({
    exclude: [/\/api\//, /\/login/, /\/signup/, /\?/],
  }));
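To make the effect concrete, here is how a few example URLs would fare against this chain:

// https://docs.example.com/guides/getting-started  -> crawled
// https://blog.example.com/announcement            -> rejected by DomainFilter
// https://docs.example.com/assets/diagram.png      -> rejected by ContentTypeFilter
// https://docs.example.com/api/v2/users            -> rejected by the /\/api\// pattern
// https://docs.example.com/search?q=crawler        -> rejected by the /\?/ pattern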

Adding Rate Limiting and Robots.txt

Be a good citizen: throttle requests between fetches and honor each site's robots.txt rules:

const rateLimiter = new RateLimiter({ baseDelay: 500 });
const robotsParser = new RobotsParser("my-site-crawler");
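The baseDelay is presumably milliseconds between requests, and the string passed to RobotsParser is presumably the user agent the crawler identifies itself as when checking robots.txt rules. Conceptually, a fixed-delay rate limiter does nothing more than space out requests; a minimal sketch of the idea (not feedstock's actual implementation) might look like this:

class SimpleRateLimiter {
  private nextAllowed = 0;

  constructor(private readonly baseDelay: number) {}

  // Resolves once enough time has passed since the previously scheduled request.
  async wait(): Promise<void> {
    const now = Date.now();
    const delay = Math.max(0, this.nextAllowed - now);
    this.nextAllowed = Math.max(now, this.nextAllowed) + this.baseDelay;
    if (delay > 0) {
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}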

Prioritized Crawling

Use scorers to crawl the most relevant pages first:

const scorer = new CompositeScorer()
  .add(new KeywordRelevanceScorer(["guide", "tutorial", "api"], 2.0))
  .add(new PathDepthScorer(8, 1.0));
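The exact formulas are up to the library, but the intent is that URLs containing the given keywords score higher (weighted by 2.0 here) and, assuming PathDepthScorer's arguments are a preferred depth and a weight, URLs near that depth are favored. A toy illustration of the combined idea:

// Illustrative only: not feedstock's actual scoring logic.
function scoreUrl(url: string): number {
  const keywords = ["guide", "tutorial", "api"];
  const lower = url.toLowerCase();

  // Keyword relevance: fraction of keywords present in the URL, weighted by 2.0.
  const hits = keywords.filter((k) => lower.includes(k)).length;
  const keywordScore = (hits / keywords.length) * 2.0;

  // Path depth: highest near the preferred depth of 8, decaying with distance.
  const depth = new URL(url).pathname.split("/").filter(Boolean).length;
  const depthScore = (1 / (1 + Math.abs(depth - 8))) * 1.0;

  return keywordScore + depthScore;
}

Higher-scoring URLs are pulled from the crawl frontier first, so the page budget goes to the most relevant parts of the site.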

Putting It All Together

const crawler = new WebCrawler({ verbose: true });

const results = await crawler.deepCrawl(
  "https://docs.example.com",
  {
    cacheMode: CacheMode.Enabled,
    generateMarkdown: true,
    excludeTags: ["nav", "footer", "aside"],
  },
  {
    maxDepth: 3,
    maxPages: 500,
    concurrency: 3,
    filterChain,
    rateLimiter,
    robotsParser,
    scorer,
  },
);

// Save successful pages and count how many were written
let saved = 0;
for (const result of results) {
  if (result.success && result.markdown) {
    const filename = result.url.replace(/[^a-z0-9]/gi, "_") + ".md";
    await Bun.write(`output/${filename}`, result.markdown.rawMarkdown);
    saved++;
  }
}

console.log(`Saved ${saved} of ${results.length} pages to output/`);
await crawler.close();

Streaming for Large Sites

For sites with thousands of pages, use streaming to avoid holding everything in memory:

let count = 0;

for await (const result of crawler.deepCrawlStream(
  "https://docs.example.com",
  { cacheMode: CacheMode.Enabled },
  { maxDepth: 3, maxPages: 1000, filterChain, rateLimiter },
)) {
  count++;
  if (result.success) {
    // Process and discard immediately
    await processPage(result);
  }
  if (count % 50 === 0) {
    console.log(`Progress: ${count} pages crawled`);
  }
}
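processPage is whatever per-page handling you need; a hypothetical sketch that mirrors the earlier save loop might be:

// Hypothetical helper: persist a page to disk as soon as it arrives.
async function processPage(result: {
  url: string;
  markdown?: { rawMarkdown: string };
}): Promise<void> {
  if (!result.markdown) return;
  const filename = result.url.replace(/[^a-z0-9]/gi, "_") + ".md";
  await Bun.write(`output/${filename}`, result.markdown.rawMarkdown);
}

Because each page is written out and then released, memory stays flat no matter how large the site is.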