Deep Crawling

Recursively crawl entire sites with BFS, DFS, or BestFirst strategies.

Deep crawling follows links from a starting URL and recursively crawls discovered pages. Feedstock provides three traversal strategies.

Quick Start

const results = await crawler.deepCrawl(
  "https://example.com",
  { cacheMode: CacheMode.Bypass },
  { maxDepth: 2, maxPages: 50 },
);

for (const result of results) {
  console.log(`${result.url}: ${result.success}`);
}

Streaming

For large crawls, use streaming to process results as they arrive:

for await (const result of crawler.deepCrawlStream(
  "https://example.com",
  { cacheMode: CacheMode.Bypass },
  { maxDepth: 3, maxPages: 100 },
)) {
  console.log(`Crawled: ${result.url}`);
  // Process each result immediately
}
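Because the stream is an async iterable, you can stop a large crawl early with an ordinary `break`. The sketch below shows the pattern with a stand-in async generator (`fakeStream` is illustrative, not a feedstock API):

```typescript
// Pattern sketch: stopping a streamed crawl early. `fakeStream` stands in
// for crawler.deepCrawlStream; `break` ends iteration (and lets a
// well-behaved async generator run its cleanup logic).
async function* fakeStream(): AsyncGenerator<{ url: string; success: boolean }> {
  const urls = ["/a", "/b", "/c", "/d"];
  for (const url of urls) yield { url, success: true };
}

async function main(): Promise<void> {
  let count = 0;
  for await (const result of fakeStream()) {
    console.log(`Crawled: ${result.url}`);
    if (++count >= 2) break; // stop after the first two results
  }
}

main();
```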

Strategies

BFS (Breadth-First)

Default strategy. Explores all URLs at depth N before moving to depth N+1.

  • Best for: broad coverage, sitemap discovery
  • Processes pages level by level with concurrent batching
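The level-by-level ordering can be sketched over a toy in-memory link graph. This is a conceptual illustration of BFS traversal, not feedstock's internals; `graph` and `bfsOrder` are illustrative stand-ins:

```typescript
// Conceptual sketch of BFS ordering: every URL at depth N is visited
// before any URL at depth N+1.
const graph: Record<string, string[]> = {
  "/": ["/docs", "/blog"],
  "/docs": ["/docs/api"],
  "/blog": ["/blog/post-1"],
  "/docs/api": [],
  "/blog/post-1": [],
};

function bfsOrder(start: string, maxDepth: number): string[] {
  const visited = new Set<string>([start]);
  const order: string[] = [];
  let frontier = [start];
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      order.push(url); // process the whole level before descending
      for (const link of graph[url] ?? []) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link);
        }
      }
    }
    frontier = next; // the next level becomes the new frontier
  }
  return order;
}

console.log(bfsOrder("/", 2));
// ["/", "/docs", "/blog", "/docs/api", "/blog/post-1"]
```

In the real crawler, each level would be fetched as a concurrent batch rather than one URL at a time.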

DFS (Depth-First)

Follows a single path to max depth before backtracking.

  • Best for: deep section exploration, finding deeply nested content
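By contrast, DFS commits to one branch until it bottoms out. The sketch below illustrates that ordering over a toy link graph; `links` and `dfsOrder` are illustrative stand-ins, not feedstock APIs:

```typescript
// Conceptual sketch of DFS ordering: follow one path to maxDepth,
// then backtrack to the nearest unexplored sibling.
const links: Record<string, string[]> = {
  "/": ["/docs", "/blog"],
  "/docs": ["/docs/api"],
  "/docs/api": ["/docs/api/crawler"],
  "/docs/api/crawler": [],
  "/blog": [],
};

function dfsOrder(start: string, maxDepth: number): string[] {
  const visited = new Set<string>();
  const order: string[] = [];
  function visit(url: string, depth: number): void {
    if (depth > maxDepth || visited.has(url)) return;
    visited.add(url);
    order.push(url);
    for (const link of links[url] ?? []) visit(link, depth + 1); // go deep before siblings
  }
  visit(start, 0);
  return order;
}

console.log(dfsOrder("/", 3));
// ["/", "/docs", "/docs/api", "/docs/api/crawler", "/blog"]
```

Note that `/docs/api/crawler` (depth 3) is reached before `/blog` (depth 1), which is exactly what makes DFS suited to exploring one deeply nested section.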

BestFirst (Score-Based)

Prioritizes URLs by score using a CompositeScorer. Automatically selected when you provide a scorer in the config.

import { CompositeScorer, KeywordRelevanceScorer, PathDepthScorer } from "feedstock";

const scorer = new CompositeScorer()
  .add(new KeywordRelevanceScorer(["docs", "api"], 2.0))
  .add(new PathDepthScorer(10, 1.0));

const results = await crawler.deepCrawl(
  "https://example.com",
  {},
  { maxDepth: 3, maxPages: 50, scorer },
);
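The general idea behind the composite score is a weighted sum of individual signals, so that the frontier is always sorted with the most promising URL first. The sketch below shows that idea with hand-rolled scoring functions; `keywordScore`, `pathDepthScore`, and `compositeScore` are illustrative, not feedstock's internal implementation:

```typescript
// Illustrative sketch of weighted composite URL scoring, the idea
// behind BestFirst ordering. Weights mirror the example above:
// keyword relevance x2.0, path depth x1.0.
function keywordScore(url: string, keywords: string[]): number {
  const hits = keywords.filter((k) => url.toLowerCase().includes(k)).length;
  return keywords.length > 0 ? hits / keywords.length : 0; // 0..1
}

function pathDepthScore(url: string, optimalDepth: number): number {
  const depth = new URL(url).pathname.split("/").filter(Boolean).length;
  return 1 / (1 + Math.abs(depth - optimalDepth)); // 1 at the optimal depth
}

function compositeScore(url: string): number {
  return 2.0 * keywordScore(url, ["docs", "api"]) + 1.0 * pathDepthScore(url, 1);
}

const queue = ["https://example.com/blog/post", "https://example.com/docs/api"];
queue.sort((a, b) => compositeScore(b) - compositeScore(a)); // highest score first
console.log(queue[0]); // "https://example.com/docs/api"
```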

DeepCrawlConfig

interface DeepCrawlConfig {
  maxDepth: number;          // Max link-following depth (default: 3)
  maxPages: number;          // Max pages to crawl (default: 100)
  concurrency: number;       // Concurrent page fetches (default: 5)
  filterChain?: FilterChain; // URL filter chain
  scorer?: CompositeScorer;  // URL scorer (enables BestFirst)
  rateLimiter?: RateLimiter; // Per-domain rate limiting
  robotsParser?: RobotsParser; // Robots.txt compliance
  logger?: Logger;           // Logger instance
}

With Filters and Rate Limiting

import {
  FilterChain, DomainFilter, ContentTypeFilter,
  RateLimiter, RobotsParser,
} from "feedstock";

const results = await crawler.deepCrawl(
  "https://example.com",
  { cacheMode: CacheMode.Bypass },
  {
    maxDepth: 2,
    maxPages: 100,
    filterChain: new FilterChain()
      .add(new DomainFilter({ allowed: ["example.com"] }))
      .add(new ContentTypeFilter()),
    rateLimiter: new RateLimiter({ baseDelay: 500 }),
    robotsParser: new RobotsParser(),
  },
);