feedstock

Change Tracking

Detect new, changed, unchanged, and removed pages between crawl runs.

The ChangeTracker compares each crawl against the previous snapshot by hashing page content and comparing the hashes per URL. It stores snapshots in SQLite and generates text diffs for changed pages.

Quick Start

import { WebCrawler, ChangeTracker, CacheMode } from "feedstock";

const crawler = new WebCrawler();
const tracker = new ChangeTracker();

// First crawl
const results = await crawler.deepCrawl("https://example.com", {
  cacheMode: CacheMode.Bypass,
}, { maxDepth: 2, maxPages: 50 });

const report = tracker.compare(results);
console.log(report.summary);
// { total: 50, new: 50, changed: 0, unchanged: 0, removed: 0 }

// ... time passes, content changes ...

// Second crawl
const results2 = await crawler.deepCrawl("https://example.com", {
  cacheMode: CacheMode.Bypass,
}, { maxDepth: 2, maxPages: 50 });

const report2 = tracker.compare(results2);
console.log(report2.summary);
// { total: 53, new: 3, changed: 5, unchanged: 42, removed: 3 }

tracker.close();
await crawler.close();

Change Statuses

| Status | Meaning |
| --- | --- |
| new | URL exists now but not in the previous snapshot |
| changed | URL exists in both, but the content hash differs |
| unchanged | URL exists in both with identical content |
| removed | URL was in the previous snapshot but not the current one |
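
The four statuses can be illustrated with a minimal sketch of hash-based classification. This is not feedstock's implementation: the `PageLike` shape, the `classify` helper, and the FNV-1a-style hash are assumptions made for illustration, and the hash the library actually uses is not documented here.

```typescript
type Status = "new" | "changed" | "unchanged" | "removed";

// Hypothetical page shape for this sketch; feedstock's result type may differ.
interface PageLike {
  url: string;
  content: string;
}

// FNV-1a-style 32-bit hash over UTF-16 code units (a stand-in for the real hash).
function hashContent(content: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    h ^= content.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16);
}

// Classify every URL seen in either snapshot into one of the four statuses.
function classify(previous: PageLike[], current: PageLike[]): Map<string, Status> {
  const previousHashes = new Map<string, string>();
  for (const page of previous) {
    previousHashes.set(page.url, hashContent(page.content));
  }

  const statuses = new Map<string, Status>();
  for (const page of current) {
    const before = previousHashes.get(page.url);
    if (before === undefined) {
      statuses.set(page.url, "new");
    } else if (before !== hashContent(page.content)) {
      statuses.set(page.url, "changed");
    } else {
      statuses.set(page.url, "unchanged");
    }
    previousHashes.delete(page.url);
  }

  // Anything left was present in the previous snapshot only.
  for (const url of previousHashes.keys()) {
    statuses.set(url, "removed");
  }
  return statuses;
}
```

Comparing hashes rather than full content keeps the comparison cheap; the full text is only needed when a diff is generated for a page whose hash differs.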

Change Report

interface ChangeReport {
  snapshotId: string;
  previousSnapshotId: string | null;
  timestamp: number;
  summary: {
    total: number;
    new: number;
    changed: number;
    unchanged: number;
    removed: number;
  };
  changes: PageChange[];
}
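
To show how `summary` relates to `changes`, here is a minimal sketch that tallies the counts from a list of changes. The `PageChangeLike` type and the `summarize` helper are hypothetical, and the sketch assumes `total` counts every entry in `changes`, which the source does not state explicitly.

```typescript
type ChangeStatus = "new" | "changed" | "unchanged" | "removed";

// Minimal stand-in for PageChange; only the fields this sketch needs.
interface PageChangeLike {
  url: string;
  status: ChangeStatus;
}

type Summary = { total: number } & Record<ChangeStatus, number>;

// Count entries per status (assumption: total is the number of entries).
function summarize(changes: PageChangeLike[]): Summary {
  const summary: Summary = {
    total: changes.length,
    new: 0,
    changed: 0,
    unchanged: 0,
    removed: 0,
  };
  for (const change of changes) {
    summary[change.status]++;
  }
  return summary;
}
```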

Working with Changes

// Filter by status
const newPages = report.changes.filter(c => c.status === "new");
const changed = report.changes.filter(c => c.status === "changed");
const removed = report.changes.filter(c => c.status === "removed");

// Inspect a change
for (const change of changed) {
  console.log(`${change.url} changed`);
  console.log(`  Title: "${change.previousTitle}" → "${change.currentTitle}"`);
  
  if (change.diff) {
    console.log(`  +${change.diff.additions} -${change.diff.deletions} lines`);
    for (const chunk of change.diff.chunks) {
      const prefix = chunk.type === "add" ? "+" : chunk.type === "remove" ? "-" : " ";
      for (const line of chunk.lines) {
        console.log(`  ${prefix} ${line}`);
      }
    }
  }
}
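
When a report contains many changed pages, it can help to look at the largest edits first. The sketch below ranks changed pages by the number of added plus deleted lines. `rankByChurn` and the `ChangeLike` type are hypothetical helpers, not part of feedstock; they rely only on the `status` and `diff` fields used in the example above.

```typescript
// Minimal stand-in for PageChange; only the fields this sketch needs.
interface ChangeLike {
  url: string;
  status: string;
  diff?: { additions: number; deletions: number } | null;
}

// Return the changed pages with the most added + deleted lines, largest first.
function rankByChurn<T extends ChangeLike>(changes: T[], limit = 10): T[] {
  const churn = (c: ChangeLike): number =>
    c.diff ? c.diff.additions + c.diff.deletions : 0;
  return changes
    .filter((c) => c.status === "changed") // filter() copies, so the input is not mutated
    .sort((x, y) => churn(y) - churn(x))
    .slice(0, limit);
}
```

Pages without a diff (for example when diffs are disabled) count as zero churn and sort last.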

Text Diffs

Changed pages include a line-by-line diff:

interface TextDiff {
  additions: number;    // lines added
  deletions: number;    // lines removed
  chunks: DiffChunk[];  // grouped changes
}

interface DiffChunk {
  type: "add" | "remove" | "context";
  lines: string[];
}

By default, diffs are computed on markdown content. Set diffMarkdown: false to diff cleaned HTML instead.
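
For intuition about how a line-by-line diff maps onto `TextDiff`, here is a minimal sketch based on a longest-common-subsequence table. It is an illustration only: the algorithm feedstock uses is not specified here, and `diffLines` is a hypothetical function.

```typescript
interface DiffChunk {
  type: "add" | "remove" | "context";
  lines: string[];
}

interface TextDiff {
  additions: number;
  deletions: number;
  chunks: DiffChunk[];
}

// Line diff via a longest-common-subsequence (LCS) table.
function diffLines(before: string, after: string): TextDiff {
  const a = before.split("\n");
  const b = after.split("\n");

  // lcs[i][j] = LCS length of a[i..] and b[j..]
  const lcs: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  const diff: TextDiff = { additions: 0, deletions: 0, chunks: [] };

  // Append a line, merging it into the last chunk when the type matches.
  const push = (type: DiffChunk["type"], line: string): void => {
    const last = diff.chunks[diff.chunks.length - 1];
    if (last && last.type === type) last.lines.push(line);
    else diff.chunks.push({ type, lines: [line] });
    if (type === "add") diff.additions++;
    if (type === "remove") diff.deletions++;
  };

  // Walk both inputs, following the LCS table to decide each step.
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      push("context", a[i]);
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      push("remove", a[i]);
      i++;
    } else {
      push("add", b[j]);
      j++;
    }
  }
  while (i < a.length) push("remove", a[i++]);
  while (j < b.length) push("add", b[j++]);
  return diff;
}
```

The table costs time and memory proportional to the product of the two line counts, which is fine for a sketch but is one reason production diff tools use more economical algorithms.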

Configuration

const tracker = new ChangeTracker({
  dbPath: "/path/to/changes.db",  // default: ~/.feedstock/changes.db
  config: {
    includeDiffs: true,    // generate text diffs (default: true)
    diffMarkdown: true,    // true: diff markdown; false: diff cleaned HTML (default: true)
    maxDiffChunks: 50,     // limit diff output (default: 50)
  },
});

Snapshot Management

// List all snapshots
const snapshots = tracker.listSnapshots();
// [{ id: "snap_1234", pageCount: 50, createdAt: 1712534400000 }]

// Delete a specific snapshot
tracker.deleteSnapshot("snap_1234");

// Prune snapshots older than 7 days
const removed = tracker.pruneOlderThan(7 * 24 * 60 * 60 * 1000);
console.log(`Removed ${removed} old entries`);
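
The argument to `pruneOlderThan` is a maximum age in milliseconds. To preview which snapshots fall outside that window before pruning, a sketch like the following could be applied to the output of `listSnapshots()`. The `olderThan` helper is hypothetical, the `SnapshotInfo` shape is inferred from the example output above, and the exact boundary behavior of `pruneOlderThan` is not documented here.

```typescript
// Shape inferred from the listSnapshots() example output; treat as an assumption.
interface SnapshotInfo {
  id: string;
  pageCount: number;
  createdAt: number; // Unix time in milliseconds
}

// Return snapshots created more than maxAgeMs before `now`.
function olderThan(
  snapshots: SnapshotInfo[],
  maxAgeMs: number,
  now: number = Date.now(),
): SnapshotInfo[] {
  const cutoff = now - maxAgeMs;
  return snapshots.filter((s) => s.createdAt < cutoff);
}
```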

Custom Snapshot IDs

// Use custom IDs for meaningful tracking
tracker.compare(results, "prod-2024-04-07");
tracker.compare(results, "prod-2024-04-08");

If no ID is provided, the snapshot ID defaults to snap_{timestamp}.
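
Date-based IDs like the ones above can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical `snapshotIdFor` helper and UTC dates:

```typescript
// Build an ID such as "prod-2024-04-07" from a prefix and a date (UTC).
function snapshotIdFor(prefix: string, date: Date = new Date()): string {
  // toISOString() yields "YYYY-MM-DDTHH:mm:ss.sssZ"; keep the date part only.
  return `${prefix}-${date.toISOString().slice(0, 10)}`;
}
```

The result can be passed as the second argument to `tracker.compare`. Because the date is taken in UTC, runs near midnight local time may land on a different calendar day than expected.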
