feedstock

Table Extraction

Extract HTML tables into structured data.

The TableExtractionStrategy parses each HTML table into a structured object with headers, rows, and a caption. Each extracted item carries that object as a JSON string in its content field.

Usage

import { TableExtractionStrategy } from "feedstock";

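// Placeholder inputs for illustration: the page URL and its raw HTML.
const url = "https://example.com/data";
const html = "<table><caption>User Data</caption><tr><th>Name</th></tr><tr><td>Alice</td></tr></table>";

const strategy = new TableExtractionStrategy();
// Each returned item holds one table, serialized as JSON in `content`.
const tables = await strategy.extract(url, html);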

for (const table of tables) {
  const data = JSON.parse(table.content);
  console.log("Headers:", data.headers);
  console.log("Rows:", data.rows);
  console.log("Caption:", data.caption);
}
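
Because content is a JSON string, JSON.parse can throw or return an unexpected shape if the input is malformed. The sketch below shows one way to guard against that. The ParsedTable interface and the parseTableContent helper are hypothetical names introduced here for illustration; they are not part of feedstock, and the shape is assumed from the Output Format section.

```typescript
// Assumed shape of a parsed table (not an official feedstock type).
interface ParsedTable {
  headers: string[];
  rows: string[][];
  caption?: string;
}

// True when the value is an array containing only strings.
function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Hypothetical helper: returns null instead of throwing on bad input.
function parseTableContent(content: string): ParsedTable | null {
  let data: unknown;
  try {
    data = JSON.parse(content);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;

  const { headers, rows, caption } = data as Record<string, unknown>;
  if (!isStringArray(headers)) return null;
  if (!Array.isArray(rows) || !rows.every(isStringArray)) return null;

  return {
    headers,
    rows: rows as string[][],
    // Only include the caption when it is actually a string.
    ...(typeof caption === "string" ? { caption } : {}),
  };
}
```

In the loop above, `parseTableContent(table.content)` could stand in for the bare `JSON.parse` call, with a null check before reading the fields.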

Output Format

{
  headers: ["Name", "Age", "City"],
  rows: [
    ["Alice", "30", "New York"],
    ["Bob", "25", "San Francisco"],
  ],
  caption: "User Data",     // from <caption> element
  rowCount: 2,
  columnCount: 3,
}
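
Since headers and rows are parallel arrays, it is often convenient to zip them into one object per row. The following is a minimal sketch under the assumption that the parsed object matches the format shown above. The TableOutput interface and rowsToRecords function are illustrative names, not feedstock exports, and treating caption as optional is an assumption.

```typescript
// Assumed shape, mirroring the Output Format example (not an official type).
interface TableOutput {
  headers: string[];
  rows: string[][];
  caption?: string;
  rowCount: number;
  columnCount: number;
}

// Hypothetical helper: turn each row into an object keyed by header name.
function rowsToRecords(table: TableOutput): Record<string, string>[] {
  return table.rows.map((row) => {
    const record: Record<string, string> = {};
    table.headers.forEach((header, i) => {
      // Missing cells become empty strings so every record has the same keys.
      record[header] = row[i] ?? "";
    });
    return record;
  });
}

const example: TableOutput = {
  headers: ["Name", "Age", "City"],
  rows: [
    ["Alice", "30", "New York"],
    ["Bob", "25", "San Francisco"],
  ],
  caption: "User Data",
  rowCount: 2,
  columnCount: 3,
};

const records = rowsToRecords(example);
console.log(records[0].Name); // "Alice"
```

Note that duplicate header names would overwrite each other in this sketch; a table with repeated column titles would need a different key scheme.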

Options

new TableExtractionStrategy({
  minRows: 2,           // skip tables with fewer rows (default: 1)
  includeCaption: true,  // extract <caption> text (default: true)
})

With Crawler

const result = await crawler.crawl("https://example.com/data", {
  extractionStrategy: {
    type: "css",  // or use TableExtractionStrategy directly via processHtml
    params: { ... },
  },
});

For direct table extraction, run the strategy on the crawl result's cleaned HTML:

const strategy = new TableExtractionStrategy({ minRows: 2 });
const tables = await strategy.extract(result.url, result.cleanedHtml!);
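
Once tables have been extracted from a crawl result, one possible next step is writing them out as CSV. This sketch assumes each parsed table exposes headers and rows as shown in Output Format. The TableData interface and the tableToCsv and escapeCsvField helpers are hypothetical and are not provided by feedstock.

```typescript
// Assumed subset of the parsed table shape (see Output Format).
interface TableData {
  headers: string[];
  rows: string[][];
}

// Quote a field when it contains a comma, double quote, or line break,
// doubling any embedded quotes (RFC 4180 style).
function escapeCsvField(value: string): string {
  if (/[",\r\n]/.test(value)) {
    return `"${value.replace(/"/g, '""')}"`;
  }
  return value;
}

// Hypothetical helper: serialize one table as CSV text, header row first.
function tableToCsv(table: TableData): string {
  const lines = [table.headers, ...table.rows].map((row) =>
    row.map(escapeCsvField).join(","),
  );
  return lines.join("\r\n");
}

const csv = tableToCsv({
  headers: ["Name", "City"],
  rows: [
    ["Alice", "New York"],
    ["Bob", "San Francisco, CA"],
  ],
});
console.log(csv);
```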