feedstock

CSS Extraction

Extract structured data from HTML using CSS selectors.

The CssExtractionStrategy maps CSS selectors to JSON fields, letting you extract structured data from any page with consistent markup.

Schema Definition

interface CssExtractionSchema {
  name: string;          // Schema name (for identification)
  baseSelector: string;  // CSS selector for repeating elements
  fields: CssField[];    // Fields to extract from each element
}

interface CssField {
  name: string;                              // Output field name
  selector: string;                          // CSS selector within base element
  type: "text" | "attribute" | "html" | "list"; // Extraction type
  attribute?: string;                        // For "attribute" type (default: "href")
}

Field Types

| Type | Description | Example |
| --- | --- | --- |
| text | Inner text content | "Widget A" |
| attribute | HTML attribute value | "/products/widget-a" |
| html | Inner HTML | "<strong>Bold</strong> text" |
| list | Array of text from all matches | ["tag1", "tag2"] |
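
To make the four types concrete, here is a minimal sketch of how they might differ when applied to the elements a field's selector matched. It is not feedstock's implementation: `MatchedElement` and `readField` are hypothetical names, and the choices to read the first match for single-value types, trim text, and return null when nothing matches are assumptions made for illustration.

```typescript
// Hypothetical stand-in for a matched DOM element; not part of feedstock.
interface MatchedElement {
  text: string;                   // inner text content
  html: string;                   // inner HTML
  attrs: Record<string, string>;  // HTML attributes
}

type FieldType = "text" | "attribute" | "html" | "list";

// Assumed semantics: "list" collects every match, the other types read the first match.
function readField(
  matches: MatchedElement[],
  type: FieldType,
  attribute: string = "href", // default taken from the CssField comment
): string | string[] | null {
  if (type === "list") {
    return matches.map((m) => m.text.trim());
  }
  const first = matches[0];
  if (first === undefined) return null; // assumption: no match yields null
  if (type === "text") return first.text.trim();
  if (type === "html") return first.html;
  return first.attrs[attribute] ?? null; // remaining case: "attribute"
}

const tags: MatchedElement[] = [
  { text: " tag1 ", html: "tag1", attrs: {} },
  { text: "tag2", html: "<em>tag2</em>", attrs: { href: "/tags/2" } },
];

console.log(readField(tags, "text"));      // "tag1"
console.log(readField(tags, "list"));      // ["tag1", "tag2"]
console.log(readField(tags, "attribute")); // null, the first match has no href
```

The practical difference is that "list" is the only type that returns an array, so it is the one to reach for when a selector such as ".tag" is expected to match several elements inside each base element.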

Example: Product Scraping

const result = await crawler.crawl("https://store.example.com", {
  extractionStrategy: {
    type: "css",
    params: {
      name: "products",
      baseSelector: ".product-card",
      fields: [
        { name: "title", selector: ".product-title", type: "text" },
        { name: "price", selector: ".price", type: "text" },
        { name: "url", selector: "a.product-link", type: "attribute", attribute: "href" },
        { name: "image", selector: "img", type: "attribute", attribute: "src" },
        { name: "tags", selector: ".tag", type: "list" },
        { name: "description", selector: ".desc", type: "html" },
      ],
    },
  },
});

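// extractedContent is a JSON array of items; each item's content is itself a JSON string.
const products = JSON.parse(result.extractedContent!).map(
  (item: { content: string }) => JSON.parse(item.content)
);
// [{ title: "Widget A", price: "$9.99", url: "/widget-a", tags: ["sale", "new"], ... }, ...]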
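
Because "text" and "attribute" fields come back as raw strings, a post-processing step is often useful. The sketch below is one possible approach, not part of feedstock: `RawProduct` and `normalizeProduct` are hypothetical names, and it assumes prices use a dot as the decimal separator and that relative links should be resolved against the crawled page's URL.

```typescript
// Subset of one parsed product, matching field names from the schema above.
interface RawProduct {
  title: string;
  price: string; // e.g. "$9.99"
  url: string;   // may be relative, e.g. "/widget-a"
}

// Hypothetical helper: strips non-numeric characters from the price and
// resolves the link against the page it was extracted from.
function normalizeProduct(raw: RawProduct, pageUrl: string) {
  return {
    title: raw.title,
    price: parseFloat(raw.price.replace(/[^0-9.]/g, "")),
    url: new URL(raw.url, pageUrl).href,
  };
}

const normalized = normalizeProduct(
  { title: "Widget A", price: "$9.99", url: "/widget-a" },
  "https://store.example.com",
);

console.log(normalized);
// { title: "Widget A", price: 9.99, url: "https://store.example.com/widget-a" }
```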

Direct Usage

import { CssExtractionStrategy } from "feedstock";

const strategy = new CssExtractionStrategy({
  name: "articles",
  baseSelector: "article",
  fields: [
    { name: "headline", selector: "h2", type: "text" },
    { name: "body", selector: ".content", type: "html" },
  ],
});

// `url` and `html` are assumed to be defined already: the page URL and its HTML source.
const items = await strategy.extract(url, html);

Each extracted item includes both content (JSON string) and metadata (parsed object) for convenience.
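
The relationship between the two properties can be illustrated with a hand-built item. The `ExtractedItemSketch` type below is an assumption based only on the sentence above; the real item type may carry additional fields.

```typescript
// Assumed item shape; hypothetical, not feedstock's exported type.
interface ExtractedItemSketch {
  content: string;                   // JSON string
  metadata: Record<string, unknown>; // the same data, already parsed
}

// Hand-built item standing in for one element of `items`.
const item: ExtractedItemSketch = {
  content: '{"headline":"Hello","body":"<p>World</p>"}',
  metadata: { headline: "Hello", body: "<p>World</p>" },
};

// Both routes yield the same record; reading metadata skips the parse step.
const fromContent = JSON.parse(item.content) as Record<string, unknown>;
const fromMetadata = item.metadata;

console.log(fromContent.headline === fromMetadata.headline); // true
```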
