feedstock

Filter Denial Reasons

Filters now record why each URL was rejected during deep crawling, not just that it was rejected. Use the recorded reasons to debug crawl coverage gaps and to see how each filter behaves.

Getting Denial Reasons

Per-Filter

const filter = new DomainFilter({ allowed: ["example.com"] });
const result = await filter.applyWithReason("https://other.com/page");

console.log(result);
// {
//   allowed: false,
//   reason: 'Domain "other.com" is not in allowed list',
//   filter: "domain"
// }

From FilterChain

const chain = new FilterChain()
  .add(new DomainFilter({ allowed: ["example.com"] }))
  .add(new URLPatternFilter({ exclude: [/\/admin/] }))
  .add(new ContentTypeFilter());

// Crawl with this chain...
await chain.apply("https://other.com/page");       // denied: domain
await chain.apply("https://example.com/admin");     // denied: pattern
await chain.apply("https://example.com/file.pdf");  // denied: content-type
await chain.apply("https://example.com/docs");      // allowed

// Get all denials
const denials = chain.getDenials();
// [
//   { url: "https://other.com/page", reason: 'Domain "other.com" is not in allowed list', filter: "domain" },
//   { url: "https://example.com/admin", reason: "Matched exclude pattern: \\/admin", filter: "url-pattern" },
//   { url: "https://example.com/file.pdf", reason: 'File extension ".pdf" is blocked', filter: "content-type" },
// ]

// Group by filter
const byFilter = chain.getDenialsByFilter();
// { "domain": [...], "url-pattern": [...], "content-type": [...] }
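
Once denials are collected, a quick per-filter count often shows which filter is cutting coverage. The sketch below assumes only the `{ url, reason, filter }` record shape shown in the output above; `countDenialsByFilter` is a hypothetical helper, not part of the library, and the sample data is copied from the example rather than produced by a real crawl.

```typescript
// Record shape taken from the getDenials() output shown above.
interface Denial {
  url: string;
  reason: string;
  filter: string;
}

// Hypothetical helper (not a library function): count denials per filter name.
function countDenialsByFilter(denials: Denial[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const d of denials) {
    counts[d.filter] = (counts[d.filter] ?? 0) + 1;
  }
  return counts;
}

// Sample data copied from the example output above.
const sampleDenials: Denial[] = [
  { url: "https://other.com/page", reason: 'Domain "other.com" is not in allowed list', filter: "domain" },
  { url: "https://example.com/admin", reason: "Matched exclude pattern: \\/admin", filter: "url-pattern" },
  { url: "https://example.com/file.pdf", reason: 'File extension ".pdf" is blocked', filter: "content-type" },
];

const summary = countDenialsByFilter(sampleDenials);
// summary is { domain: 1, "url-pattern": 1, "content-type": 1 }
console.log(summary);
```

In a real crawl you would pass `chain.getDenials()` instead of `sampleDenials`.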

Denial Reasons by Filter

| Filter | Example reasons |
| --- | --- |
| URLPatternFilter | Matched exclude pattern: \/admin; Did not match any include pattern |
| DomainFilter | Domain "other.com" is not in allowed list; Domain "ads.com" is blocked |
| ContentTypeFilter | File extension ".pdf" is blocked; File extension ".xyz" is not in allowed list |
| MaxDepthFilter | Depth 4 exceeds max depth 3 |

Backward Compatibility

The existing apply() method still returns a boolean. Use applyWithReason() when you need the reason:

// Old API — still works
const allowed = await filter.apply(url); // boolean

// New API — with reason
const result = await filter.applyWithReason(url); // { allowed, reason?, filter? }

FilterChain.apply() now records denials internally even though it still returns a boolean, so you can call getDenials() after any crawl without changing existing code.
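
To make the relationship between the two APIs concrete, here is a minimal sketch of how a boolean `apply()` could delegate to `applyWithReason()` and log denials along the way. `SketchChain`, `ReasonFilter`, and `adminFilter` are hypothetical names for illustration; this is not the library's implementation, and stopping at the first denying filter is an assumption based on the one-denial-per-URL output shown earlier.

```typescript
// Result shape taken from the applyWithReason() examples above.
interface FilterResult {
  allowed: boolean;
  reason?: string;
  filter?: string;
}

interface Denial {
  url: string;
  reason: string;
  filter: string;
}

// Hypothetical minimal filter contract; the library's real interface may differ.
interface ReasonFilter {
  applyWithReason(url: string): Promise<FilterResult>;
}

// Illustrative stand-in for FilterChain, not the library's code.
class SketchChain {
  private filters: ReasonFilter[] = [];
  private denials: Denial[] = [];

  add(filter: ReasonFilter): this {
    this.filters.push(filter);
    return this;
  }

  // Boolean API: stops at the first denying filter (assumption) and records why.
  async apply(url: string): Promise<boolean> {
    for (const f of this.filters) {
      const result = await f.applyWithReason(url);
      if (!result.allowed) {
        this.denials.push({
          url,
          reason: result.reason ?? "unknown",
          filter: result.filter ?? "unknown",
        });
        return false;
      }
    }
    return true;
  }

  // Returns a copy so callers cannot mutate the internal log.
  getDenials(): Denial[] {
    return [...this.denials];
  }
}

// Hypothetical filter that denies any URL containing "/admin".
const adminFilter: ReasonFilter = {
  async applyWithReason(url: string): Promise<FilterResult> {
    return url.includes("/admin")
      ? { allowed: false, reason: "Path is under /admin", filter: "admin-sketch" }
      : { allowed: true };
  },
};

async function demo(): Promise<Denial[]> {
  const chain = new SketchChain().add(adminFilter);
  await chain.apply("https://example.com/admin"); // resolves to false, denial recorded
  await chain.apply("https://example.com/docs");  // resolves to true, nothing recorded
  return chain.getDenials();
}
```

Callers that only need the boolean keep using `apply()` unchanged; the denial log is a side effect they can ignore or inspect later.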

Clearing Denials

chain.clearDenials();
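
When one chain is reused across several crawl batches, it can help to snapshot the denials before clearing them. The sketch below relies only on the `getDenials()` and `clearDenials()` methods named in this document; `drainDenials` and `makeFakeSource` are hypothetical helpers, and copying the array first is a precaution because this page does not say whether `getDenials()` returns a copy or the internal array.

```typescript
interface Denial {
  url: string;
  reason: string;
  filter: string;
}

// Structural type covering only the two methods named in this document.
interface DenialSource {
  getDenials(): Denial[];
  clearDenials(): void;
}

// Hypothetical helper: take a snapshot of the denials, then reset the source.
function drainDenials(source: DenialSource): Denial[] {
  // Copy first, in case the source hands out its internal array.
  const snapshot = [...source.getDenials()];
  source.clearDenials();
  return snapshot;
}

// In-memory stand-in for a chain, used only to exercise the helper.
// It clears in place, which is the case the defensive copy protects against.
function makeFakeSource(initial: Denial[]): DenialSource {
  const store = [...initial];
  return {
    getDenials: () => store,
    clearDenials: () => {
      store.length = 0;
    },
  };
}

const fake = makeFakeSource([
  { url: "https://other.com/page", reason: 'Domain "other.com" is not in allowed list', filter: "domain" },
]);

const batchDenials = drainDenials(fake);
console.log(batchDenials.length);      // 1
console.log(fake.getDenials().length); // 0
```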