PROJECT
DeepDrill
Building the most powerful structured data extractor for the web
04/2026
A while back I noticed I was extracting a lot of data from the web. From research projects to commercial work, a recurring theme was the need for structured data. The issue was that I was forced to build a custom solution for each application. ScrapeGraphAI and raw LLM calls worked reasonably well even for complex schemas, but got expensive and slow at scale. On the other hand, heuristic tools like Trafilatura are fast and cheap and great for main-content extraction, but they fall apart on complex or dynamic pages and can't handle arbitrary schemas. Rule-based DOM extraction doesn't scale either, since every site needs its own logic. I wanted something that combines the best of both worlds. The goal is simple: the flexibility of AI with the efficiency of state-of-the-art heuristics, so we can collect structured data in production from millions of pages with very little configuration.
That is why I started DeepDrill. It is a module designed for scalable and automated content extraction. The goal is to allow efficient extraction of any type of schema, with recursive crawling and smart evaluation.
Example - News Extraction
Define Schema
{
title: text
author: text
topic: value
link: url
}
+
Give URL
nytimes.com
→↓
DeepDrill
→↓
GET
{
  "title": "Trump has attacked Iran Once Again",
  "author": "Tom Collins",
  "topic": "War",
  "link": "https://nytimes.com/trump-started-another-war"
}
{
  "title": "Deepfake Nudes Are Haunting America's Teens",
  "author": "Margaret Smith",
  "topic": "Technology",
  "link": "https://nytimes.com/technology/deepfake-nudes"
}
...
For any content extractor, LLM-based or not, the first step is usually to remove all scripts, styles, ads, etc. from the HTML. This is pretty simple, and it's the first step DeepDrill employs as well. The interesting part comes after. Even after stripping the most obvious boilerplate, a lot of meaningless nesting and noise still remains. DeepDrill handles this by parsing the cleaned HTML into a semantic tree, which collapses unnecessary nesting while preserving meaningful structural relationships in the DOM. The LLM sees a compact, semantically rich representation instead of a wall of divs, and is still able to infer item relationships from the HTML structure.
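To make the nesting-collapse step concrete, here is a minimal sketch in Go. The `Node` type and `Collapse` function are illustrative names I'm assuming for this post, not DeepDrill's actual internals; a real semantic tree would carry attributes and text positions as well.

```go
package main

import "fmt"

// Node is a simplified stand-in for a parsed DOM node.
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// Collapse removes wrapper-only nodes: any node with no text of
// its own and exactly one child is replaced by that child, so a
// chain like div > div > div > p shrinks to just p.
func Collapse(n *Node) *Node {
	// Collapse all children first (bottom-up).
	kids := make([]*Node, 0, len(n.Children))
	for _, c := range n.Children {
		kids = append(kids, Collapse(c))
	}
	n.Children = kids
	// Then drop this node if it is a pure wrapper.
	if n.Text == "" && len(n.Children) == 1 {
		return n.Children[0]
	}
	return n
}

func main() {
	// div > div > div > p("hello") collapses to p("hello").
	tree := &Node{Tag: "div", Children: []*Node{
		{Tag: "div", Children: []*Node{
			{Tag: "div", Children: []*Node{
				{Tag: "p", Text: "hello"},
			}},
		}},
	}}
	fmt.Println(Collapse(tree).Tag) // p
}
```

Collapsing bottom-up matters: inner wrappers must disappear first so that an outer wrapper can see it now has a single meaningful child.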
Other key features include multi-item extraction and automatic recursive completion of content. Simply point DeepDrill at a listing page, define your schema once, and it figures out the repeating structure, extracts every item, and can follow links to sub-pages to fill in deeper fields, all without writing site-specific logic. That is the part that actually matters at scale.
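One plausible way to detect the repeating structure on a listing page is to fingerprint each sibling subtree by its tag shape and keep the nodes sharing the most common signature. This is a hypothetical sketch of that idea, not DeepDrill's actual algorithm; `Node`, `signature`, and `repeatedItems` are names I'm making up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified stand-in for a parsed DOM node.
type Node struct {
	Tag      string
	Children []*Node
}

// signature encodes a subtree's shape, e.g. "li(a,span)".
func signature(n *Node) string {
	if len(n.Children) == 0 {
		return n.Tag
	}
	parts := make([]string, len(n.Children))
	for i, c := range n.Children {
		parts[i] = signature(c)
	}
	return n.Tag + "(" + strings.Join(parts, ",") + ")"
}

// repeatedItems returns the children of list that share the most
// common structural signature -- the likely "item" nodes.
func repeatedItems(list *Node) []*Node {
	counts := map[string]int{}
	for _, c := range list.Children {
		counts[signature(c)]++
	}
	best := ""
	for sig, n := range counts {
		if n > counts[best] {
			best = sig
		}
	}
	var items []*Node
	for _, c := range list.Children {
		if signature(c) == best {
			items = append(items, c)
		}
	}
	return items
}

func main() {
	item := func() *Node {
		return &Node{Tag: "li", Children: []*Node{{Tag: "a"}, {Tag: "span"}}}
	}
	ul := &Node{Tag: "ul", Children: []*Node{
		{Tag: "li"}, // a header row with a different shape
		item(), item(), item(),
	}}
	fmt.Println(len(repeatedItems(ul))) // 3
}
```

The nice property is that items with identical templates but different text or link targets still hash to the same signature, so the grouping is robust to content variation.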
Example — Startup Intelligence
// define what you want
schema := deepdrill.Schema{
Fields: []deepdrill.Field{
{Name: "company", Type: deepdrill.FieldTypeText, Hint: "company name"},
{Name: "funding_stage", Type: deepdrill.FieldTypeText, Hint: "latest funding round", Options: []string{"pre-seed", "seed", "series-a", "series-b+"}},
{Name: "sector", Type: deepdrill.FieldTypeText, Hint: "industry vertical", Options: []string{"ai", "fintech", "biotech", "devtools", "climate"}},
{Name: "amount_usd", Type: deepdrill.FieldTypeValue, Hint: "funding amount in USD, numeric only"},
{Name: "investors", Type: deepdrill.FieldTypeText, Hint: "lead investors, comma separated"},
{Name: "is_hiring", Type: deepdrill.FieldTypeFlag, Hint: "any signals of active hiring"},
},
}
// point at a url
results, _ := deepdrill.Fill(ctx, schema, deepdrill.Options{
URL: "https://techcrunch.com/category/venture",
Multiple: true,
})
// get structured json
[
{ "company": "Archetype AI", "funding_stage": "series-a", "sector": "ai", "amount_usd": 13000000, "investors": "Sequoia, Intel Capital", "is_hiring": true },
{ "company": "Supaglue", "funding_stage": "seed", "sector": "devtools", "amount_usd": 3400000, "investors": "a16z", "is_hiring": false },
...
]
There is a lot more on the roadmap — smarter tree splitting, heuristic caching to skip the LLM entirely on repeat visits, and eventually a small fine-tuned model trained on accumulated extraction data. But the foundation is there and it already works.
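The heuristic-caching idea from the roadmap could look roughly like this: hash the page's tag skeleton so that pages built from the same template share a cache key, then reuse whatever extraction rules the LLM produced on the first visit. This is a hedged sketch with made-up names (`fingerprint`, `extract`), not a committed design.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint hashes a page's tag skeleton so that two pages
// with the same template (but different text) map to one key.
func fingerprint(tagSequence []string) string {
	h := sha256.New()
	for _, t := range tagSequence {
		h.Write([]byte(t))
		h.Write([]byte{0}) // separator to avoid ambiguity
	}
	return hex.EncodeToString(h.Sum(nil))
}

// cache maps a template fingerprint to the extraction rules the
// LLM produced on the first visit (represented here as strings).
var cache = map[string][]string{}

// extract returns cached rules when the template is known,
// and only falls back to the (expensive) llm call on a miss.
func extract(tags []string, llm func() []string) []string {
	key := fingerprint(tags)
	if rules, ok := cache[key]; ok {
		return rules // cache hit: no LLM call
	}
	rules := llm() // cache miss: ask the LLM once
	cache[key] = rules
	return rules
}

func main() {
	calls := 0
	llm := func() []string { calls++; return []string{"h1", ".byline"} }
	page := []string{"html", "body", "article", "h1", "p"}
	extract(page, llm)
	extract(page, llm) // same template: served from cache
	fmt.Println(calls) // 1
}
```

On a crawl of millions of pages from a handful of templates, a hit rate anywhere near the template reuse rate would eliminate most LLM calls entirely.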
RELATED
These papers present some interesting related ideas
HtmlRAG
Tan et al. (2024) — proposes using HTML directly instead of plain text in RAG pipelines, with cleaning and block-tree pruning to reduce noise. Similar pruning philosophy to DeepDrill, but designed for query answering rather than structured extraction.
AXE
Mansour et al. (2026) — treats the DOM as a tree to prune rather than text to read, uses a 0.6B LLM with grounded XPath resolution to keep extractions traceable. Strong zero-shot results on SWDE. No caching, no multi-item or nested schema support, no recursive extraction — which is where DeepDrill picks up.