PROJECT
DeepDrill
Building the most powerful structured data extractor for the web
04/2026
A while back I noticed I was extracting a lot of data from the web. From research projects to commercial work, a recurring theme was the need for structured data. The issue was that I was forced to build a custom solution for each application. ScrapeGraphAI and raw LLM calls worked reasonably well even for complex schemas, but got expensive and slow at scale. On the other hand, heuristic tools like Trafilatura are fast and cheap and great for main-content extraction, but they fall apart on complex or dynamic pages and can't handle arbitrary schemas. Rule-based DOM extraction doesn't scale either, since every site needs its own logic. I wanted something that combines the best of both worlds. The goal is simple: the flexibility of AI with the efficiency of state-of-the-art heuristics, so we can collect structured data in production from millions of pages with very little configuration.
That is why I started DeepDrill. It is a module designed for scalable and automated content extraction. The goal is to allow efficient extraction of any type of schema, with recursive crawling and smart evaluation.
Example - News Extraction
Define Schema
{
title: text
author: text
topic: value
link: url
}
+
Give URL
nytimes.com
→↓
DeepDrill
→↓
GET
{
  "title": "Trump has attacked Iran Once Again",
  "author": "Tom Collins",
  "topic": "War",
  "link": "https://nytimes.com/trump-started-another-war"
}
{
  "title": "Deepfake Nudes Are Haunting America's Teens",
  "author": "Margaret Smith",
  "topic": "Technology",
  "link": "https://nytimes.com/technology/deepfake-nudes"
}
...
For any content extractor, LLM-based or not, the first step is usually to remove all scripts, styles, ads, etc. from the HTML. This is pretty simple, and it's the first step DeepDrill employs as well. The interesting part comes after. Even after stripping the most obvious boilerplate, a lot of meaningless nesting and noise still remains. DeepDrill handles this by parsing the cleaned HTML into a semantic tree, which collapses unnecessary nesting while preserving meaningful structural relationships in the DOM. The LLM sees a compact, semantically rich representation instead of a wall of divs, and is still able to infer item relationships from the HTML structure.
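To make the nesting-collapse step concrete, here is a minimal sketch in Go. The `Node` type and `Collapse` function are illustrative names I'm assuming for this post, not DeepDrill's actual internals; a real semantic tree would carry attributes and text positions as well.

```go
package main

import "fmt"

// Node is a simplified stand-in for a parsed DOM node.
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// Collapse removes wrapper-only nodes: any node with no text of
// its own and exactly one child is replaced by that child, so a
// chain like div > div > div > p shrinks to just p.
func Collapse(n *Node) *Node {
	// Collapse all children first (bottom-up).
	kids := make([]*Node, 0, len(n.Children))
	for _, c := range n.Children {
		kids = append(kids, Collapse(c))
	}
	n.Children = kids
	// Then drop this node if it is a pure wrapper.
	if n.Text == "" && len(n.Children) == 1 {
		return n.Children[0]
	}
	return n
}

func main() {
	// div > div > div > p("hello") collapses to p("hello").
	tree := &Node{Tag: "div", Children: []*Node{
		{Tag: "div", Children: []*Node{
			{Tag: "div", Children: []*Node{
				{Tag: "p", Text: "hello"},
			}},
		}},
	}}
	fmt.Println(Collapse(tree).Tag) // p
}
```

Collapsing bottom-up matters: inner wrappers must disappear first so that an outer wrapper can see it now has a single meaningful child.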
Other key features include multi-item extraction and automatic recursive completion of content. Simply point DeepDrill at a listing page, define your schema once, and it figures out the repeating structure, extracts every item, and can follow links to sub-pages to fill in deeper fields, all without writing site-specific logic. That is the part that actually matters at scale.
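One plausible way to detect the repeating structure on a listing page is to fingerprint each sibling subtree by its tag shape and keep the nodes sharing the most common signature. This is a hypothetical sketch of that idea, not DeepDrill's actual algorithm; `Node`, `signature`, and `repeatedItems` are names I'm making up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified stand-in for a parsed DOM node.
type Node struct {
	Tag      string
	Children []*Node
}

// signature encodes a subtree's shape, e.g. "li(a,span)".
func signature(n *Node) string {
	if len(n.Children) == 0 {
		return n.Tag
	}
	parts := make([]string, len(n.Children))
	for i, c := range n.Children {
		parts[i] = signature(c)
	}
	return n.Tag + "(" + strings.Join(parts, ",") + ")"
}

// repeatedItems returns the children of list that share the most
// common structural signature -- the likely "item" nodes.
func repeatedItems(list *Node) []*Node {
	counts := map[string]int{}
	for _, c := range list.Children {
		counts[signature(c)]++
	}
	best := ""
	for sig, n := range counts {
		if n > counts[best] {
			best = sig
		}
	}
	var items []*Node
	for _, c := range list.Children {
		if signature(c) == best {
			items = append(items, c)
		}
	}
	return items
}

func main() {
	item := func() *Node {
		return &Node{Tag: "li", Children: []*Node{{Tag: "a"}, {Tag: "span"}}}
	}
	ul := &Node{Tag: "ul", Children: []*Node{
		{Tag: "li"}, // a header row with a different shape
		item(), item(), item(),
	}}
	fmt.Println(len(repeatedItems(ul))) // 3
}
```

The nice property is that items with identical templates but different text or link targets still hash to the same signature, so the grouping is robust to content variation.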
Example — Startup Intelligence
// define what you want
schema := deepdrill.Schema{
Fields: []deepdrill.Field{
{Name: "company", Type: deepdrill.FieldTypeText, Hint: "company name"},
{Name: "funding_stage", Type: deepdrill.FieldTypeText, Hint: "latest funding round", Options: []string{"pre-seed", "seed", "series-a", "series-b+"}},
{Name: "sector", Type: deepdrill.FieldTypeText, Hint: "industry vertical", Options: []string{"ai", "fintech", "biotech", "devtools", "climate"}},
{Name: "amount_usd", Type: deepdrill.FieldTypeValue, Hint: "funding amount in USD, numeric only"},
{Name: "investors", Type: deepdrill.FieldTypeText, Hint: "lead investors, comma separated"},
{Name: "is_hiring", Type: deepdrill.FieldTypeFlag, Hint: "any signals of active hiring"},
},
}
// point at a url
results, _ := deepdrill.Fill(ctx, schema, deepdrill.Options{
URL: "https://techcrunch.com/category/venture",
Multiple: true,
})
// get structured json
[
{ "company": "Archetype AI", "funding_stage": "series-a", "sector": "ai", "amount_usd": 13000000, "investors": "Sequoia, Intel Capital", "is_hiring": true },
{ "company": "Supaglue", "funding_stage": "seed", "sector": "devtools", "amount_usd": 3400000, "investors": "a16z", "is_hiring": false },
...
]
There is a lot more on the roadmap — smarter tree splitting, heuristic caching to skip the LLM entirely on repeat visits, and eventually a small fine-tuned model trained on accumulated extraction data. But the foundation is there and it already works.
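The heuristic-caching idea from the roadmap could look roughly like this: hash the page's tag skeleton so that pages built from the same template share a cache key, then reuse whatever extraction rules the LLM produced on the first visit. This is a hedged sketch with made-up names (`fingerprint`, `extract`), not a committed design.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint hashes a page's tag skeleton so that two pages
// with the same template (but different text) map to one key.
func fingerprint(tagSequence []string) string {
	h := sha256.New()
	for _, t := range tagSequence {
		h.Write([]byte(t))
		h.Write([]byte{0}) // separator to avoid ambiguity
	}
	return hex.EncodeToString(h.Sum(nil))
}

// cache maps a template fingerprint to the extraction rules the
// LLM produced on the first visit (represented here as strings).
var cache = map[string][]string{}

// extract returns cached rules when the template is known,
// and only falls back to the (expensive) llm call on a miss.
func extract(tags []string, llm func() []string) []string {
	key := fingerprint(tags)
	if rules, ok := cache[key]; ok {
		return rules // cache hit: no LLM call
	}
	rules := llm() // cache miss: ask the LLM once
	cache[key] = rules
	return rules
}

func main() {
	calls := 0
	llm := func() []string { calls++; return []string{"h1", ".byline"} }
	page := []string{"html", "body", "article", "h1", "p"}
	extract(page, llm)
	extract(page, llm) // same template: served from cache
	fmt.Println(calls) // 1
}
```

On a crawl of millions of pages from a handful of templates, a hit rate anywhere near the template reuse rate would eliminate most LLM calls entirely.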
RELATED
These papers present some interesting related ideas
HtmlRAG
Tan et al. (2024) — proposes using HTML directly instead of plain text in RAG pipelines, with cleaning and block-tree pruning to reduce noise. Similar pruning philosophy to DeepDrill, but designed for query answering rather than structured extraction.
AXE
Mansour et al. (2026) — treats the DOM as a tree to prune rather than text to read, uses a 0.6B LLM with grounded XPath resolution to keep extractions traceable. Strong zero-shot results on SWDE. No caching, no multi-item or nested schema support, no recursive extraction — which is where DeepDrill picks up.