
DeepDrill

Building the most powerful structured data extractor for the web

04/2026

A while back I noticed I was extracting a lot of data from the web. From research projects to commercial work, a recurring theme was the need for structured data, yet I was forced to build a custom solution for each application. ScrapeGraphAI and raw LLM calls handled even complex schemas reasonably well, but became expensive and slow at scale. On the other hand, heuristic tools like Trafilatura are fast and cheap and great for main-content extraction, but they fall apart on complex or dynamic pages and can't handle arbitrary schemas. Rule-based DOM extraction doesn't scale either, since every site needs its own logic. I wanted something that combines the best of both worlds. The goal is simple: get the flexibility of AI with the efficiency of state-of-the-art heuristics. This way we can collect structured data in production from millions of pages with very little configuration.

That is why I started DeepDrill. It is a module designed for scalable, automated content extraction. The goal is to allow efficient extraction of any type of schema, with recursive crawling and smart evaluation.

Example - News Extraction

Define Schema

{
title: text
author: text
topic: value
link: url
}

+

Give URL

nytimes.com

DeepDrill

GET

{
"title": "Trump has attacked Iran Once Again",
"author": "Tom Collins",
"topic": "War",
"link": "https://nytimes.com/trump-started-another-war"
}

{
"title": "Deepfake Nudes Are Haunting America's Teens",
"author": "Margaret Smith",
"topic": "Technology",
"link": "https://nytimes.com/technology/deepfake-nudes"
}

...

For any content extractor, LLM-based or not, the first step is usually to remove all scripts, styles, ads, and other obvious boilerplate from the HTML. This is pretty simple, and it's the first step DeepDrill employs as well. The interesting part comes after. Even with the most obvious boilerplate stripped, a lot of meaningless nesting and noise still exists. DeepDrill handles this by parsing the cleaned HTML into a semantic tree, which collapses unnecessary nesting while preserving meaningful structural relationships in the DOM. The LLM sees a compact, semantically rich representation instead of a wall of divs, and is still able to infer item relationships from the HTML structure.
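To make the collapsing idea concrete, here is a minimal sketch of the technique. DeepDrill's internals aren't shown in this post, so the node type and collapse rule below are illustrative assumptions: a wrapper element with no text of its own and exactly one child carries no semantic information, so chains like div > div > article reduce to just article.

```go
package main

import "fmt"

// node is a simplified DOM node; a hypothetical stand-in for
// whatever structure the real semantic tree uses.
type node struct {
	Tag      string
	Text     string
	Children []*node
}

// collapse recursively replaces any node that has no text and
// exactly one child with that child, flattening wrapper chains.
func collapse(n *node) *node {
	kids := make([]*node, 0, len(n.Children))
	for _, c := range n.Children {
		kids = append(kids, collapse(c))
	}
	n.Children = kids
	if n.Text == "" && len(n.Children) == 1 {
		return n.Children[0]
	}
	return n
}

func main() {
	// div > div > article("story body")
	tree := &node{Tag: "div", Children: []*node{
		{Tag: "div", Children: []*node{
			{Tag: "article", Text: "story body"},
		}},
	}}
	out := collapse(tree)
	fmt.Println(out.Tag, out.Text) // article story body
}
```

The same pass is where a real implementation would also drop empty nodes and merge adjacent text, but the single-child rule alone already removes most of the "wall of divs".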

Other key features include multi-item extraction and automatic recursive completion of content. Simply point DeepDrill at a listing page, define your schema once, and it figures out the repeating structure, extracts every item, and can follow links to sub-pages to fill in deeper fields, all without writing site-specific logic. That is the part that actually matters at scale.
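One common way to "figure out the repeating structure" is to group siblings by a shallow structural signature and treat the largest group as the item list. The post doesn't describe DeepDrill's actual heuristic, so the sketch below is an assumption of how such a pass could look:

```go
package main

import (
	"fmt"
	"strings"
)

// node is a simplified DOM element for illustration only.
type node struct {
	Tag      string
	Children []*node
}

// signature summarizes a node's shallow shape, e.g. "li(a,span)".
func signature(n *node) string {
	tags := make([]string, len(n.Children))
	for i, c := range n.Children {
		tags[i] = c.Tag
	}
	return n.Tag + "(" + strings.Join(tags, ",") + ")"
}

// repeatingItems returns the largest group of siblings sharing a
// signature: the heuristic guess for "one node per listed item".
func repeatingItems(parent *node) []*node {
	groups := map[string][]*node{}
	for _, c := range parent.Children {
		sig := signature(c)
		groups[sig] = append(groups[sig], c)
	}
	var best []*node
	for _, g := range groups {
		if len(g) > len(best) {
			best = g
		}
	}
	return best
}

func main() {
	list := &node{Tag: "ul", Children: []*node{
		{Tag: "li", Children: []*node{{Tag: "a"}, {Tag: "span"}}},
		{Tag: "li", Children: []*node{{Tag: "a"}, {Tag: "span"}}},
		{Tag: "li", Children: []*node{{Tag: "a"}, {Tag: "span"}}},
		{Tag: "div"}, // an ad or divider with a different shape
	}}
	fmt.Println(len(repeatingItems(list))) // 3
}
```

Once the repeating nodes are identified, the schema only needs to be resolved against one of them; the mapping then applies to every sibling, which is what makes per-page LLM cost amortize across items.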

Example — Startup Intelligence

// define what you want
schema := deepdrill.Schema{
	Fields: []deepdrill.Field{
		{Name: "company", Type: deepdrill.FieldTypeText, Hint: "company name"},
		{Name: "funding_stage", Type: deepdrill.FieldTypeText, Hint: "latest funding round", Options: []string{"pre-seed", "seed", "series-a", "series-b+"}},
		{Name: "sector", Type: deepdrill.FieldTypeText, Hint: "industry vertical", Options: []string{"ai", "fintech", "biotech", "devtools", "climate"}},
		{Name: "amount_usd", Type: deepdrill.FieldTypeValue, Hint: "funding amount in USD, numeric only"},
		{Name: "investors", Type: deepdrill.FieldTypeText, Hint: "lead investors, comma separated"},
		{Name: "is_hiring", Type: deepdrill.FieldTypeFlag, Hint: "any signals of active hiring"},
	},
}

// point at a url
results, _ := deepdrill.Fill(ctx, schema, deepdrill.Options{
	URL:      "https://techcrunch.com/category/venture",
	Multiple: true,
})

// get structured json
[
	{ "company": "Archetype AI", "funding_stage": "series-a", "sector": "ai", "amount_usd": 13000000, "investors": "Sequoia, Intel Capital", "is_hiring": true },
	{ "company": "Supaglue", "funding_stage": "seed", "sector": "devtools", "amount_usd": 3400000, "investors": "a16z", "is_hiring": false },
	...
]

There is a lot more on the roadmap — smarter tree splitting, heuristic caching to skip the LLM entirely on repeat visits, and eventually a small fine-tuned model trained on accumulated extraction data. But the foundation is there and it already works.

© 2026 Neon Stack