Extracting tables from PDFs: a practical 2026 guide
Table extraction is where document-parsing approaches diverge most sharply. A table with clean borders and digital text is trivial for any modern tool. A scanned table with merged cells, rotated column headers, and varying row heights is where most approaches fail in ways that are difficult to detect automatically.
This guide covers four main approaches in 2026 — Tesseract OCR with LLM stitching, AWS Textract, pure-VLM pipelines, and multi-model routers — with honest notes on where each breaks.
Approach 1: Tesseract OCR + LLM stitching
The classic open-source path: run Tesseract to get character-level bounding boxes, group boxes into cells by proximity heuristics, then feed the result to an LLM to reconstruct table structure as Markdown.
Where it works: clean digital PDFs with solid-line borders, uniform row heights, and no merged cells. Character error rate (CER) on printed text is under 2% with Tesseract 5.x at 300 DPI.
Where it fails: merged cells. Tesseract boxes each character independently and has no concept of a cell spanning multiple columns. The stitching LLM infers the span from whitespace patterns — a guess that breaks on complex tables in our internal eval (benchmark in progress; public release pending). Rotated column headers trigger the same failure mode: Tesseract reads them as a separate region and the stitching step loses the association with the column below.
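To make the stitching path concrete, here is a minimal sketch of the box-grouping step using pytesseract's word-level output; the row tolerance and the downstream LLM prompt are illustrative choices, not a fixed recipe:

```python
import pytesseract
from PIL import Image

def words_to_rows(image_path: str, y_tol: int = 10) -> list[list[dict]]:
    """Group Tesseract word boxes into rows by top-coordinate proximity."""
    d = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    words = [
        {"text": d["text"][i], "x": d["left"][i], "y": d["top"][i]}
        for i in range(len(d["text"]))
        if d["text"][i].strip()
    ]
    words.sort(key=lambda w: (w["y"], w["x"]))
    rows: list[list[dict]] = []
    for w in words:
        # Same row if the word's top edge is within y_tol of the row's first word.
        if rows and abs(w["y"] - rows[-1][0]["y"]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    for row in rows:
        row.sort(key=lambda w: w["x"])  # restore left-to-right order within a row
    return rows

# Each row is then serialized (text plus x-offsets) and handed to an LLM with a
# prompt like "reconstruct this as a Markdown table" -- the step where
# merged-cell spans get guessed from whitespace.
```

This is exactly the fragile step: nothing in the word boxes says a cell spans two columns, so the LLM has to infer it.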
Approach 2: AWS Textract
Textract’s AnalyzeDocument API returns cell-level structure with MERGED_CELL block types. On the FUNSD form-understanding benchmark (Jaume et al., 2019), Textract reaches F1 in the mid-0.80s on form-field extraction. Table-specific cell-level F1 varies with document type and scan quality; AWS publishes no per-table-type benchmarks for Textract.
Where it works: US government forms, standardized invoices, financial statements from US-listed companies where formatting is consistent.
Where it fails: academic tables with irregular colspan patterns; medical tables with nested sub-headers; documents below 200 DPI; non-Latin scripts. Pricing: $0.015–$0.065/page (verify current rates on the AWS pricing page).
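For reference, a minimal boto3 sketch of reading Textract's table blocks, including the span fields on merged cells (single image page; pagination and error handling omitted):

```python
import boto3

textract = boto3.client("textract")

with open("page.png", "rb") as f:
    resp = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES"],
    )

blocks = {b["Id"]: b for b in resp["Blocks"]}

def block_text(block: dict) -> str:
    """Concatenate the WORD descendants of a CELL or MERGED_CELL block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
            elif child["BlockType"] == "CELL":
                words.append(block_text(child))  # MERGED_CELL children are CELLs
    return " ".join(words)

for b in resp["Blocks"]:
    if b["BlockType"] in ("CELL", "MERGED_CELL"):
        # MERGED_CELL blocks carry RowSpan/ColumnSpan > 1 for spanning cells.
        print(b["BlockType"], b["RowIndex"], b["ColumnIndex"],
              b.get("RowSpan", 1), b.get("ColumnSpan", 1),
              repr(block_text(b)))
```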
Approach 3: Vision-language models
A VLM sees the page as an image, interpreting visual structure the way a human reader does. On the DocVQA benchmark (Mathew et al., 2021), frontier VLMs reach ANLS above 0.90 on document-understanding tasks. DocVQA focuses on answer extraction rather than cell-level table reconstruction, so it is not a direct cell-F1 proxy — but it reflects underlying visual reasoning capability.
Where VLMs work: merged cells, rotated headers, scanned documents, tables embedded in figures, non-Latin scripts, and tables where the visual formatting is the only structural signal (no PDF text layer).
Where VLMs fail: very long tables that exceed visual resolution; hallucination on partially visible cells at page edges; token cost on large tables. Frontier model pricing: $0.003–$0.015/page, 2–8s latency per page.
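A minimal sketch of the VLM path, assuming an OpenAI-compatible chat-completions endpoint that accepts base64 images; the endpoint URL and model name are placeholders:

```python
import base64
import os

import httpx

with open("table_page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = httpx.post(
    "https://api.example.com/v1/chat/completions",  # placeholder endpoint
    headers={"Authorization": f"Bearer {os.environ['VLM_API_KEY']}"},
    json={
        "model": "frontier-vlm",  # placeholder model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe the table on this page as GitHub-flavored "
                         "Markdown. Leave cells you cannot read empty instead "
                         "of guessing."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The "leave cells empty" instruction targets the edge-of-page hallucination mode directly; it reduces silent fabrication but does not eliminate it.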
Approach 4: Multi-model routing
Route each page to the right model based on structural complexity. A clean digital-text table gets a Fast-tier model ($0.003/page, <2s). A page with merged cells, rotated headers, and a 300 DPI scan gets an Expert-tier model ($0.012/page, 4–8s). Expert-tier accuracy on the hard pages, Fast-tier cost on the easy ones. The routing decision is logged per page — so you can audit why a specific table was handled the way it was.
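The routing decision itself can be a small function. A toy sketch with made-up features and thresholds, just to show the shape of the decision:

```python
from dataclasses import dataclass

@dataclass
class PageFeatures:
    has_text_layer: bool       # digital text present, or pure scan
    merged_cell_hint: bool     # e.g. ruling lines that don't form a full grid
    rotated_text_ratio: float  # fraction of text boxes rotated beyond ~45 degrees

def complexity_score(f: PageFeatures) -> float:
    """Toy complexity score in [0, 1]; weights are illustrative only."""
    score = 0.0
    if not f.has_text_layer:
        score += 0.4
    if f.merged_cell_hint:
        score += 0.3
    score += min(f.rotated_text_ratio, 1.0) * 0.3
    return score

def route(f: PageFeatures, threshold: float = 0.5) -> str:
    # Cheap model for simple pages, expensive model for structurally hard ones.
    return "expert" if complexity_score(f) >= threshold else "fast"
```

Logging the features and score alongside the tier choice is what makes the per-page audit trail possible.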
Comparison at a glance
| Approach | Merged cells | Scanned / skewed | Approx. cost/page |
|---|---|---|---|
| Tesseract + LLM | Unreliable | Poor | $0.001–0.005 |
| AWS Textract | Good | Moderate | $0.015–0.065 |
| VLM only | Excellent | Excellent | $0.003–0.015 |
| Multi-model router | Excellent | Excellent | $0.003–0.012 (blended) |
Cost figures are approximate 2026-Q2 market rates. Cell-level F1 numbers for complex tables come from an internal benchmark still in progress; public release is pending.
Code recipe
Send a PDF URL to Docira and print the Markdown output with routing metadata per page:
```python
import os

import httpx

PDF_URL = "https://arxiv.org/pdf/1706.03762"  # replace with your document

resp = httpx.post(
    "https://api.docira.io/v1/parse?include_trace=true",
    # Read the key from the environment; a literal "$DOCIRA_API_KEY" string
    # would be sent as-is and rejected.
    headers={"Authorization": f"Bearer {os.environ['DOCIRA_API_KEY']}"},
    json={"url": PDF_URL, "output_mode": "markdown"},
    timeout=60,
)
resp.raise_for_status()
data = resp.json()

for page in data["pages"]:
    routing = page.get("routing", {})
    print(
        f"Page {page['page_number']}: "
        f"tier={routing.get('tier')}, "
        # Default to 0.0 so pages without routing metadata don't crash the
        # float formatting.
        f"complexity={routing.get('complexity_score', 0.0):.2f}, "
        f"confidence={routing.get('confidence', 0.0):.2f}"
    )
    print(page["content_markdown"])
    print("---")
```

For local files use `POST /v1/parse/upload` with multipart form data. See the API reference for structured cell output via `output_mode=json`.