An intelligent PDF extraction router that classifies pages and routes them to optimal processing backends (PyMuPDF, Docling, OCR). Includes confidence scoring and auto-reextraction to prevent silent failures in RAG pipelines.

What it does

pdfmux is a PDF extraction router. It classifies each page of a PDF as digital, scanned, or table-based and routes it to the backend best suited to that page type: PyMuPDF, Docling, OCR, or an optional LLM fallback. Each page receives a confidence score; pages with low scores are flagged and automatically re-extracted, which prevents silent failures in RAG (Retrieval-Augmented Generation) systems.
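
The classify-then-route idea can be illustrated with a minimal sketch. Everything here is hypothetical: the PageStats fields, the thresholds, and the mapping of page types to backends are assumptions made for illustration and are not taken from pdfmux's code.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements; not pdfmux's actual data model."""
    text_chars: int        # characters found in the embedded text layer
    image_coverage: float  # fraction of the page area covered by images (0.0 to 1.0)
    ruled_lines: int       # number of straight vector lines, a rough hint of a table


def classify_page(stats: PageStats) -> str:
    """Label a page as 'scanned', 'table', or 'digital' using assumed thresholds."""
    # Almost no embedded text plus a large image usually indicates a scan.
    if stats.text_chars < 50 and stats.image_coverage > 0.5:
        return "scanned"
    # Many ruled lines suggest a grid, so treat the page as table-based.
    if stats.ruled_lines >= 8:
        return "table"
    return "digital"


# Assumed mapping from page type to backend name.
BACKENDS = {"digital": "pymupdf", "table": "docling", "scanned": "ocr"}


def route(stats: PageStats) -> str:
    """Return the name of the backend that should process this page."""
    return BACKENDS[classify_page(stats)]


if __name__ == "__main__":
    print(route(PageStats(text_chars=2400, image_coverage=0.0, ruled_lines=0)))
    print(route(PageStats(text_chars=3, image_coverage=0.95, ruled_lines=0)))
    print(route(PageStats(text_chars=800, image_coverage=0.1, ruled_lines=14)))
```

Routing per page rather than per document means a mostly digital file with a few scanned pages only pays the OCR cost on those pages.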

How to set up

Installation requires no configuration: run pip install pdfmux. The package includes a built-in MCP (Model Context Protocol) server, so it can be integrated into AI workflows and applications without additional setup.

Key features

Automatic page classification (digital, scanned, tables)
Intelligent routing to optimal extraction backends
Per-page confidence scoring with quality flagging
Automatic re-extraction of low-quality pages
Built-in MCP server for easy integration
OCR support for scanned documents
Zero-configuration installation
Prevents silent RAG failures through quality monitoring
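
The confidence scoring and re-extraction features listed above could work roughly as in the following sketch. The scoring heuristic, the 0.8 threshold, and the function names are assumptions for illustration only; pdfmux's real scoring is not described on this page.

```python
from typing import Callable, Sequence, Tuple


def confidence(text: str) -> float:
    """Toy heuristic: share of characters that are alphanumeric or whitespace."""
    if not text:
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() for ch in text)
    return ok / len(text)


def extract_with_fallback(
    page: object,
    extractors: Sequence[Callable[[object], str]],
    threshold: float = 0.8,  # assumed cutoff, chosen only for this example
) -> Tuple[str, float, bool]:
    """Try extractors in order until one meets the threshold.

    Returns (best_text, best_score, flagged). flagged is True when no
    extractor reached the threshold, so the caller can surface the page
    instead of silently indexing poor text.
    """
    best_text, best_score = "", 0.0
    for extract in extractors:
        text = extract(page)
        score = confidence(text)
        if score > best_score:
            best_text, best_score = text, score
        if best_score >= threshold:
            break  # good enough, skip the more expensive extractors
    return best_text, best_score, best_score < threshold


if __name__ == "__main__":
    # Stand-in extractors: the first returns garbled text, the second clean text.
    fast = lambda page: "\ufffd\ufffd\ufffd a"
    ocr = lambda page: "Invoice total 42"
    print(extract_with_fallback(None, [fast, ocr]))
```

Returning the flag alongside the text lets a RAG pipeline decide whether to index, skip, or review a page rather than accepting whatever the first backend produced.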

NameetP/pdfmux

Similar skills

Monday.com MCP Server

Sentry MCP Server

Cloudflare MCP Server