Olmocr

A toolkit designed for converting and linearizing PDFs to create datasets optimized for large language model (LLM) training and evaluation.

Language: Python
Latest Release: v0.4.27
License: Apache License 2.0

Key Features

  • PDF linearization for dataset creation
  • Optimized for large language model (LLM) workflows
  • Supports automated text extraction
  • Facilitates preparation of training datasets
  • Command-line utility for easy usage
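
As a rough illustration of what "linearization for dataset creation" can mean, the sketch below joins per-page text into a single reading-order string and serializes it as one JSON Lines record. It is not Olmocr's API or output format: the function names `linearize_pages` and `to_jsonl`, the record fields (`id`, `text`, `page_spans`), and the assumption that page text has already been extracted are all hypothetical and chosen only for this example.

```python
import json
from typing import Dict, List


def linearize_pages(doc_id: str, pages: List[str]) -> Dict:
    """Join per-page text into one string and record each page's character span.

    Hypothetical helper for illustration; not part of Olmocr.
    """
    chunks = []
    spans = []
    offset = 0
    for page_number, raw in enumerate(pages, start=1):
        # Collapse layout whitespace (line breaks, runs of spaces) to single spaces.
        cleaned = " ".join(raw.split())
        if not cleaned:
            continue  # skip pages with no extractable text
        if chunks:
            offset += 1  # account for the "\n" that separates pages
        start = offset
        offset += len(cleaned)
        chunks.append(cleaned)
        spans.append({"page": page_number, "start": start, "end": offset})
    return {"id": doc_id, "text": "\n".join(chunks), "page_spans": spans}


def to_jsonl(records: List[Dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    record = linearize_pages("doc-1", ["Hello   world", "", "Second\npage"])
    print(to_jsonl([record]))
```

Keeping the page spans alongside the flattened text lets a downstream consumer trace any stretch of training text back to the page it came from.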

Alternative Tools

  • pdftotext
  • pdfminer.six
  • unstructured
  • pdfplumber


Community

Stars: 17.1k
Open Issues: 74
Forks: 1.4k