Olmocr

A toolkit designed for converting and linearizing PDFs to create datasets optimized for large language model (LLM) training and evaluation.

Language: Python
Latest Release: v0.4.12
License: Apache License 2.0

Key Features

  • PDF linearization for dataset creation
  • Optimized for large language model (LLM) workflows
  • Supports automated text extraction
  • Facilitates preparation of training datasets
  • Command-line utility for easy usage
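
To make "linearization for dataset creation" concrete, here is a minimal sketch of the general idea. It assumes page text has already been extracted and uses a hypothetical record schema. It is not Olmocr's API or output format, and the names `linearize_pages` and `to_jsonl_record` are illustrative only.

```python
import json


def linearize_pages(pages):
    """Join per-page text into one reading-order string.

    Whitespace inside each page is collapsed and empty pages are dropped.
    This is an illustrative simplification, not Olmocr's algorithm.
    """
    cleaned = [" ".join(page.split()) for page in pages]
    return "\n\n".join(text for text in cleaned if text)


def to_jsonl_record(doc_id, pages):
    """Serialize one document as a single JSON Lines record.

    The {"id", "text"} schema is an assumption made for this example.
    """
    return json.dumps(
        {"id": doc_id, "text": linearize_pages(pages)},
        ensure_ascii=False,
    )


# Example input: three pages, one of them empty.
record = to_jsonl_record("example-001", ["Title\n page one ", "", "Page  two text"])
print(record)
```

Writing one such record per document to a `.jsonl` file gives a dataset that most LLM training pipelines can stream line by line.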

Alternative Tools

  • pdftotext
  • pdfminer.six
  • unstructured
  • pdfplumber


Community

Stars: 16.2k
Open Issues: 47
Forks: 1.2k