Olmocr
A toolkit designed for converting and linearizing PDFs to create datasets optimized for large language model (LLM) training and evaluation.
Language
Python
Latest Release
v0.4.12
License
Apache License 2.0
Key Features
- PDF linearization for dataset creation
- Output geared toward large language model (LLM) training and evaluation
- Supports automated text extraction
- Facilitates preparation of training datasets
- Command-line utility for running conversions
Community
Stars
16.2k
Open Issues
47
Forks
1.2k