Olmocr
A toolkit designed for converting and linearizing PDFs to create datasets optimized for large language model (LLM) training and evaluation.
Language
Python
Latest Release
v0.4.27
License
Apache License 2.0
Key Features
- PDF linearization for dataset creation
- Optimized for large language model (LLM) workflows
- Supports automated text extraction
- Facilitates preparation of training datasets
- Command-line utility for easy usage
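To make "linearization" concrete, here is a minimal sketch of the general idea: per-page text is flattened into a single reading-order string and stored as one JSON line per document. This is not Olmocr's API or output format. The function names (`linearize_pages`, `to_jsonl_record`) and the `id`/`text` record fields are assumptions chosen purely for illustration, and the input is plain strings standing in for already-extracted page text.

```python
import json
from typing import Iterable


def linearize_pages(pages: Iterable[str]) -> str:
    """Join per-page text into one reading-order string, dropping blank lines.

    Hypothetical helper for illustration; not part of Olmocr.
    """
    lines = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if stripped:  # skip empty or whitespace-only lines
                lines.append(stripped)
    return "\n".join(lines)


def to_jsonl_record(doc_id: str, pages: Iterable[str]) -> str:
    """Serialize one document as a single JSON line (assumed schema: id, text)."""
    return json.dumps({"id": doc_id, "text": linearize_pages(pages)})


# Two stand-in "pages" of already-extracted text.
record = to_jsonl_record(
    "example-doc",
    ["Title\n\n  First paragraph. ", "Second page text.\n"],
)
print(record)
```

Writing one such line per document produces a JSONL file, a layout commonly used for LLM training and evaluation corpora.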
Community
Stars
17.1k
Open Issues
74
Forks
1.4k