Some links about OCR
Scan as an image (in a PDF), then recognize the characters: https://help.ubuntu.com/community/OCR
- use `tesseract -l fra` to tell tesseract that the content is in French
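A minimal sketch of how the `-l fra` option fits into a full call, assuming the usual `tesseract IMAGE OUTPUTBASE [options] [configfile]` form. The helper name `tesseract_command` and the file names are made up for illustration; the code only builds the argument list and does not run anything.

```python
import shlex


def tesseract_command(image, output_base, lang="fra", searchable_pdf=False):
    """Build the argument list for one tesseract run (nothing is executed).

    image and output_base are hypothetical file names; tesseract appends
    the extension (.txt, or .pdf with the "pdf" config) to output_base.
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    if searchable_pdf:
        # The "pdf" config file asks for a PDF with a searchable text layer.
        cmd.append("pdf")
    return cmd


# Plain-text output for a French page (file names are placeholders).
print(shlex.join(tesseract_command("page-001.png", "page-001")))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)`, provided tesseract and the `fra` language data are installed.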
OCRFeeder did the job (converting a PDF to text).
- gocr
- ocrfeeder
- paperwork
- simple-scan
- tesseract
- xsane
Other useful programs:
- gscan2pdf
- scantailor - interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, and others. You give it raw scans, and you get pages ready to be printed or assembled into a PDF or DJVU file. Scanning, optical character recognition, and assembling multi-page documents are out of scope of this project
- skanlite
- zbar - bar code scanner
Processing articles
- scan consistently, with the scan area matched to the size of the magazine/book
- 300 dpi in color, about 20 MB/page
- only keep the relevant scanned area (top and bottom; possibly the same area on every page), e.g. 20 x 28 (margins can be dropped...)
- rotate some pages by 180°: a book may be best scanned right side up, then upside down; with the same ruler positions horizontally and vertically, the scanned zone should be the same
- then compress them; they can be stored as PNG in a PDF file (for example)
- run OCR (character recognition) when possible, stored as a searchable front layer
- add tags
- possibly feed corrections back to improve the OCR step, and test better scanning options
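A quick check of the "about 20 MB/page" figure, as a rough sketch. It assumes that the 20 x 28 area is in centimetres and that the scan is stored uncompressed as 24-bit RGB (3 bytes per pixel); neither is stated in the notes, and the function name is made up.

```python
def raw_scan_bytes(width_cm, height_cm, dpi=300, bytes_per_pixel=3):
    """Estimate the uncompressed size of one scanned page in bytes.

    Assumes dimensions in centimetres and 3 bytes per pixel (24-bit color).
    """
    width_px = round(width_cm / 2.54 * dpi)   # 2.54 cm per inch
    height_px = round(height_cm / 2.54 * dpi)
    return width_px * height_px * bytes_per_pixel


size = raw_scan_bytes(20, 28)
print(f"{size / 1e6:.1f} MB")  # prints "23.4 MB"
```

Under these assumptions a page is 2362 x 3307 pixels, which is consistent with the order of magnitude noted above and shows why compressing afterwards is worthwhile.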
Obsolete programs in 2018
They are no longer packaged for Mageia, perhaps in need of a packager?
- clara
- ocrad
- ocropus