Wiki source for Blog20151023UsingOCR



OCR stands for optical character recognition. I used it to extract the text from a PDF that contained only images and no text :/

===Some links about OCR===
Scan the pages as images (in a PDF), then recognize the characters:
https://help.ubuntu.com/community/OCR
~- pass `-l fra` to `tesseract` to tell it that the content is in French
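
As a minimal sketch of the language option above, the snippet below wraps the `tesseract` command line in Python. The helper names `tesseract_command` and `ocr_image` and the file name `page-01.png` are hypothetical, chosen for illustration only; it assumes tesseract and its `fra` language data are installed.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="fra"):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_image(image, lang="fra"):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    # tesseract appends ".txt" to the output base by itself
    output_base = Path(image).with_suffix("")
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_image("page-01.png")` should leave a `page-01.txt` file next to the image and return its content; use `lang="eng"` for English text.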

OCRFeeder did the job (converting the PDF to text).

~- gocr
~- ocrfeeder
~- paperwork
~- simple-scan
~- tesseract
~- xsane

Other useful programs:
~- gscan2pdf
~- scantailor - interactive post-processing tool for scanned pages. It performs operations such as page splitting, deskewing, adding/removing borders, and others. You give it raw scans, and you get pages ready to be printed or assembled into a PDF or DJVU file. Scanning, optical character recognition, and assembling multi-page documents are out of scope of this project
~- skanlite
~- zbar - bar code scanner

=== processing articles ===
~- scan consistently, cropped to the size of the magazine/book
~~- 300 dpi in color, about 20 MB per page
~~- keep only the relevant area of the scan, using the same area whether the page is scanned right side up or upside down, e.g. 20 x 28 (margins can be left out...)
~~- rotate some pages by 180°: a book may be best scanned right side up, then upside down; with the same horizontal/vertical ruler positions, the scanned zone should be the same
~~- then compress the scans; they can be stored as PNG images in a PDF file, for example
~- run OCR (recognize characters) when possible, and store the text as a searchable front layer
~- add tags
~- possibly use corrections to improve the OCR results, and test better scanning options
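
The figures in the list above can be checked with a little arithmetic, sketched here in Python. It assumes that the 20 x 28 area is in centimetres and that a color scan uses 3 bytes per pixel; neither is stated in the notes. The second helper builds a `tesseract` command that writes a PDF with a searchable text layer; using tesseract for that step is my assumption, and both function names are hypothetical.

```python
CM_PER_INCH = 2.54


def raw_scan_size(width_cm, height_cm, dpi=300, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_cm / CM_PER_INCH * dpi)
    height_px = round(height_cm / CM_PER_INCH * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


def searchable_pdf_command(image, output_base, lang="fra"):
    """Build: tesseract IMAGE OUTPUT_BASE -l LANG pdf (writes OUTPUT_BASE.pdf)."""
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]
```

Under these assumptions `raw_scan_size(20, 28)` gives 2362 x 3307 pixels and 23,433,402 bytes, roughly 23 MB before compression, which is in line with the "about 20 MB per page" above.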


=== obsolete programs in 2018 ===
They are no longer packaged for Mageia; perhaps they need a packager?
~- clara
~- ocrad
~- ocropus
