TiffDjvuOcr



Overview:

GUI frontend to convert Scan Tailor tiff output to an OCR'ed, searchable djvu file.

 


Supported OS:

Tested only on Windows XP.

by Nod5  -  Free Software GPL3  -  AutoHotkey

Known issues:
Old software, not tested on Windows 10 or with the latest version of Tesseract.
The OCR step can in some cases miss a character, which leaves all subsequent OCR words one character off. That bug needs fixing before this tool is fit for use again.

How to use:

Drag and drop a file onto a command.

The first command takes a .tiff as input,
operates on all .tiff in the dropped file's folder and
outputs an OCR'ed, searchable .djvu file.

- for use on .tiff from Scan Tailor
- operates on *all* .tiff in same folder as dropped file
- uses -lossy setting to minimize djvu file size
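
The "all .tiff in same folder" behaviour described above can be sketched as follows. This is only an illustration in Python, not the tool's actual AutoHotkey code; the function name sibling_tiffs and the case-insensitive sort by file name are assumptions.

```python
from pathlib import Path

def sibling_tiffs(dropped_file):
    """Return every .tif/.tiff file in the folder of dropped_file.

    The list is sorted case-insensitively by file name (an assumed
    page order; the real tool may order pages differently).
    """
    folder = Path(dropped_file).resolve().parent
    matches = [
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in {".tif", ".tiff"}
    ]
    return sorted(matches, key=lambda p: p.name.lower())
```

Dropping any one page of a Scan Tailor output folder would then select the whole set, which is why the output folder should contain only the pages meant for the final .djvu.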

 

Dependencies (try the latest Windows binary versions):
1. DjvuLibre ,  djvu.sourceforge.net
2. Tesseract 3 ,  https://github.com/tesseract-ocr/tesseract
Check the ReadMe/FAQ on the site; two downloads are needed:
tesseract-3.00.win32.zip
eng.traineddata.gz (unpack and put in subfolder tesseract-ocr\tessdata )
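
A quick way to check whether the dependencies are reachable is sketched below in Python. The executable names are assumptions: "tesseract" for Tesseract, and "cjb2" and "djvused" as examples of DjvuLibre command line tools. TiffDjvuOcr may call other binaries, or look for them in its own subfolders rather than on PATH.

```python
import shutil

# Assumed executable names; the binaries TiffDjvuOcr actually calls may differ.
REQUIRED_TOOLS = ("tesseract", "cjb2", "djvused")

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All assumed dependencies were found.")
```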

 

Command line use:
TiffDjvuOcr.exe "C:.tif"          = all .tif in folder C: to .djvu with OCR
TiffDjvuOcr.exe noocr "C:.tif"    = all .tif in folder C: to .djvu
TiffDjvuOcr.exe "C:.djvu"         = do OCR on a.djvu
TiffDjvuOcr.exe gettif "C:.djvu"  = extract multipage .tif from a.djvu
TiffDjvuOcr.exe img "C:.jpg"      = single image file to .djvu
TiffDjvuOcr.exe join "C:.djvu"    = join all .djvu in C: into one
TiffDjvuOcr.exe noloss "C:.tiff"  = all .tif in folder C: to .djvu with no-loss setting (bigger file; use if the smaller djvu gets character errors)
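
The command forms listed above follow one pattern: an optional keyword, then a file path, with the file extension selecting the action when no keyword is given. A minimal Python sketch of that dispatch is shown here. It is not the tool's own AutoHotkey code, and the mode labels "tiff_to_djvu_ocr" and "ocr_djvu" are hypothetical names chosen for illustration.

```python
from pathlib import Path

# Keywords taken from the command list; each is followed by one file path.
KEYWORDS = {"noocr", "gettif", "img", "join", "noloss"}

def parse_command(args):
    """Map an argument list (without the program name) to (mode, path)."""
    if len(args) == 2 and args[0] in KEYWORDS:
        return args[0], args[1]
    if len(args) == 1:
        ext = Path(args[0]).suffix.lower()
        if ext in {".tif", ".tiff"}:
            return "tiff_to_djvu_ocr", args[0]  # hypothetical label
        if ext == ".djvu":
            return "ocr_djvu", args[0]          # hypothetical label
    raise ValueError(f"unrecognized arguments: {args!r}")
```

For example, parse_command(["noocr", "page.tif"]) yields the mode "noocr", while parse_command(["book.djvu"]) falls through to the extension check.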

 

md5 hashes:

50bc4f32bd7e1b91311bf725a65dc416 TiffDjvuOcr.ahk
36d2633fdecbe4502fdbb49d0babed06 TiffDjvuOcr.exe
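
To compare a downloaded file against the hashes above, something like the following Python sketch could be used. The helper names md5_of_file and matches_published_hash are illustrative, and the file is assumed to sit at the path passed in.

```python
import hashlib

# Hashes as published in this document.
PUBLISHED_MD5 = {
    "TiffDjvuOcr.ahk": "50bc4f32bd7e1b91311bf725a65dc416",
    "TiffDjvuOcr.exe": "36d2633fdecbe4502fdbb49d0babed06",
}

def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, name):
    """Check the file at path against the published hash for name."""
    return md5_of_file(path) == PUBLISHED_MD5[name]
```

MD5 only guards against accidental corruption of the download; it is not a strong protection against deliberate tampering.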

Changelog:
v110305 New commands: to .djvu no-loss, join .djvu, img to .djvu; AutoHotkey_L compatible.
v101013 ImageMagick no longer needed; now using Tesseract 3; fixed OCR error on pages with no text
v100605 Perl no longer needed for processing tesseract output (thanks ewemoa!)
v100404 first release

  • Downloads 181
  • File Count 1
  • Create Date February 21, 2018
  • Last update 2018-02-21 17:00:21
  • Last Updated February 23, 2018