
I am trying to use pypdfocr in Windows 7 with Python 2.7.

This is the error message I get when I run pypdfocr from cmd:

C:\Users\chamar.stu>pypdfocr F:\test2.pdf
Starting conversion of F:\test2.pdf
'pdfimages' is not recognized as an internal or external command,
operable program or batch file.
WARNING: Could not execute pdfimages to calculate DPI (try installing xpdf or poppler?), so defaulting to 300dpi
Traceback (most recent call last):
  File "c:\users\chamar.stu\appdata\local\continuum\anaconda2\lib\runpy.py", line 174, in _run_module_as_main
... .... ....
pypdfocr\pypdfocr_tesseract.py", line 98, in _is_version_uptodate
    ver = [int(x) for x in ver_str.split('.')]
ValueError: invalid literal for int() with base 10: '00alpha'

It seems that I am missing Poppler or Xpdf, but I did install Poppler via PyGObject as suggested here. I've also added xpdf to my PATH environment variable as suggested here.

Any suggestions to get me out of this little mess?

Plug4

2 Answers


The pypdfocr script is probably calling the pdfimages program (one of the poppler utilities, not the library) using the subprocess module.

I could not easily discern if the utilities were provided in the URI you mention.

If not, you can find pre-built ms-windows executables for the utilities e.g. here.

Make sure that the directory where the poppler utilities are installed is in your PATH, so that pypdfocr can find them.
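To confirm that the utilities are actually visible from Python, a small PATH lookup can help. The following is a minimal sketch, not part of pypdfocr; `find_on_path` is a hypothetical helper name, and it assumes the tools are installed under their usual names `pdfimages` and `tesseract`. It is written to run on Python 2.7 as well as Python 3 (where `shutil.which` does a similar job).

```python
import os


def find_on_path(program, search_path=None):
    """Return the full path of `program` if a directory on the search
    path contains it, otherwise None. (Hypothetical helper.)"""
    if search_path is None:
        search_path = os.environ.get("PATH", "")

    # On Windows, executables carry an extension such as .exe,
    # so try the bare name plus each extension listed in PATHEXT.
    extensions = [""]
    if os.name == "nt":
        extensions += os.environ.get("PATHEXT", ".EXE").split(os.pathsep)

    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        for ext in extensions:
            candidate = os.path.join(directory, program + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    for tool in ("pdfimages", "tesseract"):
        print("%s -> %s" % (tool, find_on_path(tool) or "NOT FOUND on PATH"))
```

If `pdfimages` is reported as not found, the directory holding the poppler executables still needs to be added to PATH. Open a new cmd window after changing it, since a window that is already open keeps the old value.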

Roland Smith
  • OK, thanks. The link to the Poppler .exe on the website is down; I'll have to wait for it to come back up. – Plug4 Mar 17 '17 at 11:16

Try downgrading Tesseract from version 4.0.0-beta.1 (in my case) to a 3.x release whose version string contains no letters.

tesseract --version #to check

The version check built into the pypdfocr package expects every dot-separated piece of the version string to be an integer, hence the error on '00alpha' ('0-beta' in my case).
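To see why the check fails, the list comprehension from the traceback can be reproduced on its own. The sketch below also shows one possible tolerant parser. `strict_parse` and `tolerant_parse` are hypothetical names, this is not the code pypdfocr ships, and the sample string '3.05.00alpha' is only an assumed example that ends in the '00alpha' piece from the error message.

```python
import re


def strict_parse(ver_str):
    # Same expression as the line shown in the traceback.
    return [int(x) for x in ver_str.split('.')]


def tolerant_parse(ver_str):
    """Keep the leading digits of each dot-separated piece and stop
    at the first piece that carries a suffix such as 'alpha'."""
    parts = []
    for piece in ver_str.strip().split('.'):
        match = re.match(r'(\d+)(.*)$', piece)
        if match is None:
            break  # piece does not start with a digit
        parts.append(int(match.group(1)))
        if match.group(2):
            break  # pre-release suffix found; ignore what follows
    return parts


if __name__ == "__main__":
    for sample in ("3.02.02", "3.05.00alpha", "4.0.0-beta.1"):
        try:
            print("%s strict: %s" % (sample, strict_parse(sample)))
        except ValueError as exc:
            print("%s strict: ValueError (%s)" % (sample, exc))
        print("%s tolerant: %s" % (sample, tolerant_parse(sample)))
```

Patching an installed package this way is a workaround at best; installing a Tesseract release with a purely numeric version string avoids touching pypdfocr at all.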

Eduard Florinescu