
I would like to extract text from PDF files. I successfully installed tesseract (it works in the Terminal) and textract (following these instructions).

However, when I run the following code, I get an error.

import textract

text = textract.process(
    '/Users/Text/en.pdf',
    method='tesseract',
    language='eng',
)

The error is:

/usr/local/lib/python3.4/site-packages/textract-1.4.0-py3.4.egg/textract/parsers/pdf_parser.py in extract_tesseract(self, filename, **kwargs)
     62                 page_content = TesseractParser().extract(page_path, **kwargs)
     63                 contents.append(page_content)
---> 64             return ''.join(contents)
     65         finally:
     66             shutil.rmtree(temp_dir)

TypeError: sequence item 0: expected str instance, bytes found
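
For reference, the same TypeError can be reproduced in Python 3 without textract or tesseract. This is only a minimal sketch: the `contents` list below is a made-up stand-in, assuming each page result arrives as bytes, which is what the error message suggests.

```python
# Hypothetical stand-in for the per-page OCR results; in textract the real
# values come from TesseractParser().extract(), which is not called here.
contents = [b"page one\n", b"page two\n"]

message = None
try:
    # Joining bytes items with a str separator is not allowed in Python 3.
    text = ''.join(contents)
except TypeError as exc:
    message = str(exc)
    print(message)
```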

I tried several modifications to that return statement, but none of them worked; I got the same error each time.

  1. return b''.join(contents)
  2. Insert contents = [str(item) for item in contents] before return
  3. Insert contents = [item.decode("utf-8") for item in contents] before return
user51966
  • You were absolutely right - my apologies. I somehow missed that line in your modifications. I tested your example here - tesseract 3.03 and leptonica 1.70, on a Slackware Linux system, and it seems to run without problem from the command line. Did you try that? `tesseract test_text.png out` – jcoppens Jul 16 '16 at 00:46
  • BTW, I removed my reply to avoid confusion... – jcoppens Jul 16 '16 at 00:46
  • @jcoppens Thanks for checking. Actually, I also tried in Command Line. But since I need to edit the text in Python after extraction, the most convenient way is doing everything in Python. – user51966 Jul 16 '16 at 12:56
  • I believe I tried once to use the tesseract wrapper, but couldn't get it to work properly. In this answer: http://stackoverflow.com/questions/29923827/extract-cow-number-from-image/29927200#29927200, I've called tesseract 'manually' from the program, and its output is then processed inside Python. Don't panic - the code is long because it also includes GUI code. The call is in the first class (RecognizeDigits) - the two lines of code starting with `out`. – jcoppens Jul 16 '16 at 14:06

1 Answer


I asked the same question on the Japanese Stack Overflow (スタックオーバーフロー) and got a solution there. The following is my translation of the core part (thanks, @mjy).

Note: I have only confirmed that this modification works for English.

  1. In line 64 of pdf_parser.py, change return ''.join(contents) to

    return "".join(item.decode('utf-8') if isinstance(item, bytes) else item for item in contents)
    
  2. With that change in place, a different error occurs:

    NameError: name 'unicode' is not defined

  3. In line 54 of utils.py, change if isinstance(text, unicode): (...cont...) to

    if isinstance(text, str):
        return text
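
The idea behind step 1 can be tried outside textract. The following is only a sketch: `join_pages` and its `encoding` parameter are hypothetical names used for illustration, not part of textract, and the sample inputs are made up.

```python
def join_pages(contents, encoding="utf-8"):
    """Join per-page results into one str, decoding any bytes items first.

    Hypothetical helper mirroring the one-line change from step 1;
    it is not a textract function.
    """
    return "".join(
        item.decode(encoding) if isinstance(item, bytes) else item
        for item in contents
    )


# Made-up sample input: one bytes item and one str item.
pages = [b"first page\n", "second page\n"]
print(join_pages(pages))
```

Because each item is checked with isinstance before decoding, the same expression works whether the list holds bytes, str, or a mix of both.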
    
user51966