Using tesseract in Python3 textract library

Question

I would like to extract text from PDF files. I successfully installed tesseract (it works in the terminal) and textract (by following the installation instructions).

However, when I run the following code, I get an error.

import textract

text = textract.process(
    '/Users/Text/en.pdf',
    method='tesseract',
    language='eng',
)

The error is:

/usr/local/lib/python3.4/site-packages/textract-1.4.0-py3.4.egg/textract/parsers/pdf_parser.py in extract_tesseract(self, filename, **kwargs)
     62                 page_content = TesseractParser().extract(page_path, **kwargs)
     63                 contents.append(page_content)
---> 64             return ''.join(contents)
     65         finally:
     66             shutil.rmtree(temp_dir)

TypeError: sequence item 0: expected str instance, bytes found

I tried several modifications to pdf_parser.py, but none of them worked and I still got the same error:

Replacing the return statement with return b''.join(contents)
Inserting contents = [str(item) for item in contents] before the return
Inserting contents = [item.decode("utf-8") for item in contents] before the return

You were absolutely right - my apologies. I somehow missed that line in your modifications. I tested your example here - tesseract 3.03 and leptonica 1.70, on a Slackware Linux system, and it seems to run without problem from the command line. Did you try that? `tesseract test_text.png out` — jcoppens, Jul 16 '16 at 00:46
@jcoppens Thanks for checking. Actually, I also tried in Command Line. But since I need to edit the text in Python after extraction, the most convenient way is doing everything in Python. — user51966, Jul 16 '16 at 12:56
I believe I tried once to use the tesseract wrapper, but couldn't get it to work properly. In this answer: http://stackoverflow.com/questions/29923827/extract-cow-number-from-image/29927200#29927200, I've called tesseract 'manually' from the program, and its output is then processed inside Python. Don't panic - the code is long because it also includes GUI code. The call is in the first class (RecognizeDigits) - the two lines of code starting with `out`. — jcoppens, Jul 16 '16 at 14:06
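
The "manual" approach described in the comment above could look roughly like the following. This is only a sketch, not the code from the linked answer: the helper names `build_tesseract_command` and `ocr_image` are made up for illustration, and it assumes the `tesseract` executable is on the PATH and is called as `tesseract <image> <output base> -l <language>`, which writes its result to `<output base>.txt`.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base, language="eng"):
    """Return the argument list for: tesseract <image> <output_base> -l <language>."""
    return ["tesseract", image_path, output_base, "-l", language]


def ocr_image(image_path, language="eng"):
    """Run the tesseract CLI on one image and return the recognized text as str."""
    with tempfile.TemporaryDirectory() as temp_dir:
        output_base = os.path.join(temp_dir, "out")
        # tesseract appends ".txt" to the output base name.
        subprocess.check_call(
            build_tesseract_command(image_path, output_base, language)
        )
        # Reading in text mode yields str, so no bytes/str mixing occurs later.
        with open(output_base + ".txt", encoding="utf-8") as handle:
            return handle.read()
```

Calling `ocr_image('test_text.png')` would then return the text as a normal Python string that can be edited further in the same script.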

score 2 · Accepted Answer · edited Apr 13 '17 at 12:52

I asked the same question on the Japanese Stack Overflow (スタックオーバーフロー) and got a solution. The following is my translation of the core part (thanks, @mjy).

Note: This modification works at least for English.

In line 64 of pdf_parser.py, change return ''.join(contents) to:

return "".join(item.decode('utf-8') if isinstance(item, bytes) else item for item in contents)
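
To see what this expression does outside of textract, here is a small standalone sketch. The function name `join_pages` and the sample `pages` list are invented for the example; only the join expression itself comes from the fix above. It decodes each bytes item as UTF-8 and leaves str items untouched, so a list that mixes both types can be joined.

```python
def join_pages(contents):
    """Join page texts, decoding any bytes items as UTF-8 first."""
    return "".join(
        item.decode("utf-8") if isinstance(item, bytes) else item
        for item in contents
    )


# Hypothetical page contents mixing bytes and str.
pages = [b"first page\n", "second page\n"]
print(join_pages(pages))
```

A plain `''.join(pages)` on the same list raises the TypeError shown in the question, because `str.join` accepts only str items in Python 3.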

However, another error then occurs:

NameError: name 'unicode' is not defined

In line 54 of utils.py, change if isinstance(text, unicode): (...cont...) to
```
if isinstance(text, str):
    return text
```
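
The NameError arises because Python 3 has no `unicode` type; `str` is the text type there. If the same check needs to run on both Python 2 and Python 3, one possible pattern is sketched below. The names `text_type` and `to_text` are hypothetical and are not part of textract; this is an illustration of the idea, not the library's actual code.

```python
try:
    text_type = unicode  # Python 2: the text type is called unicode
except NameError:
    text_type = str  # Python 3: unicode does not exist, str is the text type


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a bytes object."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


print(to_text(b"already bytes"))
print(to_text("already text"))
```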
