That's an encoding problem. Textract uses chardet to detect the encoding of the pdf file (utf-8, latin1, cp1252, etc.). Detecting the encoding of a file is not always an easy task, and chardet can get it wrong. In your case, it seems that it failed for this particular pdf file.
If you know the encoding of your file, then you could use the input_encoding parameter like this:
textract.process(filename, input_encoding="cp1252", output_encoding="utf8")
(see issue 309 in the links below)
Note that the encoding parameter specifies the output encoding, not the input encoding.
So, writing
text = textract.process(filename, encoding='ascii')
means that you want the output to be encoded as ascii. It does not mean that ascii is the encoding of your input file.
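To make the input/output distinction concrete, here is a small sketch in plain Python that does not use textract at all. The sample string and variable names are made up for illustration; the point is only that bytes must be decoded with the encoding they were written in, and can then be re-encoded with whatever output encoding you want.

```python
# Illustration only: plain Python, no textract involved.
original = "caf\u00e9"  # "café"; sample text chosen for this sketch

# Pretend these are the bytes stored in the input (cp1252 is the *input* encoding).
stored_bytes = original.encode("cp1252")

# Decoding with the wrong input encoding fails (or produces garbage).
try:
    stored_bytes.decode("utf8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

# Decoding with the right input encoding recovers the text...
text = stored_bytes.decode("cp1252")

# ...and only then do you pick the *output* encoding.
output_bytes = text.encode("utf8")

print(wrong_guess_failed, text, output_bytes)
```

Choosing a different output encoding does nothing to fix a wrong guess about the input encoding, which is why changing only the encoding parameter does not help here.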
A note about chardet:
You can guess the encoding of a file with chardet like this:
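import chardet

# chardet.detect expects raw bytes, so read the file in binary mode
with open(filename, "rb") as f:
    guessed_encoding = chardet.detect(f.read())
print(guessed_encoding)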
And it will output something like this:
{'encoding': 'EUC-JP', 'confidence': 0.99}
Or:
{'encoding': 'EUC-JP', 'confidence': 0.24}
Here you can see that there is a confidence key. In the first example, chardet is very confident that the encoding is EUC-JP, but that's not the case in the second example.
You could try running chardet on the pdf file that causes the problem and see what confidence score it reports.
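If you want to act on that confidence score in code, one possible approach is a small helper like the sketch below. The function name choose_input_encoding and the 0.9 threshold are my own assumptions, not part of chardet or textract; the helper only inspects a dictionary shaped like the ones shown above.

```python
def choose_input_encoding(detection, min_confidence=0.9):
    """Return the guessed encoding only if the confidence is high enough.

    `detection` is a dict with 'encoding' and 'confidence' keys, like the
    chardet results shown above. The 0.9 threshold is an arbitrary choice
    for this sketch.
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return None  # too uncertain: better to specify the encoding yourself


# The two example results from above:
confident = {'encoding': 'EUC-JP', 'confidence': 0.99}
doubtful = {'encoding': 'EUC-JP', 'confidence': 0.24}

print(choose_input_encoding(confident))  # EUC-JP
print(choose_input_encoding(doubtful))   # None
```

When the helper returns None, the guess is not reliable, and you are probably better off passing a known encoding through input_encoding as shown earlier.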
Useful links:
https://github.com/deanmalmgren/textract/issues/309
https://github.com/deanmalmgren/textract/issues/164