text = textract.process("/Users/dg/Downloads/Data Wrangling/syllabi/82445.pdf")

I tried to read this file, but it throws the following error:

'charmap' codec can't decode byte 0x9d in position 6583: character maps to <undefined>

Why does it throw this error? How do I fix it?

coderina
Dgao
  • Does this answer your question? [UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>](https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character) – Tomerikoo Feb 06 '21 at 13:21

3 Answers


You can try to solve this error in two ways:

The first is to write the path as a raw string, r"THEPATH", so that Python takes the characters of the path literally instead of interpreting backslashes as escape sequences. For example: text = r"/Users/dg/Downloads/Data Wrangling/syllabi/82445.pdf"

Alternatively, you can double each "/", such as: "//Users//dg//Downloads//Data Wrangling//syllabi//82445.pdf" (this will work the same way).

Hopefully this helps :) Feel free to ask any further questions.


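I could do it like this:

# Open in binary mode ("rb"): a PDF is not plain text, and text mode ("r")
# would try to decode it with the platform's default codec.
with open("/Users/dg/Downloads/Data Wrangling/syllabi/82445.pdf", "rb") as file:
    data = file.read()  # raw bytes of the PDF; the with block closes the file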

That's an encoding problem.

Textract uses chardet to detect the encoding of the PDF file (utf-8, latin1, cp1252, etc.). Detecting a file's encoding is not always easy, and chardet can get it wrong. In your case, it seems to have failed for this particular PDF.
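
To see why a wrong guess produces exactly this message, here is a minimal sketch in plain Python (no textract involved). It assumes, purely for illustration, that the text contains a curly closing quote stored as UTF-8; the actual content of your PDF is unknown. The helper name `try_decode` is hypothetical.

```python
from typing import Optional


def try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Return the error message if raw cannot be decoded, else None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Assumed sample text: the closing curly quote U+201D is stored in UTF-8
# as the bytes E2 80 9D, and byte 0x9D has no mapping in cp1252.
raw = "\u201csyllabus\u201d".encode("utf-8")

print(try_decode(raw, "cp1252"))  # a 'charmap' error that mentions byte 0x9d
print(try_decode(raw, "utf-8"))   # None: the right encoding decodes cleanly
```

The bytes are the same in both calls; only the codec used to interpret them changes.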

If you know the encoding of your file, then you could use the input_encoding parameter like this:

textract.process(filename, input_encoding="cp1252", output_encoding="utf8")

(see issue 309 in the links below)

Note that the encoding parameter specifies the output encoding, not the input encoding. So, writing

text = textract.process(filename, encoding='ascii')

means that you want to write the output file with ascii encoding. But it doesn't mean that ascii is the encoding of your input file.
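
The difference between the two encodings can be sketched in plain Python. This is only an illustration of the idea, not textract's internal code, and `transcode` is a hypothetical helper name: the input encoding says how to read the bytes you have, and the output encoding says how to write the bytes you return.

```python
def transcode(raw: bytes, input_encoding: str, output_encoding: str) -> bytes:
    """Decode bytes using the source encoding, then re-encode them for output."""
    text = raw.decode(input_encoding)    # bytes -> str, needs the *input* encoding
    return text.encode(output_encoding)  # str -> bytes, uses the *output* encoding


source = "caf\u00e9".encode("cp1252")          # b'caf\xe9'
result = transcode(source, "cp1252", "utf-8")
print(result)                                  # b'caf\xc3\xa9'
```

Changing only the output encoding does not help if the input encoding is wrong, because the failure happens in the decode step.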

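A note about chardet: you can guess the encoding of a file with it like this:

import chardet

# chardet.detect expects raw bytes, so read the file in binary mode first
with open(filename, "rb") as f:
    raw = f.read()
guessed_encoding = chardet.detect(raw)
print(guessed_encoding)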

And it will output something like this:

{'encoding': 'EUC-JP', 'confidence': 0.99}

Or:

{'encoding': 'EUC-JP', 'confidence': 0.24}

Here you can see that there is a confidence key. In the first example, chardet is very confident that the encoding is EUC-JP, but that's not the case in the second example.

You could run chardet on the PDF file that causes the problem and check its confidence score.
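
One way to act on the confidence score is to accept the guess only above some threshold. The sketch below works on the dictionary that chardet returns, so it does not import chardet itself. The helper name `pick_encoding`, the 0.8 threshold, and the utf-8 fallback are all arbitrary assumptions chosen for illustration.

```python
def pick_encoding(guess: dict, min_confidence: float = 0.8,
                  fallback: str = "utf-8") -> str:
    """Use the guessed encoding only if the confidence is high enough."""
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return fallback  # assumption: fall back to utf-8 when the guess is weak


print(pick_encoding({'encoding': 'EUC-JP', 'confidence': 0.99}))  # EUC-JP
print(pick_encoding({'encoding': 'EUC-JP', 'confidence': 0.24}))  # utf-8
```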

Useful links:

https://github.com/deanmalmgren/textract/issues/309

https://github.com/deanmalmgren/textract/issues/164

Rivers
  • Thanks, I just followed your suggestion, and it shows me this: "{'encoding': 'IBM866', 'confidence': 0.5119798157077455, 'language': 'Russian'}", but that is definitely an English file. I tried to use encoding = 'IBM866', but it still doesn't work. – Dgao Feb 06 '21 at 20:19
  • As I said in my answer, you have to use the parameter named `input_encoding`, not the parameter named `encoding`. So to try with utf-8, you would write `textract.process(filename, input_encoding="utf8")`. Note too that the naming convention for utf-8 in `textract` is unclear. Generally in Python it's the string "utf-8", but on the GitHub and in the docs of `textract` I saw only "utf8" and "utf_8". So if "utf8" doesn't work, try the others too. If that doesn't work either, try other encodings. – Rivers Feb 07 '21 at 11:44
  • It will not work with `IBM866`, because this output is just telling you which encoding was used by `textract`, the one with which it failed. A confidence score of `0.5` is very low, which means it's probably not the right encoding. So you have to try another encoding. And if the above doesn't work, you could also try calling `process` and then `decode` explicitly, as stated here: https://github.com/deanmalmgren/textract/issues/203, or even create an issue on textract's GitHub. – Rivers Feb 07 '21 at 11:56
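
Trying several encodings in turn, as suggested in the comments above, can be sketched in plain Python on raw bytes. The helper name `decode_with_fallback` and the candidate list are assumptions for illustration; latin-1 is placed last because it maps every byte and therefore never fails, even when the result is not meaningful.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes,
    candidates: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(b"caf\xe9")
print(used)  # cp1252, because the bytes are not valid utf-8
```

A successful decode only means the bytes were valid in that encoding, so it is still worth checking the resulting text by eye.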