
There are a lot of similar questions, and I have tried every solution I could find, but I can't work this out. I am doing Named Entity Recognition with the Stanford NER tagger, and this is my code:

from nltk.tag import StanfordNERTagger
st = StanfordNERTagger(r'stanford-ner\classifiers\english.all.3class.distsim.crf.ser.gz',
                       r'stanford-ner\stanford-ner.jar', encoding='utf-8')
tuple_list = st.tag("Please pay €94 million.".split())
print(tuple_list)

This is the error I get.

Traceback (most recent call last):
File "C:/Users/Dell/PycharmProjects/CSSOP/ner2.py", line 4, in <module>
tuple_list = st.tag("He was the subject of the most expensive association football transfer when he moved from Manchester United to Real Madrid in 2009 in a transfer worth €94 million ($132 million).".split())
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tag\stanford.py", line 71, in tag
return sum(self.tag_sents([tokens]), []) 
File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tag\stanford.py", line 95, in tag_sents
stanpos_output = stanpos_output.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 247: invalid start byte

Edit: This is not a file-opening encoding issue, as was suggested in the other similar question.

user6446052
  • Possible duplicate of ['utf-8' codec can't decode byte 0x80](http://stackoverflow.com/questions/36825972/utf-8-codec-cant-decode-byte-0x80) – DYZ Apr 30 '17 at 07:04
  • Are you **certain** that the encoding is `'utf-8'`, and not (eg) `'Windows-1252'`? – PM 2Ring Apr 30 '17 at 08:44
  • The encoding is `cp1252`. `0x80` is the Euro character in that encoding. – alexis Apr 30 '17 at 09:44

1 Answer


You are getting a decoding error when NLTK's Stanford wrapper tries to read back the output of the Stanford recognizer (which is a Java program). Clearly the recognizer has produced output that is not valid UTF-8. Evidently it does not check the data you pass it before writing it out, so the problem is only discovered when Python tries to read it back in.

Now, the Windows-1252 code page table shows that 0x80 is how that encoding represents the Euro symbol. The implication is clear: your Python source uses the Windows-1252 encoding, so that is what your string literal contains. The right solution here would be to switch your editor to UTF-8 and fix the encoding of your program.
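To see the mismatch in isolation, here is a minimal sketch that does not involve the Stanford tagger at all. It only shows that the Euro sign is the single byte 0x80 in Windows-1252, and that decoding that byte as UTF-8 fails with the same message as in the traceback. The helper name `try_decode_utf8` is made up for this illustration.

```python
# The Euro sign as a Python 3 (Unicode) string.
euro = "\u20ac"

cp1252_bytes = euro.encode("cp1252")  # one byte: 0x80
utf8_bytes = euro.encode("utf-8")     # three bytes: 0xE2 0x82 0xAC

print(cp1252_bytes, utf8_bytes)


def try_decode_utf8(data):
    """Hypothetical helper: return decoded text, or the error message on failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return str(exc)


# Decoding the cp1252 byte as UTF-8 reproduces the error from the question.
message = try_decode_utf8(cp1252_bytes)
print(message)

# Decoding with the encoding that actually produced the byte works.
print(cp1252_bytes.decode("cp1252"))
```

The position reported here is 0 rather than 247 because the Euro sign is the only character in this test string.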

This behavior would make sense if you were using Python 2, but your snippet appears to be Python 3 (it uses the function form of print), so please clarify before I venture an alternative fix.
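If it helps with the clarification, a small diagnostic along these lines would report the interpreter version and the encodings in play. This is only a sketch using standard library calls; the function name `describe_environment` and the dictionary keys are invented for the example.

```python
import locale
import sys


def describe_environment():
    """Hypothetical helper: collect the version and encoding settings in one dict."""
    return {
        "python_version": "%d.%d" % sys.version_info[:2],
        "default_str_encoding": sys.getdefaultencoding(),
        # The platform's preferred encoding; on Western European Windows
        # setups this is commonly cp1252, but check your own machine.
        "preferred_locale_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


info = describe_environment()
for key, value in info.items():
    print(key, "=", value)
```

Posting the output of this alongside the question would settle whether the snippet really runs under Python 3 and which encoding the platform prefers.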

alexis