UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position

Question

When I try to extract a pattern from tagged text in NLTK, I get this error: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128)`. I did not get this error at first; it only appeared after I installed some packages.

This is the code:

# -*- coding: utf-8 -*-
import codecs
import sys
import re
import sys
import nltk
from nltk.corpus import *

k =  nltk.corpus.brown.tagged_words('myfile')
for (w1,t1), (w2,t2) in nltk.bigrams(k):
    if t1 == 'NN' and  t2 == 'AJ':
       print w1, w2

This is the entire output of the code:

Traceback (most recent call last):
  File "/home/fathi/egfe.py", line 12, in <module>
    for (w1,t1), (w2,t2) in nltk.bigrams(k):
  File "/usr/local/lib/python2.7/dist-packages/nltk/util.py", line 442, in bigrams
    for item in ngrams(sequence, 2, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/nltk/util.py", line 419, in ngrams
    history.append(next(sequence))
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/tagged.py", line 241, in read_block
    for para_str in self._para_block_reader(stream):
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 564, in read_blankline_block
    line = stream.readline()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1095, in readline
    new_chars = self._read(readsize)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1322, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1352, in _incr_decode
    return self.decode(bytes, 'strict')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128)
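
The last frame of the traceback is a strict decode, and 0xc3 is a common lead byte of a two-byte UTF-8 sequence (for example, "é" is 0xc3 0xa9). The sketch below reproduces the same kind of failure outside NLTK. It assumes the file is UTF-8, which the question does not confirm, and the sample bytes are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample bytes; the real contents of 'myfile' are unknown.
raw = b'caf\xc3\xa9/NN'

try:
    # Same kind of call as the last frame of the traceback: strict ASCII decoding.
    raw.decode('ascii', 'strict')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding succeeds once the right codec is named (assuming UTF-8 input).
decoded = raw.decode('utf-8')

print(error_message)
print(repr(decoded))
```

If the same bytes fail to decode as UTF-8 as well, the file is probably in some other encoding and this sketch does not apply.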

Does `tagged_words` have an `encoding` option? If not, does it allow you to pass it already-opened files instead of filenames? — abarnert, Nov 17 '14 at 19:28
Also, what are the contents of `'myfile'`? My guess is that they're either UTF-8 or your platform's default character set (e.g., cp1252 for US Windows); if so you have to tell Python in some way, or it's going to assume they're in `sys.getdefaultencoding()`, which is usually ASCII. — abarnert, Nov 17 '14 at 19:29
Thanks for posting the full traceback. It probably doesn't help much in this case (although even that may be wrong; someone who knows NLTK better than me might find the answer because of it…), but it's always a good idea. — abarnert, Nov 17 '14 at 19:32
Did you mean to import sys twice? Also, no need to import all corpora if you only need Brown. Try updating NLTK, even to an alpha release. If that doesn't work, try using textclean. — Dan, Nov 17 '14 at 20:45
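
To illustrate the point in the comments about telling Python which encoding the file uses, here is a minimal sketch that reads a tagged file with an explicit encoding and finds the same NN/AJ pairs without going through the corpus reader. It assumes a whitespace-separated `word/TAG` layout and UTF-8 contents; `read_tagged_words`, `noun_adjective_pairs`, and the sample sentence are hypothetical names and data, not part of NLTK or of the asker's file.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_tagged_words(path, encoding='utf-8', sep='/'):
    """Yield (word, tag) pairs from a whitespace-separated word/TAG file."""
    # io.open decodes while reading, so the encoding is explicit
    # instead of falling back to the ASCII default.
    with io.open(path, 'r', encoding=encoding) as handle:
        for line in handle:
            for token in line.split():
                # Split on the last separator so words containing '/' survive.
                word, _, tag = token.rpartition(sep)
                yield word, tag


def noun_adjective_pairs(tagged):
    """Return (w1, w2) for every adjacent pair tagged NN then AJ."""
    tagged = list(tagged)
    return [(w1, w2)
            for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
            if t1 == 'NN' and t2 == 'AJ']


# Write a small made-up sample so the sketch runs on its own.
sample = u'le/DT caf\xe9/NN fran\xe7ais/AJ ferme/VB\n'
fd, path = tempfile.mkstemp(suffix='.txt')
with io.open(fd, 'w', encoding='utf-8') as handle:
    handle.write(sample)

pairs = noun_adjective_pairs(read_tagged_words(path))
os.remove(path)
print(pairs)
```

The design choice here is to decode at the file boundary and work with Unicode strings afterwards, so no later step has to guess the encoding.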

score 0 · Answer 1 · answered Nov 18 '14 at 09:05

The problem is that the installed NLTK version is not compatible with the Python version, so an older version of NLTK is required.
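
The answer does not say which versions were involved. As a rough sketch, the installed versions can be printed before deciding whether to downgrade; the variable names below are only illustrative.

```python
import sys

try:
    import nltk
    nltk_version = nltk.__version__
except ImportError:
    nltk_version = None  # NLTK is not installed in this environment

# Major.minor.micro of the running interpreter.
python_version = '%d.%d.%d' % tuple(sys.version_info[:3])

print('Python %s, NLTK %s' % (python_version, nltk_version))
```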

5555555555
