When I try to extract a pattern from tagged text in NLTK, I get this error: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128)`. I did not get this error at first; it appeared only after I installed some packages.

This is the code:

# -*- coding: utf-8 -*-
import codecs
import sys
import re
import sys
import nltk
from nltk.corpus import *

k = nltk.corpus.brown.tagged_words('myfile')
for (w1,t1), (w2,t2) in nltk.bigrams(k):
    if t1 == 'NN' and t2 == 'AJ':
        print w1, w2

This is the entire output of the code:

Traceback (most recent call last):
  File "/home/fathi/egfe.py", line 12, in <module>
    for (w1,t1), (w2,t2) in nltk.bigrams(k):
  File "/usr/local/lib/python2.7/dist-packages/nltk/util.py", line 442, in bigrams
    for item in ngrams(sequence, 2, **kwargs):
  File "/usr/local/lib/python2.7/dist-packages/nltk/util.py", line 419, in ngrams
    history.append(next(sequence))
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/tagged.py", line 241, in read_block
    for para_str in self._para_block_reader(stream):
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 564, in read_blankline_block
    line = stream.readline()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1095, in readline
    new_chars = self._read(readsize)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1322, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 1352, in _incr_decode
    return self.decode(bytes, 'strict')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 79: ordinal not in range(128)
5555555555
  • Does `tagged_words` have an `encoding` option? If not, does it allow you to pass it already-opened files instead of filenames? – abarnert Nov 17 '14 at 19:28
  • Also, what are the contents of `'myfile'`? My guess is that they're either UTF-8 or your platform's default character set (e.g., cp1252 for US Windows); if so you have to tell Python in some way, or it's going to assume they're in `sys.getdefaultencoding()`, which is usually ASCII. – abarnert Nov 17 '14 at 19:29
  • Thanks for posting the full traceback. It probably doesn't help much in this case (although even that may be wrong; someone who knows NLTK better than me might find the answer because of it…), but it's always a good idea. – abarnert Nov 17 '14 at 19:32
  • Did you mean to import sys twice? Also, no need to import all corpora if you only need Brown. Try updating NLTK, even to an alpha release. If that doesn't work, try using textclean. – Dan Nov 17 '14 at 20:45
  • Which package did you install? – alvas Nov 17 '14 at 21:58
  • The package I installed is textblob. – 5555555555 Nov 17 '14 at 22:43

1 Answer

The problem is that the installed NLTK version is not compatible with the Python version, so an older version of NLTK is required.
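
To check which versions are actually in play, something like the following may help. It is a rough sketch, not a definitive diagnostic: `installed_version` is a hypothetical helper, the distribution name `nltk` is assumed, and `importlib.metadata` needs Python 3.8 or later. On the Python 2.7 shown in the traceback, printing `nltk.__version__` after `import nltk` gives the same information.

```python
import sys
from importlib import metadata  # standard library, Python 3.8+


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Interpreter version as "major.minor.micro".
python_version = "%d.%d.%d" % sys.version_info[:3]
# Assumes the package was installed under the distribution name "nltk".
nltk_version = installed_version("nltk")

print("Python:", python_version)
print("NLTK:", nltk_version or "not installed")
```

Comparing the reported NLTK version against the one that worked before the extra packages were installed shows whether the installation changed it.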

5555555555