
Here is my Python code for sentence splitting:

    import nltk.data
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    fp = open("newoutput.en")
    data1 = fp.read()
    print '\n-----\n'.join(tokenizer.tokenize(data1))

but on executing it, I get the following error:

Traceback (most recent call last):
  File "pythontokeniser.py", line 7, in <module>
    print '\n-----\n'.join(tokenizer.tokenize(data1))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 313, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 312, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 581, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 546, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)
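
The failure happens before punkt does any real work: `fp.read()` returns raw bytes, and Python 2 implicitly decodes them as ASCII inside the tokenizer. A minimal sketch of the same error (the sample bytes are an assumption; the byte 0xe2 in the traceback is consistent with UTF-8 punctuation such as a curly apostrophe):

```python
# Minimal reproduction: 0xe2 begins a UTF-8 multi-byte sequence (here a
# right single quotation mark, U+2019) that the ASCII codec rejects.
raw = b"doesn\xe2\x80\x99t"      # bytes as they would come from fp.read()
try:
    raw.decode("ascii")          # what Python 2 attempts implicitly
except UnicodeDecodeError as exc:
    print(exc)                   # 'ascii' codec can't decode byte 0xe2 ...
print(raw.decode("utf-8"))       # an explicit, correct codec works
```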
  • And? Have you searched "UnicodeDecodeError: 'ascii' codec can't decode byte" on the net? – bruno desthuilliers Feb 05 '18 at 12:14
    Possible duplicate of [How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"](https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte) – Ori Marko Apr 10 '18 at 06:01

1 Answer


Open the file in binary mode as below:

    fp = open("newoutput.en", 'rb')

or try decoding the contents using the "ISO-8859-1" encoding.
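
A self-contained sketch of the explicit-encoding approach. The sample file contents and the utf-8 codec are assumptions here; substitute whatever encoding the file actually uses. `io.open` exists in both Python 2 and 3, and with `encoding=` it returns unicode text, which punkt can tokenize safely:

```python
import io

# Create a sample file with UTF-8 bytes (stands in for the real
# "newoutput.en" from the question).
with io.open("newoutput.en", "wb") as f:
    f.write(b"It doesn\xe2\x80\x99t crash now.")

# Read it back with an explicit encoding: data1 is unicode text,
# not raw bytes, so no implicit ASCII decode happens later.
with io.open("newoutput.en", encoding="utf-8") as fp:
    data1 = fp.read()

print(data1)
```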

  • binary mode only works on Windows and won't solve an encoding problem on a text file anyway. And explicitly specifying the encoding (which is the thing to do) requires you to know the effective encoding used for the file, else you will STILL have encoding issues, so advising to use any random encoding is just useless. – bruno desthuilliers Feb 05 '18 at 12:20
  • I faced a similar problem, and solved it by reading the file in binary mode and using the encoding I've mentioned above. – gB08 Feb 05 '18 at 12:25
  • Solving a problem means you understand both the problem and the solution. IOW you did not "solve" anything, just managed to "kind of make it work" by mere accident. Next time you'll get the same problem with a file encoded in utf-16 or CP1252 or any of the numerous common encodings, and you'll find out that encoding using "ISO-8859-1" will not "solve" anything. – bruno desthuilliers Feb 05 '18 at 12:30
  • Ohh, can you suggest how to track which encoding we should use based on the error, or how to decide? – gB08 Feb 05 '18 at 14:37
  • If you don't know your file's encoding and it doesn't have a BOM, then you have a problem indeed. There are heuristics to try and guess (cf. UnicodeDammit) but that's still a guess. – bruno desthuilliers Feb 06 '18 at 06:29
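
The guessing heuristic mentioned in the last comment is what libraries like `chardet` and BeautifulSoup's UnicodeDammit implement. A crude stdlib-only sketch of the same idea, with a hypothetical `guess_decode` helper (not from any library): try likely codecs in order and keep the first that decodes cleanly. ISO-8859-1 accepts every byte value, so it must come last as a catch-all, and a clean decode is still only a guess, not proof of the right encoding:

```python
def guess_decode(raw, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, codec) for the first candidate codec that decodes
    raw without error, or (None, None) if all fail."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, UnicodeError):
            continue
    return None, None

text, codec = guess_decode(b"doesn\xe2\x80\x99t")
print(codec)   # utf-8
```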