
When I execute the code below

import networkx as nx
import numpy as np
from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    # split the document into sentences
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)

    # bag-of-words counts per sentence, then TF-IDF weighting
    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)

    # pairwise similarity between sentence vectors
    similarity_graph = normalized * normalized.T

    # run PageRank over the sentence-similarity graph
    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("QC")
txt = fp.read()
sents = textrank(txt)
print sents

I get the following error

Traceback (most recent call last):
  File "Textrank.py", line 44, in <module>
    sents = textrank(txt)
  File "Textrank.py", line 10, in textrank
    sentences = sentence_tokenizer.tokenize(document)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

I am executing the code on Ubuntu. To get the text, I referred to this website: https://uwaterloo.ca/institute-for-quantum-computing/quantum-computing-101. I created a file QC (not QC.txt) and copy-pasted the data paragraph by paragraph into the file. Kindly help me resolve the error. Thank you.
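For context, the byte 0xe2 in the traceback is the lead byte of UTF-8 punctuation such as curly quotes and en dashes, which a copy-paste from a web page typically brings along. A minimal diagnostic sketch (assuming Python 2, as in the traceback) to locate the first non-ASCII byte in the file:

# diagnostic sketch (Python 2): find the first non-ASCII byte in QC
with open("QC") as fp:
    raw = fp.read()
for i, ch in enumerate(raw):
    if ord(ch) > 127:
        print i, hex(ord(ch)), repr(ch)
        break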

  • Welcome to Stack Overflow! Please look at https://stackoverflow.com/help/how-to-ask. Also, please search Google or elsewhere for the problem before posting a question. – alvas Apr 07 '17 at 07:03
  • Sorry, I was not able to understand the existing solutions. EDIT: I just rechecked the link and now it makes a bit of sense, but I could not figure out how the solution fits in. I am new to Python and have taken a dive into NLP, so I get overwhelmed easily. Please excuse me for that. – Rupjit Chakraborty Apr 07 '17 at 07:12

1 Answer


Please see if the following works for you.

import networkx as nx
import numpy as np
import sys

# change Python 2's default string encoding from ascii to utf8 so that
# the implicit str/unicode conversions inside NLTK stop raising
# UnicodeDecodeError on non-ASCII bytes
reload(sys)
sys.setdefaultencoding('utf8')

from nltk.tokenize.punkt import PunktSentenceTokenizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

def textrank(document):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(document)

    bow_matrix = CountVectorizer().fit_transform(sentences)
    normalized = TfidfTransformer().fit_transform(bow_matrix)

    similarity_graph = normalized * normalized.T

    nx_graph = nx.from_scipy_sparse_matrix(similarity_graph)
    scores = nx.pagerank(nx_graph)
    return sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

fp = open("QC")
txt = fp.read()
# with the utf8 default encoding in place, this round-trips the bytes
# through unicode instead of failing on the non-ASCII characters
sents = textrank(txt.encode('utf-8'))
print sents
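For reference, the same effect can be achieved without touching the interpreter-wide default encoding by decoding at the point of reading; a minimal sketch, assuming the QC file was saved as UTF-8:

import io

# alternative sketch: decode once at the I/O boundary instead of
# calling reload(sys)/sys.setdefaultencoding
with io.open("QC", encoding="utf-8") as fp:
    txt = fp.read()   # txt is a unicode object

sents = textrank(txt)  # the tokenizer receives unicode, so no implicit decode is needed
print sents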
Niklas Rosencrantz
  • Thank you very much. Once I get the sentences, I am printing them with `for s in sents: st = str(s[1]); print st`. When I print sents as it is, a lot of unicode-type things are displayed, but when I convert them to strings they go away. Why does this happen? – Rupjit Chakraborty Apr 07 '17 at 07:10
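A minimal illustration of that Python 2 behaviour (the sentence text here is hypothetical): printing a list shows each element's repr(), which escapes non-ASCII characters in unicode strings as \uXXXX, while printing a string directly encodes it for the terminal, so the escapes disappear:

# Python 2: repr() vs str() of unicode values
sents = [(0.5, u'Quantum computing \u2013 a primer.')]
print sents        # [(0.5, u'Quantum computing \u2013 a primer.')]
print sents[0][1]  # Quantum computing – a primer.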