UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

This is the error I get when trying to clean a list of names extracted with spaCy from an HTML page.

My code:

import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English
from __future__ import unicode_literals
nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( 'div', {'class': 'story-body__inner'})]

    text = ''.join(article_soup)
    return text

# using spacy
def get_names(all_tags):
    names=[]
    for ent in all_tags.ents:
        if ent.label_=="PERSON":
            names.append(str(ent))
    return names

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names] # remove 's' from names
    myset = list(set(new_names)) #remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text=get_text(url)
    text=u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)
    print "names:"
    print names
    mynewlist = cleaning_names(names)
    print mynewlist

if __name__ == '__main__':
    main()

For this particular URL I get a list of names that includes raw UTF-8 byte sequences such as '\xc2\xa3' (the bytes for £):

['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']

And then the error:

Traceback (most recent call last)
<ipython-input-19-8582e806c94a> in <module>()
     47 
     48 if __name__ == '__main__':
---> 49     main()

<ipython-input-19-8582e806c94a> in main()
     43     print "names:"
     44     print names
---> 45     mynewlist = cleaning_names(names)
     46     print mynewlist
     47 

<ipython-input-19-8582e806c94a> in cleaning_names(names)
     31 
     32 def cleaning_names(names):
---> 33     new_names = [s.strip("'s") for s in names] # remove 's' from names
     34     myset = list(set(new_names)) #remove duplicates
     35     return myset

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

I tried different ways of fixing unicode (including sys.setdefaultencoding('utf8')), but nothing worked. I hope someone has had the same issue before and will be able to suggest a fix. Thank you!

aviss
  • Clean your traceback. It is unreadable. – keepAlive May 07 '17 at 16:33
  • Not sure where the error occurs and will not reproduce because of the libraries. Does it work if you fix the list of names manually? – handle May 07 '17 at 16:36
  • Have you checked the **Related** questions, shown on the right? – handle May 07 '17 at 16:48
  • I checked the related questions, and couldn't find a solution for my case. I also tried to manipulate the list of names before passing it to the cleaning function but decoding and encoding it again didn't help. – aviss May 07 '17 at 17:02
  • Change this `text=u"{}".format(text)` to use `decode(...)` instead. – stovfl May 07 '17 at 17:44
  • I tried that before; I got this error instead: TypeError: Argument 'string' has incorrect type (expected unicode, got str) – aviss May 07 '17 at 18:08
  • The `from __future__ import ...` line won't work if it's not at the start of the script. – alvas May 10 '17 at 03:08
  • Possible duplicate of [How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"](http://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte) – alvas May 17 '17 at 14:03

3 Answers


When you get a decoding error with the 'ascii' codec, that's usually an indication that a byte string is being used in a context where a Unicode string is required. In Python 2 the implicit conversion fails with this error; Python 3 won't allow the mix at all.
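You can reproduce the mix in isolation (a minimal Python 2 REPL sketch; the byte string is the UTF-8 encoding of the '£59bn' entity from the question's output):

>>> # stripping a byte string with a unicode argument forces Python 2 to
>>> # decode the bytes with the default ascii codec, which fails on 0xc2
>>> b'\xc2\xa359bn'.strip(u"'s")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)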

Since the script uses from __future__ import unicode_literals, the literal "'s" is a Unicode string. That means the strings you're stripping must be Unicode too. Fix that and you won't get the error anymore.
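One way to keep everything Unicode from the start (a sketch against the question's get_names; ent.text is the span's text as a Unicode string in spaCy, unlike str(ent), which produces UTF-8 bytes in Python 2):

def get_names(all_tags):
    names = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            names.append(ent.text)  # unicode in, unicode out
    return names

With Unicode names, cleaning_names() strips the Unicode "'s" without any implicit decode.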

Mark Ransom

As @MarkRansom commented, ignoring non-ASCII characters is going to bite you back.

First, take a look at the duplicate linked in the comments: [How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"](http://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte)

Also, note this is an anti-pattern: Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

The easiest solution is to just use Python 3, which will spare you most of this pain:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> import spacy
>>> nlp = spacy.load('en')

>>> url = "http://www.bbc.co.uk/news/uk-politics-39784164"
>>> html = requests.get(url).content
>>> bsoup = BeautifulSoup(html, 'html.parser')
>>> text = '\n'.join(p.text for d in bsoup.find_all( 'div', {'class': 'story-body__inner'}) for p in d.find_all('p') if p.text.strip())

>>> doc = nlp(text)
>>> names = [ent for ent in doc.ents if ent.label_ == 'PERSON']
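In Python 3 every str is already Unicode, so the later cleanup needs no decoding step at all. A sketch of that follow-on step (the endswith() check is an assumption about the intent of the question's strip("'s"), which strips any trailing ' or s characters rather than the literal suffix):

>>> texts = [ent.text for ent in names]
>>> # drop a trailing possessive and deduplicate
>>> cleaned = list({t[:-2] if t.endswith("'s") else t for t in texts})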

alvas

I finally fixed my code. I'm surprised how simple the fix looks given how long it took me to get there, and since so many people seem puzzled by the same problem, I decided to post my answer.

Adding this small function and calling it before passing the names on for further cleaning solved my problem.

def decode(names):
    # unicode() with no explicit encoding uses the ascii codec;
    # errors='ignore' silently drops non-ASCII bytes such as \xc2\xa3 (£)
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

spaCy still thinks that £59bn is a PERSON, but that's OK with me; I can deal with it later in my code.
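To see exactly what the ignore does (a single illustrative Python 2 call; the input is the problematic entry from the output above):

>>> unicode('\xc2\xa359bn', errors='ignore')  # default ascii codec; the two £ bytes are dropped
u'59bn'

That is also why £59bn comes out as 59bn in the cleaned list below.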

The working code:

from __future__ import unicode_literals  # must be the first statement in the file

import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English

nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style']):
        s.decompose()

    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( 'div', {'class': 'story-body__inner'})]

    text = ''.join(article_soup)
    return text

# using spacy
def get_names(all_tags):
    names=[]
    for ent in all_tags.ents:
        if ent.label_=="PERSON":
            names.append(str(ent))
    return names

def decode(names):
    # unicode() with no explicit encoding uses the ascii codec;
    # errors='ignore' silently drops non-ASCII bytes such as \xc2\xa3 (£)
    decodednames = []
    for name in names:
        decodednames.append(unicode(name, errors='ignore'))
    return decodednames

def cleaning_names(names):
    new_names = [s.strip("'s") for s in names] # remove 's' from names
    myset = list(set(new_names)) #remove duplicates
    return myset

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text=get_text(url)
    text=u"{}".format(text)
    all_tags = nlp(text)
    names = get_names(all_tags)
    print "names:"
    print names
    decodednames = decode(names)
    mynewlist = cleaning_names(decodednames)
    print mynewlist

if __name__ == '__main__':
    main()

which gives me this with no errors:

names:
['Nick Clegg', 'Brexit', '\xc2\xa359bn', 'Theresa May', 'Brexit', 'Brexit', 'Mr Clegg', 'Mr Clegg', 'Mr Clegg', 'Brexit', 'Mr Clegg', 'Theresa May']
[u'Mr Clegg', u'Brexit', u'Nick Clegg', u'59bn', u'Theresa May']

aviss
  • Sure, you can simply ignore all characters that aren't ASCII, that's easy. It will probably come back to bite you later though. The proper way to do the conversion is to let the libraries do it for you, because they know the appropriate encoding and you don't. – Mark Ransom May 09 '17 at 17:11