I'm having a unicode encode error with the following code for a simple web scraper.

print 'JSON scraper initializing'

from bs4 import BeautifulSoup
import json
import requests
import geocoder


# Set page variable
page = 'https://www.bandsintown.com/?came_from=257&page='
urlBucket = []
for i in range (1,3):
    uniqueUrl = page + str(i)
    urlBucket.append(uniqueUrl)

# Build response container
responseBucket = []

for i in urlBucket:
    uniqueResponse = requests.get(i)
    responseBucket.append(uniqueResponse)


# Build soup container
soupBucket = []
for i in responseBucket:
    individualSoup = BeautifulSoup(i.text, 'html.parser')
    soupBucket.append(individualSoup)


# Build events container
allSanFranciscoEvents = []
for i in soupBucket:
    script = i.find_all("script")[4]

    eventsJSON = json.loads(script.text)

    allSanFranciscoEvents.append(eventsJSON)


with open("allSanFranciscoEvents.json", "w") as writeJSON:
   json.dump(allSanFranciscoEvents, writeJSON, ensure_ascii=False)
print ('end')

The odd thing is that sometimes this code works and doesn't give an error. It has to do with the for i in range line of the code. For example, if I put in (2,4) for the range, it works fine. If I change it to (1,3), it fails with:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 12: ordinal not in range(128)

Can anyone tell me how to fix this issue within my code? If I print allSanFranciscoEvents, it is reading in all the data, so I believe the issue is happening in the final piece of code, with the JSON dump. Thanks so much.

DiamondJoe12

2 Answers

1

Best Fix

Use Python 3! Python 2 is going EOL very soon. New code written in legacy Python today will have a very short shelf life.

The only thing I had to change to make your code work in Python 3 was to call the print() function instead of using the print statement. Your example code then worked without any error.

Persisting with Python 2

The odd thing is that sometimes this code works and doesn't give an error. It has to do with the for i in range line of the code. For example, if I put in (2,4) for the range, it works fine.

That is because you are requesting different pages with those different ranges, and not every page has a character that can't be converted to str using the ascii codec. I had to go to page 5 of the response to get the same error that you did. In my case, it was the artist name, u'Mø', that caused the issue. So here's a one-liner that reproduces the issue:

>>> str(u'Mø')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 0: ordinal not in range(128)

Your error explicitly singles out the character u'\xe9':

>>> str(u'\xe9')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

Same issue, just a different character: Latin small letter e with acute (U+00E9). Python is trying to use the default encoding, 'ascii', to convert the Unicode string to str, but ASCII only covers code points 0 to 127, so it has no way to represent this one.
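A minimal sketch of why the conversion fails, written so that it should run on both Python 2.7 and Python 3. The variable names are illustrative only.

```python
text = u'\xe9'  # LATIN SMALL LETTER E WITH ACUTE

# The code point is 233, outside the 0-127 range that ASCII covers.
code_point = ord(text)

try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every code point; this one takes two bytes.
utf8_bytes = text.encode('utf-8')

print('%d %s %d' % (code_point, ascii_failed, len(utf8_bytes)))
```

The same character that the 'ascii' codec rejects encodes cleanly once a codec that covers it is named explicitly.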

I believe the issue is happening in the final piece of code, with the JSON dump.

Yes, it is:

>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9', f, ensure_ascii=False)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 190, in dump
    fp.write(chunk)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

And from the traceback, you can see that it's actually coming from writing to the file (fp.write(chunk)).

file.write() writes a string to a file, but u'\xe9' is a unicode object. The error message 'ascii' codec can't encode character... tells us that Python is trying to encode that unicode object to turn it into a str, so it can write it to the file. That implicit conversion uses the "default string encoding", which in Python 2 is 'ascii'.

To fix, don't leave it up to python to use the default encoding:

>>> with open('tmp.json', 'w') as f:
...     json.dump(u'\xe9'.encode('utf-8'), f, ensure_ascii=False)
...
# No error :)

In your specific example, you can fix the intermittent error by changing this:

allSanFranciscoEvents.append(eventsJSON)

to this:

allSanFranciscoEvents.append(eventsJSON.encode('utf-8'))

That way, you are explicitly using the 'utf-8' codec to convert the Unicode strings to str, so that python doesn't try to apply the default encoding, 'ascii' when writing to the file.
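Since json.loads can return lists and dicts rather than a single string, another option is to keep the data as Unicode and let the file object do the encoding. The following is only a sketch under that assumption, not the method described above: it uses io.open with an explicit encoding, which is available on both Python 2.7 and Python 3. The helper name dump_utf8 and the sample data are hypothetical.

```python
import io
import json
import os
import tempfile


def dump_utf8(data, path):
    """Serialize data to JSON and write it to path as UTF-8 text."""
    text = json.dumps(data, ensure_ascii=False)
    # On Python 2, json.dumps may return a byte string when the data
    # contains no unicode objects; io.open requires text, so decode it.
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)


# Hypothetical sample data standing in for the scraped events.
events = [[{u'name': u'M\xf8', u'venue': u'Caf\xe9 du Nord'}]]
path = os.path.join(tempfile.mkdtemp(), 'events.json')
dump_utf8(events, path)

# Read the file back with the same explicit encoding.
with io.open(path, 'r', encoding='utf-8') as handle:
    loaded = json.load(handle)
```

Naming the encoding at the point where the file is opened means no write ever falls back to the 'ascii' default, whatever characters a given page happens to contain.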

SuperShoot
  • Thank you for the help, all. I've tried both of these solutions but unfortunately I can't get either working 100%. With this solution: allSanFranciscoEvents.append(eventsJSON.encode('utf-8')) - if I put this line of code in the 'for i in soupBucket' loop, I get an error: 'list' object has no attribute 'encode'. The first solution presented above - json.dump(unicode(allSanFranciscoEvents), writeJSON, ensure_ascii=False) - works, but it embeds the 'u tag in all of my data, which is present even when I import the JSON into a javascript framework. Any suggestions? Thanks again ! – DiamondJoe12 Feb 04 '19 at 04:54
  • That's because in using the `unicode()` solution you are writing the `__repr__()` of the `Unicode` instance to the json file, which includes the prefix `'u'`. Your example above with the one-line change I note above works for me, so I can't help you any further. – SuperShoot Feb 04 '19 at 06:40
0

eventsJSON is a parsed Python object, not a string, so it has no encode method and eventsJSON.encode('utf-8') fails. In Python 2.7, to write the file as UTF-8 you can use the codecs module, or open the file in binary mode with the wb flag.

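with open("allSanFranciscoEvents.json", "wb") as writeJSON:
   # json.dumps defaults to ensure_ascii=True, so non-ASCII characters
   # are escaped as \uXXXX and jsStr contains only ASCII
   jsStr = json.dumps(allSanFranciscoEvents)
   # decode() turns the byte string into unicode; because the text is
   # pure ASCII, the implicit re-encoding on write cannot fail
   writeJSON.write(jsStr.decode('utf-8'))
print ('end')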

# and read it normally
with open("allSanFranciscoEvents.json", "r") as readJson:
    data = json.load(readJson)
    print(data[0][0]["startDate"])
    # 2019-02-04
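This version avoids the error because json.dumps defaults to ensure_ascii=True, so every non-ASCII character is written as a \uXXXX escape and the resulting string is pure ASCII. A small sketch of that behaviour, using made-up sample data rather than anything from the site:

```python
import json

sample = [{u'artist': u'M\xf8'}]  # hypothetical data, not from the site

escaped = json.dumps(sample)                      # default: ensure_ascii=True
literal = json.dumps(sample, ensure_ascii=False)  # keeps the raw character

# The escaped form contains only ASCII, so encoding it with the 'ascii'
# codec succeeds and writing it to a file cannot raise the error.
escaped.encode('ascii')

# Both forms parse back to the same data.
assert json.loads(escaped) == json.loads(literal) == sample
```

The trade-off is readability of the file: with the default setting the artist name is stored as M\u00f8, which any JSON parser turns back into the original character on load.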
cieunteung
  • Ok. But then, if I use: json.dump(unicode(allSanFranciscoEvents), writeJSON, ensure_ascii=False) How do I deal with those 'u tags? Any way to easily remove them? – DiamondJoe12 Feb 04 '19 at 20:00
  • Updated the answer, take a look. The previous answer was not correct. – cieunteung Feb 04 '19 at 20:36
  • Thank you ! This is appearing to work now. What made it finally work? Was it the jsStr? And what is the functional difference between 'encode' and 'decode'? – DiamondJoe12 Feb 05 '19 at 05:06
  • As I said above, writing the file in unicode solves the problem. jsStr is the JSON string converted from the object. For the difference between encode and decode, read [here](https://stackoverflow.com/questions/447107/) – cieunteung Feb 05 '19 at 06:57
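To make the encode/decode distinction asked about above concrete: encode turns text (unicode in Python 2, str in Python 3) into bytes, and decode turns bytes back into text. A minimal round-trip sketch with an illustrative string:

```python
text = u'Caf\xe9'                   # text: a sequence of code points

encoded = text.encode('utf-8')      # text -> bytes
decoded = encoded.decode('utf-8')   # bytes -> text

# UTF-8 needs two bytes for the accented letter, so the byte string is
# one element longer than the text.
print('%d characters, %d bytes' % (len(text), len(encoded)))
```

Files store bytes, so text has to be encoded at some point before it is written; the error in the question appears when Python 2 makes that choice implicitly with the 'ascii' codec.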