I only have a few weeks of Python training, so I suspect there's a simple solution to this problem. Still, it's quite frustrating for me, and after working on it for several hours I'm asking for help!

The website I'm trying to scrape is well organized (see https://twam2dcppennla6s.onion.to/), and the code I've written scrapes about half of the 26 pages before failing with this error message:

Traceback (most recent call last):
  File "SR2works4real2.py", line 18, in <module>
    csvWriter.writerows(jsonObj['vendors'])
  File "/usr/lib/python2.7/csv.py", line 154, in writerows
    return self.writer.writerows(rows)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 8: ordinal not in range(128)

My code is:

import urllib2, json,csv
htmlTxt=""


urlpart1='https://twam2dcppennla6s.onion.to/vendors.php?_dc=1393967362998&start='
pageNum=0   
urlpart2='&limit=30&sort=%5B%7B%22property%22%3A%22totalFeedback%22%2C%22direction%22%3A%22DESC%22%7D%5D'
csvFile=open('S141.csv','wb')
csvWriter=csv.DictWriter(csvFile,['name','vendoringTime','lastSeen','avgFeedback','id','totalFeedback','united','shipsTo','shipsFrom'],delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
csvWriter.writeheader()

while htmlTxt != "{\"vendors\":[]}":
    print("Page "+str(pageNum)+"...")
    pageNum+=30
    response=urllib2.urlopen((urlpart1)+str(pageNum)+(urlpart2))
    htmlTxt=response.read()
    htmlTxt.encode('utf-8')
    jsonObj=json.loads(htmlTxt)
    csvWriter.writerows(jsonObj['vendors'])

    #print(str(jsonObj))

csvFile.close()

I hope there's someone out there that can help!

Isak
    You're probably going to need to [`decode`](http://docs.python.org/2/library/stdtypes.html#str.decode) the characters that won't fit into ASCII. The [`csv`](http://docs.python.org/2/library/csv.html) module says it doesn't support Unicode, but it mentions that you can use utf8 so perhaps there's something else you could do with your `encode`-ing (but I'm not sure what). – 2rs2ts Mar 04 '14 at 22:19

2 Answers

That is the Unicode trade mark sign (™, U+2122): http://www.marathon-studios.com/unicode/U2122/Trade_Mark_Sign

Since you're scraping the web, you'll likely see many more errors of this type, so replacing this one character might work for this page but not for others with different symbols.

The csv module is converting your Unicode text to ASCII before writing it. I'd recommend doing that conversion yourself before handing the text over, so that you control the cleanup. That is, instead of

htmlTxt.encode('utf-8')

do

htmlTxt.encode('ascii', 'ignore')

Then check the resulting text to see whether it is acceptable for your purposes.

EDIT

Here's my output in Python 3:

>>> u'\u2122'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u2122' in position 0: ordinal not in range(128)
>>> u'\u2122'.encode('ascii', 'ignore')
b''

and Python 2.6:

>>> u'\u2122'.encode('ascii')
Traceback (most recent call last):
  File "<pyshell#92>", line 1, in <module>
    u'\u2122'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 0: ordinal not in range(128)
>>> u'\u2122'.encode('ascii', 'ignore')
''
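
One detail worth checking when trying this: `encode` returns a new object and leaves the original string untouched, so a bare `htmlTxt.encode(...)` statement has no effect unless its result is assigned to something. A minimal sketch, using a made-up sample string rather than the real page content:

```python
# Hypothetical sample value; not taken from the scraped site.
text = u'Acme\u2122'

# Calling encode() without using the result changes nothing.
text.encode('ascii', 'ignore')
unchanged = text

# The encoded bytes only exist if the return value is kept.
cleaned = text.encode('ascii', 'ignore')
```

Also note that the traceback in the question points at `writerows(jsonObj['vendors'])`, so it may be the values in the parsed JSON, not `htmlTxt` itself, that need this treatment.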
Russia Must Remove Putin
  • Thanks so much for helping out! I actually got a similar error message after doing this: Traceback (most recent call last): File "SR2works4real2.py", line 18, in csvWriter.writerows(jsonObj['vendors']) File "/usr/lib/python2.7/csv.py", line 154, in writerows return self.writer.writerows(rows) UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 8: ordinal not in range(128) – Isak Mar 04 '14 at 22:53
  • @user3343907 I can see you're new to the site, Welcome to Stackoverflow, if you accept an answer by clicking the checkmark next to it, then you will automatically get +2 to your rep. – Russia Must Remove Putin Mar 04 '14 at 22:54
  • @user3343907 did you use the 'ignore' flag? this works for me in Python 2.6 and 3.3. – Russia Must Remove Putin Mar 04 '14 at 22:57
  • Hi! I keep getting the same error, after including htmlTxt.encode('ascii', 'ignore') and u'\u2122'.encode('ascii', 'ignore') in the code =( – Isak Mar 05 '14 at 15:16
  • This is the error message I get after adding the stuff in the comment above: Traceback (most recent call last): File "SR2.py", line 19, in csvWriter.writerows(jsonObj['vendors']) File "/usr/lib/python2.7/csv.py", line 154, in writerows return self.writer.writerows(rows) UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 8: ordinal not in range(128) – Isak Mar 05 '14 at 15:29
  • The other solution works, thanks a lot for the trouble Aaron. – Isak Mar 06 '14 at 03:57

The strings in jsonObj will be of type unicode, because Python's json module produces unicode strings. Your csv writer wants everything as type str. In Python 2.7 it will try to convert unicode to str automatically, assuming ASCII. This will of course fail if the unicode string contains non-ASCII characters.
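
The failure can be reproduced without the website. The sketch below uses a made-up one-vendor payload (the name `Acme` is hypothetical) shaped like the response in the question, and shows that the parsed value cannot be encoded as ASCII but can be encoded as UTF-8:

```python
import json

# Hypothetical one-vendor payload shaped like the response in the question.
payload = '{"vendors": [{"name": "Acme\\u2122"}]}'
name = json.loads(payload)['vendors'][0]['name']

# The JSON escape \u2122 becomes a real non-ASCII character after parsing.
has_trademark = u'\u2122' in name

# Encoding that text as ASCII is the step that raises UnicodeEncodeError.
try:
    name.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent the character, so this succeeds.
utf8_name = name.encode('utf8')
```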

The simplest fix would be to change this line:

csvWriter.writerows(jsonObj['vendors'])

to encode the unicode into UTF-8 str just before sending it to the csv writer. jsonObj['vendors'] is a list of dictionaries with unicode keys and values, so we can do this:

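# jsonObj['vendors'] is a list of dicts with unicode keys and values
unicode_vendors = jsonObj['vendors']
str_vendors = []
for unicode_dict in unicode_vendors:
    str_dict = {}
    for key, value in unicode_dict.items():
        # Encode to UTF-8 byte strings; None and other falsy values pass through as-is
        str_dict[key.encode('utf8')] = value.encode('utf8') if value else value
    str_vendors.append(str_dict)
csvWriter.writerows(str_vendors)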
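
A variant of the same idea, wrapped in a helper so it can be tried on its own. This is only a sketch: `encode_row`, `text_type` and the sample row are hypothetical names and data, and it assumes that any non-text value (None, numbers) should pass through unchanged.

```python
text_type = type(u'')  # unicode on Python 2, str on Python 3

def encode_row(row, encoding='utf8'):
    """Return a copy of row with text keys and values encoded to bytes."""
    encoded = {}
    for key, value in row.items():
        if isinstance(key, text_type):
            key = key.encode(encoding)
        if isinstance(value, text_type):
            value = value.encode(encoding)
        # None, numbers and already-encoded values pass through unchanged.
        encoded[key] = value
    return encoded

# Hypothetical sample row; the real rows come from jsonObj['vendors'].
sample = {u'name': u'Acme\u2122', u'id': 7, u'shipsTo': None}
encoded_sample = encode_row(sample)
```

With this helper the call would become `csvWriter.writerows([encode_row(v) for v in jsonObj['vendors']])`. It is aimed at Python 2.7, where the csv writer expects byte strings; on Python 3 the csv module accepts text directly, so this encoding step should not be needed there.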
Heikki Toivonen
  • Thanks for helping! I actually got an error message after doing this: Traceback (most recent call last): File "SR2works4real2.py", line 18, in csvWriter.writerows(jsonObj['vendors'].encode('utf8')) AttributeError: 'list' object has no attribute 'encode' – Isak Mar 04 '14 at 22:56
  • Ah, I didn't realize it was a list. I modified the answer which should fix the issue. – Heikki Toivonen Mar 05 '14 at 00:58
  • Thanks! I tried it and got this error message: Traceback (most recent call last): File "SR2works4real3.py", line 20, in str_vendors = [s.encode('utf8') for s in unicode_vendors] AttributeError: 'dict' object has no attribute 'encode' – Isak Mar 05 '14 at 16:06
  • Ok, I had to run your app to find out what data was in jsonObj['vendors']. It is a list of dictionaries where keys and values are unicode strings, and some values were also None. You could have easily found this out by printing some of those and included those in the question. Anyway, after running the app with those changes it seems to be working for me using Python 2.7.6 on Mac. – Heikki Toivonen Mar 05 '14 at 22:15
  • Fantastic, it works! Thanks a lot for your help Heikki! – Isak Mar 06 '14 at 01:24