I have only a few weeks of Python training, so I suspect there is a simple solution to this problem. After working on it for several hours, though, I'm stuck and would appreciate some help.
The website I'm trying to scrape is well organized (see https://twam2dcppennla6s.onion.to/), and the code I've written scrapes about half of the 26 pages before it fails with this error message:
    Traceback (most recent call last):
      File "SR2works4real2.py", line 18, in <module>
        csvWriter.writerows(jsonObj['vendors'])
      File "/usr/lib/python2.7/csv.py", line 154, in writerows
        return self.writer.writerows(rows)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 8: ordinal not in range(128)
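The character u'\u2122' is the trademark sign, so presumably one of the vendor fields contains it. Below is a minimal sketch that seems to trigger the same kind of error outside the scraper. The sample name and the helper try_ascii are made up for illustration and are not taken from the site or from my script.

```python
# Hypothetical sample value; the real vendor name is unknown.
sample_name = u'Acme\u2122'


def try_ascii(text):
    """Return the ASCII-encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode('ascii')
    except UnicodeEncodeError as err:
        return str(err)


# Plain ASCII text encodes fine; the trademark sign does not.
result = try_ascii(sample_name)
print(result)
```

This suggests the failure is about the ASCII codec being applied to non-ASCII text, not about the scraping itself.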
My code is:
    import urllib2, json, csv

    htmlTxt = ""
    # The URL is split around the 'start' offset, which selects the page.
    urlpart1 = 'https://twam2dcppennla6s.onion.to/vendors.php?_dc=1393967362998&start='
    pageNum = 0
    urlpart2 = '&limit=30&sort=%5B%7B%22property%22%3A%22totalFeedback%22%2C%22direction%22%3A%22DESC%22%7D%5D'
    csvFile = open('S141.csv', 'wb')
    csvWriter = csv.DictWriter(csvFile, ['name', 'vendoringTime', 'lastSeen', 'avgFeedback', 'id', 'totalFeedback', 'united', 'shipsTo', 'shipsFrom'], delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
    csvWriter.writeheader()
    # Keep requesting pages until the server returns an empty vendor list.
    while htmlTxt != "{\"vendors\":[]}":
        print("Page " + str(pageNum) + "...")
        pageNum += 30
        response = urllib2.urlopen(urlpart1 + str(pageNum) + urlpart2)
        htmlTxt = response.read()
        htmlTxt.encode('utf-8')
        jsonObj = json.loads(htmlTxt)
        # This is the call that raises the UnicodeEncodeError.
        csvWriter.writerows(jsonObj['vendors'])
        #print(str(jsonObj))
    csvFile.close()
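One possible direction, sketched under the assumption that the Python 2 csv module expects byte strings while json.loads returns unicode strings: encode every text value in a row as UTF-8 before handing the rows to writerows. The helper name encode_row and the sample vendor record are hypothetical.

```python
try:
    text_type = unicode  # Python 2, as used in the script above
except NameError:
    text_type = str      # Python 3 fallback so the sketch also runs there


def encode_row(row, encoding='utf-8'):
    """Return a copy of row with every text value encoded to bytes."""
    encoded = {}
    for key, value in row.items():
        if isinstance(value, text_type):
            value = value.encode(encoding)
        encoded[key] = value
    return encoded


# Hypothetical vendor record shaped like the entries in jsonObj['vendors'].
vendor = {'name': u'Acme\u2122', 'totalFeedback': 12}
encoded_vendor = encode_row(vendor)
print(repr(encoded_vendor['name']))
```

In the loop, the failing call would then become something like csvWriter.writerows([encode_row(v) for v in jsonObj['vendors']]). This only applies to Python 2; on Python 3 the csv module takes text directly, and the file would be opened in text mode with encoding='utf-8' instead.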
I hope there's someone out there who can help!