import urllib, urllib2
from bs4 import BeautifulSoup, Comment 
strg=""
iter=1
url='http://www.amazon.in/product-reviews/B00EOPJEYK/ref=cm_cr_pr_top_link_1?    ie=UTF8&pageNumber=1&showViewpoints=0&sortBy=bySubmissionDateDescending'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
rows =soup.find_all('div',attrs={"class" : "reviewText"})
for row in soup.find_all('div',attrs={"class" : "reviewText"}):
      strg = strg +str(iter)+"." + row.text + "\n\n"
      iter=iter+1

with open('outp.txt','w') as f:
      f.write(strg)
f.close()

I need this code to write the contents of the variable strg to the file outp.txt.

Instead I get this error:

Traceback (most recent call last):
  File "C:\Python27\demo_amazon.py", line 14, in <module>
    f.write(strg)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 226: ordinal not in range(128)

strg holds the required output, so I guess the problem is in the write statement. How can I solve this?

Kindly help.

Thank you.

Grijesh Chauhan
keshr3106

2 Answers


Well, first of all, if you want to get rid of the unicode errors, you should switch to Python 3, whose strings are unicode by default, instead of the byte strings of Python 2 that get implicitly encoded as ASCII.

That said, to get rid of the UnicodeEncodeError exception, you can encode the string explicitly before writing it:

with open('outp.txt','w') as f:
    f.write(strg.encode('utf8'))

For reference, see that question. Also, try to use unicode strings as much as possible, so that you change charsets as rarely as possible, by writing u"this is a unicode string" instead of "this is an ascii string".

Thus, in your for loop:

  strg = strg +str(iter)+"." + row.text + "\n\n"

should instead be:

  strg = strg +unicode(iter)+u"." + row.text + u"\n\n"

and strg should be defined as strg = u""

N.B.: f.close() in your code is redundant, because the with statement already takes care of closing the file when you exit the with block, through the __exit__() method of the file object.
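As an alternative to calling .encode() by hand, here is a minimal sketch that lets the file object do the encoding. It assumes io.open (available in Python 2.6+ and identical to open in Python 3) is acceptable in your setup. The helper names format_reviews and write_reviews, the sample strings, and the temporary output path are hypothetical, standing in for the scraped row.text values and outp.txt.

```python
import io
import os
import tempfile


def format_reviews(texts):
    """Number each review and join them into one unicode string."""
    return u"".join(
        u"%d.%s\n\n" % (number, text)
        for number, text in enumerate(texts, 1)
    )


def write_reviews(path, texts):
    """Write the numbered reviews to path, encoded as UTF-8."""
    # io.open encodes unicode text on write, so no manual .encode() call
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(format_reviews(texts))


# Hypothetical sample data standing in for the scraped row.text values
sample = [u"\u2022 Good battery", u"Works fine"]
out_path = os.path.join(tempfile.mkdtemp(), "outp.txt")
write_reviews(out_path, sample)
```

Using enumerate also avoids the manual counter named iter, which shadows a Python builtin.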

zmo

Basically, you have a non-ASCII character. I suggest using Unidecode, which will try to find the "closest" ASCII character to the offending one. For instance, it would turn é into e.

So you'd just do:

from unidecode import unidecode
f.write(unidecode(strg))
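If installing a third-party package is not an option, a rough stdlib-only approximation is sketched below. This is not Unidecode's algorithm: it only strips accents via NFKD normalization and silently drops anything else that has no ASCII form, so it loses more information than Unidecode would. The function name to_ascii is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Approximate unicode text with plain ASCII using only the stdlib."""
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" drops whatever still has no ASCII form
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"caf\u00e9"))    # the accented letter loses its accent
print(to_ascii(u"\u2022 item"))  # the bullet is dropped entirely
```

Either way the conversion is lossy, so it only fits cases where the output file really must be pure ASCII.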
Jared Joke