
I thought I had this, but then it all fell apart. I'm starting a scraper that pulls data from a Chinese website. When I isolate and print the elements I am looking for, everything works fine ("print element" and "print text"). However, when I add those elements to a dictionary, append it to a list, and then print the list (print holder), everything goes all "\x85\xe6\xb0" on me. Trying to .encode('utf-8') as part of the appending process just throws new errors. This may not ultimately matter because the data is just going to be dumped into a CSV, but it makes troubleshooting really hard. What am I doing when I add the element to the dictionary that messes up the encoding?

thanks!

from bs4 import BeautifulSoup
import urllib
#csv is for the csv writer
import csv

#intended data structure is list of dictionaries
# holder = [{'headline': TheHeadline, 'url': TheURL, 'date1': Date1, 'date2': Date2, 'date3': Date3}, {'headline': TheHeadline, 'url': TheURL, 'date1': Date1, 'date2': Date2, 'date3': Date3}]


#initializes the list that holds the output dictionaries

holder = []

txt_contents = "http://sousuo.gov.cn/s.htm?q=&n=80&p=&t=paper&advance=true&title=&content=&puborg=&pcodeJiguan=%E5%9B%BD%E5%8F%91&pcodeYear=2016&pcodeNum=&childtype=&subchildtype=&filetype=&timetype=timeqb&mintime=&maxtime=&sort=pubtime&nocorrect=&sortType=1"

#opens the output doc
output_txt = open("output.txt", "w")

def headliner(url):


    #opens the url for read access
    this_url = urllib.urlopen(url).read()
    #creates a new BS holder based on the URL
    soup = BeautifulSoup(this_url, 'lxml')

    #creates the headline section
    headline_text = ''
    #this bundles all of the headlines
    headline = soup.find_all('h3')
    #for each individual headline....
    for element in headline:
            headline_text += ''.join(element.findAll(text = True)).encode('utf-8').strip()
            #this is necessary to turn the findAll output into text
            print element
            text = element.text.encode('utf-8')
            #prints each headline
            print text
            print "*******"
            #creates the dictionary for just that headline
            temp_dict = {}
            #puts the headline in the dictionary
            temp_dict['headline'] = text

            #appends the temp_dict to the main list
            holder.append(temp_dict)

            output_txt.write(str(text))
            #output_txt.write(holder)

headliner(txt_contents)
print holder

output_txt.close()
mweinberg

1 Answer


The encoding isn't being messed up. You are just seeing different representations of the same string:

>>> s = '漢字'
>>> s
'\xe6\xbc\xa2\xe5\xad\x97'
>>> print(s)
漢字
>>> s.__repr__()
"'\\xe6\\xbc\\xa2\\xe5\\xad\\x97'"
>>> s.__str__()
'\xe6\xbc\xa2\xe5\xad\x97'
>>> print(s.__repr__())
'\xe6\xbc\xa2\xe5\xad\x97'
>>> print(s.__str__())
漢字

The last piece of the puzzle is that when you put an object in a container, the container's own string representation is built from the repr of each object it contains, not from its str:

>>> ls = [s]
>>> print(ls)
['\xe6\xbc\xa2\xe5\xad\x97']
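
The same rule can be checked directly. This is a minimal sketch that assumes Python 3, where text (`str`) and `bytes` are separate types; the session above is Python 2, where a plain `str` holds bytes. The names `text`, `raw`, and `readable` are illustrative and do not come from the question.

```python
# Assumes Python 3. All names here are hypothetical, for illustration only.
text = "漢字"
raw = text.encode("utf-8")  # UTF-8 bytes, like the result of .encode('utf-8')

# str() of a list is assembled from repr() of each element.
assert str([raw]) == "[" + repr(raw) + "]"
assert str([text]) == "[" + repr(text) + "]"

# repr() of bytes escapes every non-ASCII byte...
raw_repr = repr(raw)    # b'\xe6\xbc\xa2\xe5\xad\x97'
# ...while repr() of text keeps printable characters as they are.
text_repr = repr(text)  # '漢字'


def readable(items):
    """Join items as text, decoding any bytes as UTF-8 first."""
    return ", ".join(
        item.decode("utf-8") if isinstance(item, bytes) else str(item)
        for item in items
    )
```

Calling `print(readable([raw, text]))` shows the characters themselves, because each element is converted to text individually instead of going through the list's own representation.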

Perhaps it will become clearer if we define our own custom object:

>>> class A(object):
...     def __str__(self):
...         return "str"
...     def __repr__(self):
...         return "repr"
...
>>> A()
repr
>>> print(A())
str
>>> ayes  = [A() for _ in range(5)]
>>> ayes
[repr, repr, repr, repr, repr]
>>> print(ayes[0])
str
>>>
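
For troubleshooting a list of dictionaries like `holder`, one possible workaround is to serialize it with the standard `json` module instead of printing the list directly. This is only a sketch of an alternative technique, not something the question's code does: it assumes Python 3 and text (not byte) values, and `holder` below is a hypothetical stand-in for the scraped data.

```python
import json

# Hypothetical stand-in for the question's `holder` list of dictionaries.
holder = [{"headline": "漢字"}, {"headline": "国务院"}]


def dump_readable(records):
    """Serialize records to JSON text, leaving non-ASCII characters unescaped."""
    # ensure_ascii=False keeps characters such as 漢 as-is in the output.
    return json.dumps(records, ensure_ascii=False, indent=2)


debug_text = dump_readable(holder)
```

`print(debug_text)` then shows the headlines as readable characters, and `json.loads(debug_text)` gives back an equal structure, so nothing is lost by inspecting the data this way.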
juanpa.arrivillaga
  • If you're using unicode literals (`s = u'漢字'`), you'll get a `UnicodeEncodeError` if you do `s.__str__()`, but `__repr__` gives you the encoding and `print` formats it as you would expect. – TemporalWolf Apr 06 '17 at 00:59
  • Thanks! Does that mean that there isn't a way to make the print(ls) actually print(ls.__str__())? – mweinberg Apr 06 '17 at 01:01
  • @mweinberg it *is* printing `ls.__str__()`, it's just that `ls.__str__()` is using the `__repr__` of the objects it contains to construct the string! – juanpa.arrivillaga Apr 06 '17 at 01:02
  • @TemporalWolf yeah, I think that has something to do with the default encoding in Python 2 being `ascii`. It's been a while since I've used Python 2, so I don't remember the details exactly. – juanpa.arrivillaga Apr 06 '17 at 01:02
  • @TemporalWolf see [this](http://stackoverflow.com/a/17628350/5014455) answer, if you set the default encoding to `utf8` then `s.__str__()` shouldn't give you an error. – juanpa.arrivillaga Apr 06 '17 at 01:04
  • Gotcha. Maybe the better way to ask it is "Does that mean that there isn't a way to make the print(ls) actually print(ls.__str__()) without using the __repr__?" And my sense from your answer is "yes". It is, however, good to know that the info is at least in there. – mweinberg Apr 06 '17 at 01:05
  • @mweinberg yeah, there might be some roundabout hacky way, but likely no easy way. – juanpa.arrivillaga Apr 06 '17 at 01:06
  • "It's been a while since I've used Python 2" - does Python 3 fix this kind of thing? – mweinberg Apr 06 '17 at 01:06
  • @juanpa.arrivillaga Yeah, [Python 2 str vs unicode str is strange](http://stackoverflow.com/questions/18034272/python-str-vs-unicode-types) :) – TemporalWolf Apr 06 '17 at 01:07
  • @mweinberg No, you'll still get the `__repr__` of objects in a container in the container's `__str__`, but it makes working with unicode and some of the things we were discussing with @TemporalWolf much, much smoother. – juanpa.arrivillaga Apr 06 '17 at 01:08
  • If you **have** to have it... this works for me `print u"{{{}}}".format(u', '.join([u"{}: {}".format(key, value) for key, value in dct.iteritems()]))`... we said it was hacky ;) – TemporalWolf Apr 06 '17 at 01:12
  • hahahahahahaha. Yes, that is a bit on the hacky side. I think I am convinced that I should just find another way to troubleshoot. Thanks both of you! – mweinberg Apr 06 '17 at 01:17