I thought I had this, but then it all fell apart. I'm starting a scraper that pulls data from a Chinese website. When I isolate and print the elements I'm looking for, everything works fine ("print element" and "print text"). However, when I add those elements to a dictionary, append it to a list, and then print the list (print holder), the text comes out as escape sequences like "\x85\xe6\xb0". Trying to .encode('utf-8') as part of the appending process just throws new errors. This may not ultimately matter because the data is just going to be dumped into a CSV, but it makes troubleshooting really hard. What am I doing when I add the element to the dictionary that messes up the encoding?
Thanks!
from bs4 import BeautifulSoup
import urllib
#csv is for the csv writer
import csv
#intended data structure is list of dictionaries
# holder = [{'headline': TheHeadline, 'url': TheURL, 'date1': Date1, 'date2': Date2, 'date3':Date3}, {'headline': TheHeadline, 'url': TheURL, 'date1': Date1, 'date2': Date2, 'date3':Date3}]
#initializes the list that holds the output dictionaries
holder = []
txt_contents = "http://sousuo.gov.cn/s.htm?q=&n=80&p=&t=paper&advance=true&title=&content=&puborg=&pcodeJiguan=%E5%9B%BD%E5%8F%91&pcodeYear=2016&pcodeNum=&childtype=&subchildtype=&filetype=&timetype=timeqb&mintime=&maxtime=&sort=pubtime&nocorrect=&sortType=1"
#opens the output doc
output_txt = open("output.txt", "w")
def headliner(url):
    #opens the url for read access
    this_url = urllib.urlopen(url).read()
    #creates a new BS holder based on the URL
    soup = BeautifulSoup(this_url, 'lxml')
    #creates the headline section
    headline_text = ''
    #this bundles all of the headlines
    headline = soup.find_all('h3')
    #for each individual headline....
    for element in headline:
        headline_text += ''.join(element.findAll(text = True)).encode('utf-8').strip()
        #this is necessary to turn the findAll output into text
        print element
        text = element.text.encode('utf-8')
        #prints each headline
        print text
        print "*******"
        #creates the dictionary for just that headline
        temp_dict = {}
        #puts the headline in the dictionary
        temp_dict['headline'] = text
        #appends the temp_dict to the main list
        holder.append(temp_dict)
        output_txt.write(str(text))
        #output_txt.write(holder)
headliner(txt_contents)
print holder
output_txt.close()
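
For what it's worth, here is a stripped-down sketch of the symptom with no scraping involved. The sample string and the names `sample`, `text`, and `holder_demo` are made up for illustration. It assumes the escapes come from the list being printed through the `repr()` of its items, not from the dictionary changing the data.

```python
# Illustrative sketch only: `sample`, `text` and `holder_demo` are hypothetical
# names and are not part of the scraper above.

# A short Chinese string, written with escapes so this snippet stays ASCII
sample = u'\u56fd\u52a1\u9662'

# Same step as in the scraper: encode the text to UTF-8 bytes
text = sample.encode('utf-8')

# Store the bytes in a list of dictionaries, like `holder`
holder_demo = [{'headline': text}]

# Printing a container shows the repr() of each item, so the bytes
# appear as \x.. escapes even though the stored data is unchanged
print(repr(holder_demo))

# The stored bytes still decode back to the original text
print(holder_demo[0]['headline'].decode('utf-8') == sample)
```

If that assumption holds for the scraper too, the data headed for the CSV should be intact and only the debug printout of the whole list is misleading.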