I'm trying to identify and save all of the headlines on a specific site, and I keep getting what I believe to be encoding errors.
The site is: http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm
The current code is:

import urllib
from bs4 import BeautifulSoup

holder = {}
# fetch the page and parse it with lxml
url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()
soup = BeautifulSoup(url, 'lxml')
# grab every h1/h2/h3 tag on the page
head1 = soup.find_all(['h1', 'h2', 'h3'])
print head1
holder["key"] = head1
The output of the print is:
[<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>]
I'm reasonably certain that those are Unicode escape sequences, but I haven't been able to figure out how to get Python to display them as the characters themselves.
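I suspect part of what I'm seeing is just how Python 2 reprs Unicode strings when they're inside a list. A minimal self-contained sketch of what I mean (the string is the first four characters of the h3 text from the output above):

```python
# -*- coding: utf-8 -*-
# Printing a *list* shows each element's repr, which under Python 2 uses
# \u escapes; printing the string directly renders the characters
# (assuming a terminal that can display them).
s = u'\u73af\u5883\u6c61\u67d3'  # first four characters of the h3 above
print([s])  # list repr: \u escapes under Python 2
print(s)    # the characters themselves
```

So presumably the find_all result being a list is at least part of why I see escapes instead of characters.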
I have tried to find the answer elsewhere. The question that seemed most clearly on point was this one: Python and BeautifulSoup encoding issues
which suggested adding
soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))
However, that gave me the same error mentioned in a comment there ("AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'"). Removing the second '.BeautifulSoup' resulted in a different error ("RuntimeError: maximum recursion depth exceeded while calling a Python object").
I also tried the answer suggested here: Chinese character encoding error with BeautifulSoup in Python?
which breaks up the creation of the object:
import urllib2

html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')  # decode to Unicode before parsing
soup = BeautifulSoup(content)
but that also generated the recursion error. Any other tips would be most appreciated.
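In case it helps clarify the goal: what I'm ultimately after is the heading text stored as plain strings in the dict, roughly like this ("key" is just a placeholder, and the two strings are copied from the non-empty headings in the output above):

```python
# -*- coding: utf-8 -*-
# Desired end state: headline text (not tag objects) stored in the dict.
holder = {}
holder["key"] = [
    u'\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316',
    u'\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d',
]
for headline in holder["key"]:
    print(headline)  # should render the Chinese characters, not \u escapes
```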
Thanks.