
I'm trying to identify and save all of the headlines on a specific site, and keep getting what I believe to be encoding errors.

The site is: http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm

The current code is:

import urllib
from bs4 import BeautifulSoup

holder = {}

url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()

soup = BeautifulSoup(url, 'lxml')

head1 = soup.find_all(['h1','h2','h3'])

print head1

holder["key"] = head1

The output of the print is:

[<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>]

I'm reasonably certain that those are Unicode escape sequences, but I haven't been able to figure out how to convince Python to display them as the actual characters.

I have tried to find the answer elsewhere. The question that seemed most clearly on point was this one: Python and BeautifulSoup encoding issues

which suggested adding

soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8','ignore'))

However, that gave me the same error that is mentioned in a comment there ("AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'"). Removing the second '.BeautifulSoup' resulted in a different error ("RuntimeError: maximum recursion depth exceeded while calling a Python object").

I also tried the answer suggested here: Chinese character encoding error with BeautifulSoup in Python?

by breaking up the creation of the object:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.515fa.com/che_1978.html")
content = html.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(content)

but that also generated the recursion error. Any other tips would be most appreciated.

Thanks.

user5356756
  • I had the same problem and tried this one; it works: https://stackoverflow.com/a/65354890/20294353 – semui Oct 20 '22 at 17:52

2 Answers


Decode using unicode-escape:

In [6]: from bs4 import BeautifulSoup

In [7]: h = """<h3>\u73af\u5883\u6c61\u67d3\u6700\u5c0f\u5316 \u8d44\u6e90\u5229\u7528\u6700\u5927\u5316</h3>, <h1>\u5929\u6d25\u6ee8\u6d77\u65b0\u533a\uff1a\u697c\u5728\u666f\u4e2d \u5382\u5728\u7eff\u4e2d</h1>, <h2></h2>"""

In [8]: soup = BeautifulSoup(h, 'lxml')

In [9]: print(soup.h3.text.decode("unicode-escape"))
环境污染最小化 资源利用最大化
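(Editor's note: a minimal sketch of the same step assuming Python 3, where str has no .decode method, so the text has to round-trip through bytes first; the sample string below is taken from the output above, not from the original answer:)

```python
# Sketch assuming Python 3: str has no .decode(), so encode to bytes,
# then decode those bytes with the unicode-escape codec.
s = r"\u73af\u5883\u6c61\u67d3"  # literal backslash-u sequences, as scraped
decoded = s.encode("ascii").decode("unicode-escape")
print(decoded)  # → 环境污染
```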

If you look at the source you can see the data is utf-8 encoded:

<meta http-equiv="content-language" content="utf-8" />

For me, using bs4 4.4.1, just decoding what urllib returns also works fine:

In [1]: from bs4 import BeautifulSoup

In [2]: import urllib

In [3]: url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()

In [4]: soup = BeautifulSoup(url.decode("utf-8"), 'lxml')

In [5]: print(soup.h3.text)
环境污染最小化 资源利用最大化
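(Editor's note: the decode-before-parsing step above can be sketched without the network, assuming Python 3, where urllib.urlopen has moved to urllib.request.urlopen; inline UTF-8 bytes stand in for what .read() would return:)

```python
# Sketch assuming Python 3: the page body arrives as UTF-8 bytes.
raw = "<h3>环境污染最小化</h3>".encode("utf-8")  # stand-in for urlopen(...).read()
html = raw.decode("utf-8")  # decode to str before handing to BeautifulSoup
print(html)
```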

When you are writing to a CSV you will want to encode the data to a utf-8 str:

 .decode("unicode-escape").encode("utf-8")

You can do the encode when you save the data in your dict.
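(Editor's note: for instance, a sketch assuming Python 3, where the decode/encode dance is unnecessary because str is already Unicode; the filename and dict key here are made up, not from the thread:)

```python
import csv

# Hypothetical example: holder maps a made-up key to already-decoded text.
holder = {"headline": "环境污染最小化 资源利用最大化"}

# On Python 3, the csv module writes Unicode text directly; open the file
# in text mode with an explicit encoding and newline="" as the docs advise.
with open("rrs_csv.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=holder.keys())
    w.writeheader()
    w.writerow(holder)
```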

Padraic Cunningham
  • Oh man this is so close! This works to print the text, which gives me hope that that data is correct. However, when I tried to add it to the dictionary it reverted back to the unicode. I broke up step 9 a bit so `g = soup.h3.text.encode("utf-8").decode("unicode-escape")` and then `print(g)`. That worked fine. But when I tried to add g to the dictionary called holder: `holder["key"] = g` and then `print holder` I got the unicode output again. Eventually I want to output the dictionary to CSV, and I want to make sure it is right through the chain. – user5356756 May 08 '16 at 20:56
  • @user5356756, that is just the repr representation http://stackoverflow.com/questions/1436703/difference-between-str-and-repr-in-python, try printing the values themselves from the dict and you should see the same. Also as per the end of the answer, you should really upgrade to bs4 – Padraic Cunningham May 08 '16 at 20:57
  • gotcha, thank you! That works. I'm running into trouble transferring the dictionary to a csv using dictwriter, but that's well beyond the scope of this question so I'll do some research and open a new one if need be. As for bs4, the first line of my script (which I didn't reproduce above) is `from bs4 import BeautifulSoup`. Is there something beyond that that I need to do in order to switch from 3 to 4? – user5356756 May 08 '16 at 21:14
  • ah ok, so you are using bs4, I was unsure when I saw `BeautifulSoup.BeautifulSoup` , out of interest if you `print(bs4.__version__)` what do you see? – Padraic Cunningham May 08 '16 at 21:16
  • I added that line to the bottom of my script and got the following error: `NameError: name 'bs4' is not defined` – user5356756 May 08 '16 at 21:24
  • You need to import bs4. Also to write to a file you should `.encode("utf-8")` – Padraic Cunningham May 08 '16 at 21:25
  • sorry - 4.4.1. Is it safe to assume that the `UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-14: ordinal not in range(128)` I get when I try and write the dict to a CSV with dictwriter is flowing from this same issue? – user5356756 May 08 '16 at 21:30
  • You should actually just need `.decode("unicode-escape").encode("utf-8")` each unicode string to write to the csv – Padraic Cunningham May 08 '16 at 21:32
  • sorry, I don't quite understand. if this is the csv code, where does the .decode... go? `with open('rrs_csv.csv', 'wb') as f: w = csv.DictWriter(f, holder.keys()) w.writeheader() w.writerow(holder)` – user5356756 May 08 '16 at 21:39
  • When you create the dict, replace the earlier logic with `h3.text.decode("unicode-escape").encode("utf-8")` etc... when you store the values and your will have no problem writing with the csv lib. We don't need the initial encode, I should have written it in reverse – Padraic Cunningham May 08 '16 at 21:40
  • 1
    OH MY GOD YOU ARE THE KING OF PYTHON THANK YOU – user5356756 May 08 '16 at 21:44

This may provide a pretty simple solution; I'm not sure it does absolutely everything you need, though, so let me know:

import urllib
from bs4 import BeautifulSoup

holder = {}

url = urllib.urlopen('http://paper.people.com.cn/rmrb/html/2016-05/06/nw.D110000renmrb_20160506_2-01.htm').read()

soup = BeautifulSoup(url, 'lxml')

head1 = soup.find_all(['h1','h2','h3'])

print unicode(head1)

holder["key"] = head1

Reference: Python 2.7 Unicode

Josh Rumbut
  • Thanks! Unfortunately, that gave me the exact same output as before so I've still got u1234 instead of characters. – user5356756 May 08 '16 at 19:33