Beautifulsoup special character parsing error

Question

I am using Beautiful Soup and urllib2 to collect content from the internet. This is the code I am using:

from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen('http://plrplr.com/33717/mp3-player-guide/').read()
soup = BeautifulSoup(html, "lxml")
contents = soup.find('div', {'class': 'entry-content'})
print contents

But I am getting results like this...

<div class="entry-content">
<p>MP3 player, also well known as digital audio player has become a staple of our gadget life. There are many brands of MP3 players on the market today. So, which MP3 player are the most suitable for you? Thatâ€™s where this MP3 player guide comes in. <br/>
Basically, there are 3 types of MP3 player based on capacity: â€“ <br/>
1. Hard drive MP3 player <br/>
â€“ highest capacity <br/>
â€“ largest in size <br/>
â€“ heavy <br/>
â€“ often labeled as an â€œJukebox MP3 playerâ€? <br/>
â€“ has moving parts <br/>
â€“ example: Apple iPod video, Sony Network Walkman NW-HD5 <br/>

There is a problem when dealing with special characters.

How can I get the exact source code, like this?

    <div class="entry-content">
        <p>MP3 player, also well known as digital audio player has become a staple of our gadget life. There are many brands of MP3 players on the market today. So, which MP3 player are the most suitable for you? That&#8217;s where this MP3 player guide comes in. </br><br />
Basically, there are 3 types of MP3 player based on capacity: &#8211; </br><br />
1. Hard drive MP3 player </br><br />
&#8211; highest capacity </br><br />
&#8211; largest in size </br><br />
&#8211; heavy </br><br />
&#8211; often labeled as an &#8220;Jukebox MP3 player&#8221; </br><br />
&#8211; has moving parts </br><br />
&#8211; example: Apple iPod video, Sony Network Walkman NW-HD5 </br><br />

I am running this code on a Windows 8 machine using Eclipse and PyDev.

Either the website provides invalid character encoding, or you should explicitly set it to UTF-8. The problem does not seem to be related to beautifulsoup, but this line: `html = urllib2.urlopen('http://plrplr.com/33717/mp3-player-guide/').read()` — wigy, Mar 17 '15 at 15:05
Maybe this helps you: http://stackoverflow.com/questions/8101036/python-urllib2-utf-8-encoding — wigy, Mar 17 '15 at 15:07
I just ran your code. It worked fine for me exactly as you had it. Although I would update `'div', {'class': 'entry-content'}` to `"div", class_="entry-content"`. What sort of terminal and character set are you using? And what version of python? — jmunsch, Mar 17 '15 at 15:12
I am thinking it might be an environment thing. Perhaps this may be relevant: http://stackoverflow.com/questions/25346518/cant-make-eclipse-luna-pydev-console-use-utf-8 — jmunsch, Mar 17 '15 at 15:25
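
To illustrate the encoding point raised in the comments, here is a minimal sketch of how text like `â€™` can arise. It assumes the page is served as UTF-8 and that some layer (for example the console) interprets the bytes as Windows-1252; that is a guess about the asker's environment, not something confirmed in the thread. It is written for Python 3, whereas the question uses Python 2.

```python
# The right single quote (U+2019) as the server would send it in UTF-8.
raw = "That\u2019s".encode("utf-8")

# Interpreting those three bytes as Windows-1252 yields three separate characters.
wrong = raw.decode("cp1252")

# Decoding with the encoding the bytes were actually written in recovers the text.
right = raw.decode("utf-8")

# ascii() keeps the output printable on consoles that cannot show these characters.
print(ascii(wrong))
print(ascii(right))
```

If the garbled characters only appear when printing, the parsed data may already be correct and the console encoding is the part to change.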

score 2 · Accepted Answer · answered Mar 17 '15 at 15:16

You are probably looking for `contents.prettify(formatter="html")`, which outputs HTML entities instead of non-ASCII characters.

I could not test that on my machine, but here are the docs I used: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters
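
As an alternative to the formatter suggested above (not the answerer's method), the numeric references shown in the question, such as `&#8217;`, can be produced with the standard library alone. This is a rough sketch for Python 3 that assumes the page bytes are UTF-8; the helper name `to_numeric_entities` is made up for illustration.

```python
def to_numeric_entities(html_bytes, encoding="utf-8"):
    """Decode raw bytes, then replace every non-ASCII character
    with a numeric character reference such as &#8217;."""
    text = html_bytes.decode(encoding)  # assumption: the bytes really are UTF-8
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Stand-in for the bytes returned by urlopen(...).read() in the question.
sample = "<p>That\u2019s it: \u2013 \u201cJukebox\u201d</p>".encode("utf-8")

result = to_numeric_entities(sample)
print(result)
```

The same `encode("ascii", "xmlcharrefreplace")` call can be applied to the string form of a Beautiful Soup element, since the output is plain ASCII and prints the same way regardless of the console encoding.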
