
I am using Beautiful Soup and urllib2 to collect content from the internet. This is the code I am using:

from bs4 import BeautifulSoup
import urllib2

html = urllib2.urlopen('http://plrplr.com/33717/mp3-player-guide/').read()
soup = BeautifulSoup(html, "lxml")
contents = soup.find('div', {'class': 'entry-content'})
print contents

But I am getting results like this...

<div class="entry-content">
<p>MP3 player, also well known as digital audio player has become a staple of our gadget life. There are many brands of MP3 players on the market today. So, which MP3 player are the most suitable for you? That’s where this MP3 player guide comes in. <br/>
Basically, there are 3 types of MP3 player based on capacity: – <br/>
1. Hard drive MP3 player <br/>
– highest capacity <br/>
– largest in size <br/>
– heavy <br/>
– often labeled as an “Jukebox MP3 player� <br/>
– has moving parts <br/>
– example: Apple iPod video, Sony Network Walkman NW-HD5 <br/>

There is a problem when dealing with special characters.

How can I get the exact source code, like this?

    <div class="entry-content">
        <p>MP3 player, also well known as digital audio player has become a staple of our gadget life. There are many brands of MP3 players on the market today. So, which MP3 player are the most suitable for you? That&#8217;s where this MP3 player guide comes in. </br><br />
Basically, there are 3 types of MP3 player based on capacity: &#8211; </br><br />
1. Hard drive MP3 player </br><br />
&#8211; highest capacity </br><br />
&#8211; largest in size </br><br />
&#8211; heavy </br><br />
&#8211; often labeled as an &#8220;Jukebox MP3 player&#8221; </br><br />
&#8211; has moving parts </br><br />
&#8211; example: Apple iPod video, Sony Network Walkman NW-HD5 </br><br />

I am running this code on a Windows 8 machine using Eclipse and PyDev.

Jake
  • Either the website provides invalid character encoding, or you should explicitly set it to UTF-8. The problem does not seem to be related to beautifulsoup, but this line: `html = urllib2.urlopen('http://plrplr.com/33717/mp3-player-guide/').read()` – wigy Mar 17 '15 at 15:05
  • Maybe this helps you: http://stackoverflow.com/questions/8101036/python-urllib2-utf-8-encoding – wigy Mar 17 '15 at 15:07
  • I just ran your code. It worked fine for me exactly as you had it. Although I would update `'div', {'class': 'entry-content'}` to `"div", class_="entry-content"`. What sort of terminal and character set are you using? And what version of python? – jmunsch Mar 17 '15 at 15:12
  • I am thinking it might be an environment thing. Perhaps this may be relevant: http://stackoverflow.com/questions/25346518/cant-make-eclipse-luna-pydev-console-use-utf-8 – jmunsch Mar 17 '15 at 15:25
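
The encoding suggestion in the comments above can be sketched roughly as follows. This is a minimal Python 3 sketch using only the standard library, not the original Python 2 code. The byte string `raw` is a made-up stand-in for what `urlopen(...).read()` returns, the page is assumed to be UTF-8, and `to_ascii_entities` is a hypothetical helper name.

```python
# Hypothetical stand-in for the bytes returned by urlopen(...).read().
# Assumption: the page is served as UTF-8.
raw = "That\u2019s where this \u201cguide\u201d comes in \u2013".encode("utf-8")


def to_ascii_entities(data, encoding="utf-8"):
    """Decode bytes with an explicit encoding, then replace every
    non-ASCII character with a numeric character reference."""
    text = data.decode(encoding)
    # "xmlcharrefreplace" turns each unencodable character into &#NNNN;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


escaped = to_ascii_entities(raw)
print(escaped)
```

Because the result is plain ASCII, it should print the same way regardless of the console encoding, which may sidestep the Eclipse/PyDev console issue mentioned above.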

1 Answer

You are probably looking for `contents.prettify(formatter="html")`, which shows HTML entities instead of non-ASCII characters.

I could not test that on my machine, but here are the docs I used: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters
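
A minimal sketch of how that call might be used, written for Python 3 and untested against the page in the question. The inline `html` string is a made-up sample rather than the real page, and `"html.parser"` is used here instead of `"lxml"` only to avoid an extra dependency. Note that `formatter="html"` emits named entities where one exists, so the result may not match the numeric references (`&#8217;`) in the question character for character.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the downloaded page.
html = (
    '<div class="entry-content">'
    "<p>That\u2019s an \u201cexample\u201d \u2013 text</p>"
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
contents = soup.find("div", class_="entry-content")

# formatter="html" asks Beautiful Soup to substitute HTML entities
# for Unicode characters wherever it knows one.
entity_markup = contents.prettify(formatter="html")
print(entity_markup)
```

The output is re-indented by `prettify()`, so it is a normalized view of the parsed tree rather than a byte-for-byte copy of the original source.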

wigy