1

EDIT: I cannot believe that BeautifulSoup actually cannot parse HTML properly. Then again, maybe I am doing something wrong, but if I am not, this is a really amateurish module.

I am trying to get text from the web, but I am unable to do so because I keep getting strange characters in most of the sentences. I never get a sentence containing a word such as "isn't" correctly.

import urllib2
from bs4 import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL', None, useragent)
myreq = urllib2.urlopen(request, timeout=5)
html = myreq.read()

# get paragraphs
soup = BeautifulSoup(html)
textList = soup.find_all('p')
mytext = ""
for par in textList:
    if len(str(par)) < 2000:
        print par
        mytext += " " + str(par)

print "the text is ", mytext

The result contains some strange characters:

The plural of “comedo� is comedomes�.</p>
Surprisingly, the visible black head isn’t caused by dirt

Obviously I want to get isn't instead of isn’t. What should I do?

Brana
  • possible duplicate of [Decode HTML entities in Python string?](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) – icedtrees Feb 28 '14 at 10:38
  • This isn't a duplicate. I first need to extract all the paragraphs. I think that the decoding deletes all the `<p>` tags. – Brana Feb 28 '14 at 11:01
  • I would need something to tell BeautifulSoup not to ruin my HTML. I cannot believe that such a reputable Python module cannot properly parse HTML. – Brana Feb 28 '14 at 11:05

3 Answers

1

I believe the problem is with your system's output encoding, which cannot display the decoded characters because they fall outside the terminal's supported character range.

BeautifulSoup4 is meant to fully support HTML entities.
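This is easy to verify in isolation: entity decoding is handled for you by the parser, and the same conversion is exposed in the standard library. A minimal sketch (Python 3 syntax, using the stdlib `html` module rather than bs4 so it needs no installation):

```python
from html import unescape  # stdlib; bs4 performs the same conversion internally

# &rsquo; / &ldquo; / &rdquo; are the entities behind the curly quote characters
print(unescape("isn&rsquo;t"))            # isn’t
print(unescape("&ldquo;comedo&rdquo;"))   # “comedo”
```

The entities come back as real Unicode characters (U+2019, U+201C, U+201D), which is exactly why a terminal with a narrow code page then mangles them on output.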

Notice the strange behaviour of these commands:

>python temp.py
...
ed a blackhead. The plural of ÔÇ£comedoÔÇØ is comedomesÔÇØ.</p>
...

>python temp.py > temp.txt

>cat temp.txt
....
ed a blackhead. The plural of "comedo" is comedomes".</p> <p> </p> <p>Blackheads is an open and wide
....

I suggest writing your output to a text file, or perhaps using a different terminal/changing your terminal settings to support a wider range of characters.
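Writing to a file with an explicit encoding sidesteps the console entirely. A minimal sketch in Python 3 syntax (in Python 2 you would use `io.open` or `codecs.open` with a `u"..."` literal); the sample sentence is made up for illustration:

```python
# Bypass the console code page by writing the extracted text as UTF-8.
text = u"The plural of \u201ccomedo\u201d is comedones. It isn\u2019t dirt."
with open("out.txt", "w", encoding="utf-8") as f:
    f.write(text)

# Reading it back with the same encoding round-trips cleanly.
with open("out.txt", encoding="utf-8") as f:
    assert f.read() == text
```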

icedtrees
  • I am still getting - “comedoâ€? is comedomes†- Did you use python 2.7? – Brana Feb 28 '14 at 11:22
  • My python is messing up, I keep getting different outputs. Can you tell me what versions of python and beautifulsoup you have? – icedtrees Feb 28 '14 at 11:23
  • it is python 2.7.3 and bs4 – Brana Feb 28 '14 at 11:25
  • Do you see some strange -   characters as well? – Brana Feb 28 '14 at 11:26
  • I saw the characters before, but now when I try to run your code, I can't see them anymore. Still trying to figure out what's going on – icedtrees Feb 28 '14 at 11:27
  • Thanks for the effort. – Brana Feb 28 '14 at 11:31
  • Where are you running Python from? After testing, I believe it is a problem with your character encoding. – icedtrees Feb 28 '14 at 12:08
  • Windows 7. Where do i set the character encoding? – Brana Feb 28 '14 at 12:21
  • Anyway, I cannot believe that after 10 hours I am still not able to get proper output. The best I could get, using regex instead of BS, is sentences with wrong quotes - best “medical grade facial” instead of best "medical grade facial". – Brana Feb 28 '14 at 12:23
  • I am also on Windows 7. The program runs fine on IDLE after a few tries. I can't get it to work in Command Prompt, but I can redirect the output using `>` to a text file (read edited answer), and it seems to encode fine there. – icedtrees Feb 28 '14 at 12:34
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/48685/discussion-between-icedtrees-and-brana) – icedtrees Feb 28 '14 at 12:40
0

Since this is Python 2, the `urllib2.urlopen().read()` call returns a byte string, most likely encoded in UTF-8; you can look at the HTTP headers to see the encoding if it is explicitly included. I assumed UTF-8.

You fail to decode this external representation before you start handling the content, and this is only going to lead to tears. General rule: decode inputs immediately, encode only on output.
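The rule fits in a few lines. A sketch in Python 3 syntax (in Python 2 the decoded value would be a `unicode` object rather than `str`):

```python
raw = b"isn\xe2\x80\x99t"        # UTF-8 bytes, as returned by urlopen().read()
text = raw.decode("utf-8")       # decode immediately on input
# text is now a real Unicode string: isn’t (U+2019, not three mojibake chars)
out = text.encode("utf-8")       # encode only at the output boundary
assert out == raw
```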

Here's your code in working form, with only two modifications:

import urllib2
from BeautifulSoup import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL',None,useragent)
myreq = urllib2.urlopen(request, timeout = 5)
html = unicode(myreq.read(), "UTF-8")

#get paragraphs
soup = BeautifulSoup(html)
textList = soup.findAll('p')
mytext = ""
for par in textList:
    if len(str(par)) < 2000:
        print par
        mytext += " " + str(par)

print "the text is ", mytext

All I have done is added unicode decoding of html and used soup.findAll() rather than soup.find_all().

holdenweb
  • I tried it and I see the same. How do I see the HTTP headers? – Brana Feb 28 '14 at 13:15
  • Rather odd. When I run the above code then the HTML I see contains `The plural of “comedo” is comedomes”.` - which is to say that the left and right quote marks have been correctly dealt with. Access the headers using `myreq.headers.headers`. I didn't see much of value, but did confirm you are dealing with a UTF-8 stream. The content appears to have an opening quotation mark missing. – holdenweb Feb 28 '14 at 13:46
  • Yes it is charset=UTF-8 as i see. Never mind i added the solution that works for me. I use bs4 which probably does something strange. – Brana Feb 28 '14 at 14:04
0

This is a solution based on people's answers from here and my research.

import html2text
import urllib2
import re
import nltk

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL',None,useragent)
myreq = urllib2.urlopen(request, timeout = 5)
html = myreq.read()
html = html.decode("utf-8")


textList = re.findall(r'(?<=<p>).*?(?=</p>)', html, re.MULTILINE|re.DOTALL)
mytext = ""
for par in textList:
    if len(par) < 2000:
        par = re.sub('<[^<]+?>', '', par)
        mytext += " " + html2text.html2text(par)

print "the text is ", mytext
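The extraction step of this solution can be checked in isolation. A sketch in Python 3 syntax, using the same two regexes on a made-up HTML snippet:

```python
import re

# Hypothetical input: already decoded from UTF-8 bytes to a Unicode string
html = u"<p>It isn\u2019t <b>dirt</b>.</p><p>Second paragraph.</p>"

# Non-greedy match of everything between <p> and </p>
textList = re.findall(r'(?<=<p>).*?(?=</p>)', html, re.MULTILINE | re.DOTALL)

# Strip any remaining inline tags from each paragraph
cleaned = [re.sub('<[^<]+?>', '', par) for par in textList]
print(cleaned)  # ['It isn’t dirt.', 'Second paragraph.']
```

Because the string was decoded before matching, the curly apostrophe survives as a single character instead of three mojibake bytes.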
Brana