13

I'm having trouble displaying scraped content correctly. Here is my program:

#! /usr/bin/python

import urllib
import re

url = "http://yahoo.com"
pattern = '''<span class="medium item-label".*?>(.*)</span>'''

website = urllib.urlopen(url)
pageContent = website.read()
result = re.findall(pattern, pageContent)

for record in result:
    print record

Output:

Masked teen killed by dad
First look in &#39;Hotel of Doom&#39;
Ex-NFL QB&#39;s sad condition
Reporter ignores warning
Romney&#39;s low bar for debates

So the question is: what should I include in my code to convert character references such as &#39; into the characters they represent?

Vor
  • Possibly a duplicate of http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python – charlee Sep 28 '12 at 19:31

2 Answers

18

In Python 2:

In [16]: text = 'Ex-NFL QB&#39;s sad condition'

In [17]: import HTMLParser

In [18]: parser = HTMLParser.HTMLParser()

In [19]: parser.unescape(text)
Out[19]: u"Ex-NFL QB's sad condition"

In Python 3 (before 3.9 only; HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):

import html.parser as htmlparser
parser = htmlparser.HTMLParser()
parser.unescape(text)
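
If the same script has to run under both Python 2 and Python 3, one option is to pick whichever unescape function is available at import time. This is only a minimal sketch: the name unescape_entities is a hypothetical helper introduced here, not part of either standard library.

```python
try:
    # Python 3.4+: unescape is a function in the html module.
    from html import unescape as _unescape
except ImportError:
    # Python 2 fallback: use the HTMLParser instance method.
    from HTMLParser import HTMLParser
    _unescape = HTMLParser().unescape


def unescape_entities(text):
    """Convert HTML character references such as &#39; or &amp; to characters.

    Hypothetical wrapper for illustration only.
    """
    return _unescape(text)


print(unescape_entities("Ex-NFL QB&#39;s sad condition"))
```

Wrapping the choice in one function keeps the version check in a single place, so the rest of the script can call unescape_entities without caring which interpreter it runs on.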
unutbu
11

The solution for Python 3 (3.4 and later):

import html
html.unescape(text)
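
Applied to the loop from the question, the call can be made on each regex match before printing. The sketch below is an assumption-laden illustration: page_content is a hypothetical stand-in for the downloaded page so the example runs offline, and the capture group is made non-greedy, which is a change from the original pattern.

```python
import html
import re

# Hypothetical stand-in for the downloaded page, so the example runs offline.
page_content = (
    '<span class="medium item-label" id="a">First look in &#39;Hotel of Doom&#39;</span>\n'
    '<span class="medium item-label" id="b">Ex-NFL QB&#39;s sad condition</span>\n'
)

# Pattern from the question, with the capture group made non-greedy
# (a change made for this sketch).
pattern = r'<span class="medium item-label".*?>(.*?)</span>'

# Unescape each captured headline as it is collected.
headlines = [html.unescape(match) for match in re.findall(pattern, page_content)]

for headline in headlines:
    print(headline)
```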
TAbdiukov
Abhishek