8

I have a string with symbols like this:

&#39;

That's an apostrophe apparently.

I tried saxutils.unescape() and urllib.unquote() without any luck.

How can I decode this? Thanks!

Alan Moore
rick

3 Answers

2

Check out this question. What you're looking for is "HTML entity decoding". Typically, you'll find a function named something like "htmldecode" that does what you want. Django, Cheetah, and BeautifulSoup all provide such functions.

The other answer will work just great if you don't want to use a library and all the entities are numeric.
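For readers on Python 3.4 or later (an assumption; this thread predates it), the standard library's html.unescape decodes numeric and named entities without a third-party dependency. A minimal sketch, using sample strings chosen for illustration:

```python
import html

# html.unescape handles decimal, hex, and named character references.
samples = ["l&#39;eau", "L&#x27;aldil&#xE0;", "foo &lt; bar"]
decoded = [html.unescape(s) for s in samples]

for before, after in zip(samples, decoded):
    print(f"{before!r} -> {after!r}")
```

On Python 2, which the rest of this thread targets, html.unescape is not available, so a library or a hand-written function is still needed there.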

Community
easel
  • thanks. what does Django have? because i looked in the docs but couldn't find anything... – rick May 03 '09 at 03:55
  • It's called django.utils.html.escape, apparently. Check out the other stackoverflow question I linked for some more details. – easel May 03 '09 at 04:17
  • it looks like django.utils.html.escape only works to encode, not decode. i ended up using BeautifulSoup. thanks – rick May 04 '09 at 04:43
2

Try this: (found it here)

from htmlentitydefs import name2codepoint as n2cp
import re

def decode_htmlentities(string):
    """
    Decode HTML entities–hex, decimal, or named–in a string
    @see http://snippets.dzone.com/posts/show/4569

    >>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
    >>> print decode_htmlentities(u).encode('UTF-8')
    E tu vivrai nel terrore - L'aldilà (1981)
    >>> print decode_htmlentities("l&#39;eau")
    l'eau
    >>> print decode_htmlentities("foo &lt; bar")
    foo < bar
    """
    def substitute_entity(match):
        ent = match.group(3)
        if match.group(1) == "#":
            # decoding by number
            if match.group(2) == '':
                # number is in decimal
                return unichr(int(ent))
            elif match.group(2) == 'x':
                # number is in hex
                return unichr(int('0x'+ent, 16))
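        else:
            # they were using a name; group(2) may have captured a
            # leading "x" (e.g. &xi;), so put it back before the lookup
            cp = n2cp.get(match.group(2) + ent)
            if cp: return unichr(cp)
            else: return match.group()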

    entity_re = re.compile(r'&(#?)(x?)(\w+);')
    return entity_re.subn(substitute_entity, string)[0]
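The function above is Python 2 code (htmlentitydefs, unichr, print statements). A rough Python 3 port might look like the sketch below. The regex here is my own stricter variant, not the one from the linked snippet, and unknown or out-of-range entities are left unchanged by assumption:

```python
import re
from html.entities import name2codepoint

# "#" marks a numeric reference; a leading x/X marks hex; otherwise it is a name.
ENTITY_RE = re.compile(r"&(?:#([xX][0-9a-fA-F]+|[0-9]+)|(\w+));")


def decode_htmlentities(text):
    """Decode hex, decimal, and named HTML entities in text (Python 3 sketch)."""
    def substitute_entity(match):
        number, name = match.group(1), match.group(2)
        try:
            if number is not None:
                if number[0] in "xX":
                    # hexadecimal reference such as &#x27;
                    return chr(int(number[1:], 16))
                # decimal reference such as &#39;
                return chr(int(number))
            # named reference such as &lt;
            return chr(name2codepoint[name])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or invalid code point: leave the text untouched.
            return match.group(0)

    return ENTITY_RE.sub(substitute_entity, text)


print(decode_htmlentities("l&#39;eau"))
print(decode_htmlentities("foo &lt; bar"))
```

Matching names and numbers in separate groups means an entity such as &xi; is looked up as a name rather than being mistaken for a hex prefix.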
Adrian Mester
1

The most robust solution seems to be this function by Python luminary Fredrik Lundh. It is not the shortest solution, but it handles named entities as well as hex and decimal codes.

John Y