
Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python

In HTML sources, there are lots of character references like "&# 58;" or "&# 46;" (I have to put a space between &# and the number, otherwise they would be rendered as ":" or "."). My question is: how do you convert them to the characters they represent in Python? Is there a built-in method or something?

Hopefully somebody can help me out. Thanks

Shane

2 Answers

I am not sure whether there is a built-in library for this, but here is a quick and dirty way to do it with a regex:

>>> import re
>>> # Replace each decimal reference (e.g. &#58;) with the character it encodes
>>> re.sub(r"&#(\d+);", lambda x: unichr(int(x.group(1), 10)), "&#58; or &#46;")
u': or .'
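
For readers on Python 3, where `unichr` no longer exists, the same regex idea could be sketched roughly as follows. This is only an adaptation under that assumption, not part of the original answer; the names `NUMERIC_REF` and `decode_numeric_refs` are made up for illustration, and like the one-liner above it only handles decimal references.

```python
import re

# Matches decimal numeric character references such as &#58;
NUMERIC_REF = re.compile(r"&#(\d+);")


def decode_numeric_refs(text):
    """Replace each decimal reference with the character it encodes."""
    # chr() plays the role that unichr() plays in Python 2
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


print(decode_numeric_refs("&#58; or &#46;"))  # prints ": or ."
```

Hex references (`&#x3A;`) and named entities (`&amp;`) are left untouched by this pattern.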
YOU

Something like this will handle most entity definitions (assuming Python 2.x). It handles decimal, hex, and any named entities that are in htmlentitydefs.

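import re
from htmlentitydefs import name2codepoint  # Python 2 only

# Group 1: decimal (&#58;), group 2: hex (&#x3A;), group 3: named (&amp;).
# Names may contain digits (e.g. frac12, sup2), so allow them after the first letter.
EntityPattern = re.compile(r'&(?:#(\d+)|#[xX]([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        dec, hexa, name = match.groups()
        if dec:
            return unichr(int(dec, 10))
        if hexa:
            return unichr(int(hexa, 16))
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        # Unknown entity name: leave the original text untouched
        return match.group(0)

    # Decode byte strings first so the result is always unicode
    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)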
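
As for the question of a built-in: assuming Python 3.4 or later (the code above targets Python 2), the standard library's `html.unescape` covers decimal, hex, and named references in one call. A minimal sketch, with the sample string chosen purely for illustration:

```python
from html import unescape  # standard library, Python 3.4+

# One decimal, one hex, and two named references
sample = "&#58; &#x2E; &amp; &eacute;"

decoded = unescape(sample)
print(decoded)  # prints ": . & é"
```

On Python 2 this function is not available, which is why a hand-written helper like `decodeEntities` is needed there.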
Duncan