How do I convert characters like "&#58;" to ":" in Python?

Question

Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python

In HTML sources there are tons of character references like "&# 58;" or "&# 46;" (I have to put a space between &# and the numbers, or they would be rendered as ":" or "."). My question is: how do you convert them to the characters they are supposed to be in Python? Is there a built-in method or something?

Hopefully somebody can help me out. Thanks

score 5 · Accepted Answer · answered Feb 18 '11 at 11:52

I am not sure whether there is a built-in library for this, but here is a quick and dirty way to do it with a regex:

>>> import re
>>> re.sub(r"&#(\d+);", lambda x: unichr(int(x.group(1), 10)), "&#58; or &#46;")
u': or .'
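
The session above is Python 2 (`unichr` does not exist in Python 3). As a rough sketch of the same idea for Python 3, the substitution can be wrapped in a function so it is clear where the input text goes. The name `decode_decimal_refs` is hypothetical and chosen only for illustration, and only decimal references are handled, as in the one-liner above.

```python
import re

def decode_decimal_refs(text):
    """Replace decimal references such as &#58; with the characters they name."""
    # The text to convert is the last argument of re.sub; the lambda receives
    # each match and turns the captured digits into a single character.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)

print(decode_decimal_refs("&#58; or &#46;"))  # prints ": or ."
```

Note that `chr` raises an error for numbers outside the Unicode range, so untrusted input may need extra checks.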

YOU

@YOU Can you show a sample, where do I input text in this? Sorry am new to python. – Abhishek Bhatia May 19 '15 at 22:03

score 2 · Answer 2 · answered Feb 18 '11 at 11:54

Something like this will handle most entity definitions (assuming Python 2.x). It handles decimal, hex, and any named entities that are in htmlentitydefs.

import re
from htmlentitydefs import name2codepoint

# Matches decimal (&#58;), hex (&#x3A;) and named (&amp;, &frac12;) references.
EntityPattern = re.compile(r'&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z][a-zA-Z\d]*));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        decimal, hexadecimal, name = match.groups()
        if decimal:
            return unichr(int(decimal, 10))
        if hexadecimal:
            return unichr(int(hexadecimal, 16))
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        # Unknown entity name: leave the text unchanged.
        return match.group(0)

    # Byte strings are decoded first so the result is always unicode.
    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)
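
For readers on Python 3, here is a hedged sketch of how the function above might be ported: `htmlentitydefs` became `html.entities`, `unichr` became `chr`, and byte input is `bytes` rather than `str`. The names `decode_entities` and `ENTITY_PATTERN` are hypothetical, and this port also accepts digits in entity names (for example `frac12`). On the original question of a built-in: Python 3.4 and later ship `html.unescape` in the standard library, which is usually the simpler choice.

```python
import html
import re
from html.entities import name2codepoint

# Decimal (&#58;), hex (&#x3A;) and named (&amp;, &frac12;) references.
ENTITY_PATTERN = re.compile(r"&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z][a-zA-Z\d]*));")

def decode_entities(s, encoding="utf-8"):
    def unescape(match):
        decimal, hexadecimal, name = match.groups()
        try:
            if decimal:
                return chr(int(decimal, 10))
            if hexadecimal:
                return chr(int(hexadecimal, 16))
        except (ValueError, OverflowError):
            # Number is not a valid code point: leave the text unchanged.
            return match.group(0)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        # Unknown entity name: leave the text unchanged.
        return match.group(0)

    # Accept bytes as well as str, decoding bytes first.
    if isinstance(s, bytes):
        s = s.decode(encoding)
    return ENTITY_PATTERN.sub(unescape, s)

print(decode_entities("&#58; &#x2E; &amp; &bogus;"))  # prints ": . & &bogus;"
print(html.unescape("&#58; or &#46;"))                # prints ": or ."
```

The hand-written version is mainly useful when you want full control over which references are decoded; `html.unescape` follows the HTML5 rules and knows a larger set of entity names.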
