
I need to replace every ASCII character other than letters with its HTML numeric code (http://www.ascii.cl/htmlcodes.htm). Following this post (Convert HTML entities to Unicode and vice versa), I tried the code below, but I still can't get * (and perhaps many other characters) converted.

What would be the solution? Is a series of simple replacements the only option?

>>> from BeautifulSoup import BeautifulStoneSoup as bs
>>> import cgi
>>> cgi.escape("<*>").encode('ascii', 'xmlcharrefreplace')
'&lt;*&gt;'
prosseek
    Why would `*` get replaced? It's not special in this context. There is no [html entity](http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references) for `*`. – Carsten Apr 18 '15 at 22:47

1 Answer

Your question is a bit vague. I will assume that by "alphabets" you mean the characters a-z and their uppercase variants. In that case, you can get the desired result with a regular expression:

>>> import re
>>> f = lambda s: re.sub(r'([^a-zA-Z])', lambda x: '&#{};'.format(ord(x.group(0))), s)
>>> f("<hi>")
'&#60;hi&#62;'
>>> f("<*>")
'&#60;&#42;&#62;'
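
The same idea can be written as a named function with a precompiled pattern, which may be easier to reuse and test. This is only a sketch: the names `to_numeric_refs` and `_NON_LETTER` are made up for illustration, and it keeps the assumption above that "alphabets" means a-z and A-Z only, so spaces, digits, and accented letters are converted too.

```python
import re

# Matches any single character that is not an ASCII letter (assumed meaning of "alphabets").
_NON_LETTER = re.compile(r'[^a-zA-Z]')

def to_numeric_refs(text):
    """Replace every non-letter character with its decimal reference, e.g. '*' -> '&#42;'."""
    return _NON_LETTER.sub(lambda m: '&#{};'.format(ord(m.group(0))), text)

print(to_numeric_refs("<*>"))  # &#60;&#42;&#62;
print(to_numeric_refs("a b"))  # a&#32;b
```

If digits or whitespace should stay untouched, the character class would need to be widened accordingly, for example `[^a-zA-Z0-9\s]`.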

Please note that, without knowing your specific application, this looks like an unusual thing to do. There might be a better way to solve the real underlying problem.
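
As for why the attempt in the question leaves `*` alone: the escape step only rewrites the few characters that are special in HTML, and the `xmlcharrefreplace` error handler only kicks in for characters that cannot be encoded in the target encoding. `*` is plain ASCII, so neither step touches it. The sketch below shows this on Python 3, where `html.escape` is assumed as the stand-in for `cgi.escape` (which is no longer available in recent Python 3 releases).

```python
import html

# html.escape only rewrites &, <, > (and quotes by default); '*' is left as is.
escaped = html.escape("<*>")
print(escaped)  # &lt;*&gt;

# Everything here is already ASCII, so xmlcharrefreplace has nothing to replace.
encoded = escaped.encode('ascii', 'xmlcharrefreplace')
print(encoded)  # b'&lt;*&gt;'

# Only a non-ASCII character such as 'é' triggers a numeric reference.
non_ascii = "é*".encode('ascii', 'xmlcharrefreplace')
print(non_ascii)  # b'&#233;*'
```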

Carsten