3

When I'm processing HTML code in Python I have to use the following code because of special characters.

line = string.replace(line, "&quot;", "\"")
line = string.replace(line, "&apos;", "'")
line = string.replace(line, "&amp;", "&")
line = string.replace(line, "&lt;", "<")
line = string.replace(line, "&gt;", ">")
line = string.replace(line, "&laquo;", "<<")
line = string.replace(line, "&raquo;", ">>")
line = string.replace(line, "&#039;", "'")
line = string.replace(line, "&#8220;", "\"")
line = string.replace(line, "&#8221;", "\"")
line = string.replace(line, "&#8216;", "\'")
line = string.replace(line, "&#8217;", "\'")
line = string.replace(line, "&#9632;", "")
line = string.replace(line, "&#8226;", "-")

It seems there will be many more special characters like these to replace. Do you know how to make this code more elegant?

thank you

xralf
  • possible duplicate of [Decode HTML entities in Python string?](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) – Ben James Jul 31 '11 at 11:30
  • 1
    `string.replace` and most similar functions in the `string` module are deprecated: http://docs.python.org/library/string.html#deprecated-string-functions – Rosh Oxymoron Jul 31 '11 at 11:36
  • @Ben James thank you, this solution is suitable for me, but it's not a duplicate because I might want to make another sequence of replacements (e.g. a thousand replacements based on something other than HTML special characters) – xralf Jul 31 '11 at 11:38
  • @xralf fair enough, it is indeed a valid question on its own independent of the reference to HTML entities. – Ben James Jul 31 '11 at 11:41
  • 2
    Use `HTMLParser.unescape()` to convert entity and character references to the real characters they refer to. I wouldn't try to replace non-ASCII characters with ASCII that ‘looks a bit like’ them, like replacing `»` with `>>`... there are far too many characters to handle; much better to just allow the Unicode to work. – bobince Jul 31 '11 at 11:46
  • @bobince Thank you, it's better than to use BeautifulSoup for this purpose, because beautiful soup will delete normal HTML tags too, which is undesirable when you're processing text which is talking about HTML code. – xralf Jul 31 '11 at 12:15
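
For readers on Python 3, here is a minimal sketch of the approach suggested in the comments. It assumes Python 3.4 or later, where the standard library exposes the same functionality as `html.unescape()`; the sample string is only an illustration. Unlike the hand-written table in the question, this maps each reference to its real Unicode character rather than to an ASCII look-alike.

```python
import html

# Sample text mixing named, decimal and typographic references.
line = 'A tag &lt;bidi&gt; has a &quot;weird&#8220;&#8226;&apos;content&apos; &raquo;'

# html.unescape converts named and numeric character references
# to the characters they stand for, in a single call.
decoded = html.unescape(line)
print(decoded)
```

Note that `&#8220;` becomes a real left double quotation mark here, not a plain `"`, so any ASCII folding would still need a separate step.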

3 Answers

4
REPLACEMENTS = [
    ("&quot;", "\""),
    ("&apos;", "'"),
    ...
    ]
for entity, replacement in REPLACEMENTS:
    line = line.replace(entity, replacement)

Note that string.replace is simply available as a method on str/unicode objects.

Better yet, check out this question!

The title of your question asks something different, though: optimization, i.e. making it run faster. That's a completely different problem, and will require more work.

Thomas
  • You're right I will change the word "optimization" to "readability" – xralf Jul 31 '11 at 11:41
  • 2
    Beware that the order of replacements is always important: when you replace `"&amp;"` with `"&"`, it might turn `"&amp;lt;"` into `<` if you do it in the wrong order. If you have a common replacement pattern, you might look for it with `re.sub` and use a function to get the replacement (this works for things that look like HTML entities). – Rosh Oxymoron Jul 31 '11 at 11:45
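
To make the ordering pitfall from the comment above concrete, here is a small sketch in Python 3. The helper names (`replace_sequentially`, `replace_single_pass`) and the three-entry table are made up for illustration. It compares chained `str.replace` calls with a single `re.sub` pass that looks up each match in a dict.

```python
import re

# Hypothetical, deliberately tiny replacement table.
REPLACEMENTS = {"&amp;": "&", "&lt;": "<", "&gt;": ">"}

def replace_sequentially(text, pairs):
    """Apply each (entity, replacement) pair in order over the whole text."""
    for entity, replacement in pairs:
        text = text.replace(entity, replacement)
    return text

# One alternation of all entities; re.escape guards against regex metacharacters.
ENTITY_RE = re.compile("|".join(re.escape(e) for e in REPLACEMENTS))

def replace_single_pass(text):
    """Replace every entity in one scan, so output is never re-scanned."""
    return ENTITY_RE.sub(lambda m: REPLACEMENTS[m.group(0)], text)

sample = "&amp;lt;"  # an escaped "&lt;", which should decode to the text "&lt;"

# Decoding "&amp;" first creates a new "&lt;" that the next step decodes again.
wrong = replace_sequentially(sample, [("&amp;", "&"), ("&lt;", "<")])
# The single pass never looks at text it has already produced.
right = replace_single_pass(sample)

print(repr(wrong), repr(right))
```

With chained replacements, putting `"&amp;"` last avoids this particular double decoding; the single-pass version does not depend on order at all.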
2

Here's some code I wrote a while back to decode HTML entities. Note that it is for Python 2.x so it also decodes from str to unicode: you can drop that bit if you are using a modern Python. I think it handles any of the named entities, decimal and hex entities. For some reason 'apos' isn't in Python's dictionary of named entities so I copy it first and add the missing one:

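import re
from htmlentitydefs import name2codepoint

# Copy the stdlib table so that adding 'apos' does not modify the shared module dict.
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")

# Groups: 1 = decimal reference, 2 = hex reference, 3 = named entity.
# Names may contain digits (e.g. 'frac12', 'sup2'), so digits are allowed after the first letter.
EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z][a-zA-Z0-9]*));')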
def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        return match.group(0)

    if isinstance(s, str):
        s = s.decode(encoding)
    return EntityPattern.sub(unescape, s)
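
For comparison, here is a rough Python 3 adaptation of the same idea. It is a sketch, not the code from this answer. It assumes `html.entities.name2codepoint` (the Python 3 home of the `htmlentitydefs` table), drops the str-to-unicode step, and uses illustrative names (`CODEPOINTS`, `decode_entities`).

```python
import re
from html.entities import name2codepoint

# Work on a copy and add 'apos' only if the table lacks it.
CODEPOINTS = dict(name2codepoint)
CODEPOINTS.setdefault("apos", ord("'"))

# Groups: 1 = decimal reference, 2 = hex reference, 3 = named entity.
ENTITY_PATTERN = re.compile(
    r"&(?:#(\d+)|#[xX]([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));"
)

def decode_entities(s):
    """Decode named, decimal and hex references; leave unknown ones untouched."""
    def unescape(match):
        dec, hexa, name = match.groups()
        try:
            if dec:
                return chr(int(dec, 10))
            if hexa:
                return chr(int(hexa, 16))
        except (ValueError, OverflowError):
            # Code point outside the valid Unicode range: keep the original text.
            return match.group(0)
        if name in CODEPOINTS:
            return chr(CODEPOINTS[name])
        return match.group(0)

    return ENTITY_PATTERN.sub(unescape, s)

print(decode_entities("&lt;a&gt; &#8220;x&#x201D; &apos;y&apos; &frac12; &bogus;"))
```

Because the substitution is a single pass over the input, `&amp;lt;` decodes to the text `&lt;` and is not decoded a second time.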
Duncan
2

Optimization

REPL_tu = (("&quot;", "\"") , ("&apos;", "'") , ("&amp;", "&") ,
           ("&lt;", "<") , ("&gt;", ">") ,
           ("&laquo;", "<<") , ("&raquo;", ">>") ,
           ("&#039;", "'") ,
           ("&#8220;", "\"") , ("&#8221;", "\"") ,
           ("&#8216;", "\'") , ("&#8217;", "\'") ,
           ("&#9632;", "") , ("&#8226;", "-")     )

def repl(mat, d = dict(REPL_tu)):
    return d[mat.group()]

import re
regx = re.compile('|'.join(a for a,b in REPL_tu))

line = 'A tag &lt;bidi&gt; has a &quot;weird&#8220;&#8226;&apos;content&apos;'
modline = regx.sub(repl,line)
print 'Exemple:\n\n'+line+'\n'+modline 

from urllib import urlopen

print '\n-----------------------------------------\nDownloading a web source:\n'
sock = urlopen('http://www.mythicalcreaturesworld.com/greek-mythology/monsters/python-the-serpent-of-delphi-%E2%80%93-python-the-guardian-dragon-and-apollo/')
html_source = sock.read()
sock.close()

from time import clock

n = 100

te = clock()
for i in xrange(n):
    res1 = html_source
    res1 = regx.sub(repl,res1)
print 'with regex  ',clock()-te,'seconds'


te = clock()
for i in xrange(n):
    res2 = html_source
    for entity, replacement in REPL_tu:
        res2 = res2.replace(entity, replacement)
print 'with replace',clock()-te,'seconds'

print res1==res2

result

Exemple:

A tag &lt;bidi&gt; has a &quot;weird&#8220;&#8226;&apos;content&apos;
A tag <bidi> has a "weird"-'content'

-----------------------------------------
Downloading a web source:

with regex   0.097578323502 seconds
with replace 0.213866846205 seconds
True
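
The benchmark above is Python 2 and depends on downloading a page. Below is a hedged sketch of a similar comparison for Python 3, using `timeit` and a synthetic input string in place of the downloaded HTML. The names `with_regex` and `with_replace` are invented here, and the timings will vary by machine and Python version, so no figures are claimed.

```python
import re
import timeit

REPL_TU = (("&quot;", "\""), ("&apos;", "'"), ("&amp;", "&"),
           ("&lt;", "<"), ("&gt;", ">"),
           ("&laquo;", "<<"), ("&raquo;", ">>"),
           ("&#039;", "'"),
           ("&#8220;", "\""), ("&#8221;", "\""),
           ("&#8216;", "'"), ("&#8217;", "'"),
           ("&#9632;", ""), ("&#8226;", "-"))
REPL_DICT = dict(REPL_TU)

# re.escape is a precaution in case an entry ever contains regex metacharacters.
REGX = re.compile("|".join(re.escape(a) for a, _ in REPL_TU))

def with_regex(text):
    """Single pass: each match is looked up in the dict."""
    return REGX.sub(lambda m: REPL_DICT[m.group()], text)

def with_replace(text):
    """One full pass over the text per table entry."""
    for entity, replacement in REPL_TU:
        text = text.replace(entity, replacement)
    return text

# Synthetic stand-in for a downloaded page (an assumption, not real HTML).
sample = 'A tag &lt;bidi&gt; has a &quot;weird&#8220;&#8226;&apos;content&apos; ' * 2000

if __name__ == "__main__":
    for func in (with_regex, with_replace):
        seconds = timeit.timeit(lambda: func(sample), number=20)
        print(func.__name__, seconds, "seconds")
```

The two functions agree on this sample, but they can differ on input such as `&amp;lt;`: the chained version decodes it twice to `<`, while the single pass yields `&lt;`.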
eyquem