Using the re module to parse HTML is not really necessary, since there are many well-written libraries for that. Still, one way to achieve what you want is to:
- parse the tags;
- change their inner HTML.
Let's say you have some HTML:
a = """
<title>GateUser UserGate</title>
<div style="something">
KameHame Ha
</div>
"""
Now you can fairly easily parse the tags, including the inner HTML:
import re

blanks = r"(\s*)"  # optional whitespace around the inner text; \s already covers \n and \t
pat = re.compile(r"(<[^>]+>){0}(.*?){0}(</[^>]+>)".format(blanks))
# findall returns tuples, which don't support item assignment, so convert each one to a list
tags_with_inner = list(map(list, pat.findall(a)))
# [['<title>', '', 'GateUser UserGate', '', '</title>'],
#  ['<div style="something">', '\n', 'KameHame Ha', '\n', '</div>']]
Then apply your regex to the inner text only:
only_inner = re.compile(r"\b\w{8}\b")  # your expression
for inner in tags_with_inner:
    inner[2] = only_inner.sub("ADDED", inner[2])  # index 2 holds the inner text
    print("".join(inner))
# <title>ADDED ADDED</title>
# <div style="something">
# ADDED Ha
# </div>
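For comparison, here is a rough sketch of the same replacement done with a real parser, using only the standard library's html.parser. The names InnerTextReplacer and replace_inner_text are made up for this example. It re-emits the markup as it goes and runs the regex on text nodes only, so tag names and attribute values are never touched. Treat it as a starting point under those assumptions rather than a complete solution; for instance, the contents of script and style elements are also delivered as text, so they would be rewritten too.

```python
import re
from html.parser import HTMLParser


class InnerTextReplacer(HTMLParser):
    """Hypothetical helper: re-emit the HTML, running a regex on text nodes only."""

    def __init__(self, pattern, replacement):
        # convert_charrefs=False so entities like &amp; arrive via
        # handle_entityref/handle_charref and can be written back unchanged
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(pattern)
        self.replacement = replacement
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as written in the source
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/>
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # the only place where the substitution is applied
        self.out.append(self.pattern.sub(self.replacement, data))

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))

    def handle_comment(self, data):
        self.out.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.out.append("<!{}>".format(decl))


def replace_inner_text(html, pattern, replacement):
    parser = InnerTextReplacer(pattern, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


a = """
<title>GateUser UserGate</title>
<div style="something">
KameHame Ha
</div>
"""
print(replace_inner_text(a, r"\b\w{8}\b", "ADDED"))
```

Unlike the loop above, this keeps everything between and around the tags, and it copes with nested tags, which the regex approach does not.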