As Christian König's link says, parsing HTML with regex is generally not a wise idea. If you're very careful you can sometimes get away with it when the HTML is relatively simple and stable, but your code is likely to break as soon as the format of the page you're parsing changes. But anyway...
The pattern given above does not work: it will also perform a replacement on "> and <".
Here's a way to do what you want. We use a function as the repl arg to re.sub, and we give the function a counter (as a function attribute) so it knows what id number to use. This counter gets incremented every time a replacement is made, but you are free to set the counter to any value you want before calling re.sub.
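import re

# Non-greedy match of the text between <span> and </span>, case-insensitive
pat = re.compile(r'<span>(.*?)</span>', re.I)

def repl(m):
    # Called by re.sub once per match; m.group(1) is the text inside the span
    fmt = '<span><a id="id{}" class="my_class">{}</a></span>'
    result = fmt.format(repl.count, m.group(1))
    repl.count += 1
    return result

# The counter lives on the function object, so it persists between calls
repl.count = 1

data = (
    "<span>Replace this</span> and <span>that</span>",
    "<span>Another</span> test <span>string</span> of <span>tags</span>",
)

for s in data:
    print('In : {!r}\nOut: {!r}\n'.format(s, pat.sub(repl, s)))

# Restart the numbering from a different value
repl.count = 10
for s in data:
    print('In : {!r}\nOut: {!r}\n'.format(s, pat.sub(repl, s)))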
output
In : '<span>Replace this</span> and <span>that</span>'
Out: '<span><a id="id1" class="my_class">Replace this</a></span> and <span><a id="id2" class="my_class">that</a></span>'
In : '<span>Another</span> test <span>string</span> of <span>tags</span>'
Out: '<span><a id="id3" class="my_class">Another</a></span> test <span><a id="id4" class="my_class">string</a></span> of <span><a id="id5" class="my_class">tags</a></span>'
In : '<span>Replace this</span> and <span>that</span>'
Out: '<span><a id="id10" class="my_class">Replace this</a></span> and <span><a id="id11" class="my_class">that</a></span>'
In : '<span>Another</span> test <span>string</span> of <span>tags</span>'
Out: '<span><a id="id12" class="my_class">Another</a></span> test <span><a id="id13" class="my_class">string</a></span> of <span><a id="id14" class="my_class">tags</a></span>'
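
If you'd rather not keep the counter as a function attribute, the same idea can be sketched with a closure: each replacement function gets its own private counter, so separate calls to re.sub can't interfere with each other. This is only a variation on the code above, not a different technique; the names make_repl, SPAN_PAT and start are made up for this illustration.

```python
import itertools
import re

# Same pattern as above; SPAN_PAT is just an illustrative name
SPAN_PAT = re.compile(r'<span>(.*?)</span>', re.I)

def make_repl(start=1):
    """Return a replacement function with its own counter (hypothetical helper)."""
    counter = itertools.count(start)

    def repl(m):
        # next(counter) yields start, start + 1, ... once per match
        fmt = '<span><a id="id{}" class="my_class">{}</a></span>'
        return fmt.format(next(counter), m.group(1))

    return repl

text = "<span>Replace this</span> and <span>that</span>"

# Each make_repl() call starts a fresh, independent numbering
result_from_1 = SPAN_PAT.sub(make_repl(), text)
result_from_10 = SPAN_PAT.sub(make_repl(10), text)

print(result_from_1)
print(result_from_10)
```

To keep the numbering running across several strings, as in the original example, create the function once (r = make_repl()) and pass that same r to every sub call.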