
I want to wrap the text inside each span of the following string in link tags. I do so with re.sub, and it works, but I also need each of the two link tags to have a different id. How can I achieve that?

input = "<span>Replace this</span> and <span>this</span>"
result = re.compile(r'>(.*?)<', re.I).sub(r'><a id="[WHAT TO PUT HERE?]" class="my_class">\1</a><', input)

The output should have a different id on each link tag:

"<span><a id="id1" class="my_class">Replace this</a></span> and <span><a id="id2" class="my_class">this</a></span>"
user3024710
  • So, you're trying to [parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454)? – Christian König May 04 '17 at 10:03
  • @ChristianKönig The HTML is generated by my own app, is simple, and the input format will always be the same. Anyway, thank you for the warning. – user3024710 May 05 '17 at 00:47

1 Answer


As Christian König's link says, parsing HTML with regex is generally not a wise idea. If the HTML is relatively simple and stable you can sometimes get away with it, provided you're very careful, but your code is likely to break as soon as the format of the page you're parsing changes. But anyway...

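The pattern given in the question does not work as intended: besides the text inside each span, it also matches "> and <", so the text between the two spans gets wrapped in a link as well.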
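To see the problem concretely, here is a small check using re.findall on the string from the question. The variable names `loose` and `strict` are just illustrative; the second pattern is the one used in the code further down.

```python
import re

s = "<span>Replace this</span> and <span>this</span>"

# The question's pattern captures any text sitting between '>' and '<',
# which includes the text between the two spans.
loose = re.findall(r'>(.*?)<', s)

# Anchoring the pattern on the span tags captures only the span contents.
strict = re.findall(r'<span>(.*?)</span>', s)

print(loose)   # ['Replace this', ' and ', 'this']
print(strict)  # ['Replace this', 'this']
```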

Here's a way to do what you want. We pass a function as the repl argument of re.sub, and we give that function a counter (as a function attribute) so it knows which id number to use. The counter is incremented every time a replacement is made, and it keeps its value between calls to re.sub, but you are free to set it to any value you want before calling re.sub.

import re

pat = re.compile(r'<span>(.*?)</span>', re.I)

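def repl(m):
    # Wrap the captured span text in a link whose id uses the current counter.
    fmt = '<span><a id="id{}" class="my_class">{}</a></span>'
    result = fmt.format(repl.count, m.group(1))
    # Advance the counter so the next match gets a new id.
    repl.count += 1
    return result

# Starting id, stored as an attribute on the function itself.
repl.count = 1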

data = (
    "<span>Replace this</span> and <span>that</span>",
    "<span>Another</span> test <span>string</span> of <span>tags</span>",
)

for s in data:
    print('In : {!r}\nOut: {!r}\n'.format(s, pat.sub(repl, s)))

repl.count = 10
for s in data:
    print('In : {!r}\nOut: {!r}\n'.format(s, pat.sub(repl, s)))

output

In : '<span>Replace this</span> and <span>that</span>'
Out: '<span><a id="id1" class="my_class">Replace this</a></span> and <span><a id="id2" class="my_class">that</a></span>'

In : '<span>Another</span> test <span>string</span> of <span>tags</span>'
Out: '<span><a id="id3" class="my_class">Another</a></span> test <span><a id="id4" class="my_class">string</a></span> of <span><a id="id5" class="my_class">tags</a></span>'

In : '<span>Replace this</span> and <span>that</span>'
Out: '<span><a id="id10" class="my_class">Replace this</a></span> and <span><a id="id11" class="my_class">that</a></span>'

In : '<span>Another</span> test <span>string</span> of <span>tags</span>'
Out: '<span><a id="id12" class="my_class">Another</a></span> test <span><a id="id13" class="my_class">string</a></span> of <span><a id="id14" class="my_class">tags</a></span>'
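If you would rather not keep the counter as shared state on the function, one possible variation is to build a fresh replacement callback for each call, with its own private counter from itertools.count. This is only a sketch of an alternative, not part of the approach above; the helper name `make_repl` and its `start` parameter are made up for illustration.

```python
import re
from itertools import count

pat = re.compile(r'<span>(.*?)</span>', re.I)

def make_repl(start=1):
    """Return a replacement callback with its own private id counter."""
    ids = count(start)  # yields start, start + 1, start + 2, ...
    fmt = '<span><a id="id{}" class="my_class">{}</a></span>'

    def repl(m):
        # Each match consumes the next id from this callback's counter.
        return fmt.format(next(ids), m.group(1))

    return repl

s = "<span>Replace this</span> and <span>that</span>"
out = pat.sub(make_repl(), s)          # ids start at 1
out_from_10 = pat.sub(make_repl(10), s)  # ids start at 10
print(out)
print(out_from_10)
```

Because every call to `make_repl` starts a new counter, the numbering restarts for each string unless you reuse the same callback, which is the opposite default from the function-attribute version.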
PM 2Ring