I'm creating a Django filter that inserts 'a' tags into a given string, based on a list of tags.

This is what I have so far:

def tag_me(text):
    tags = ['abc', 'def', ...]
    tag_join = "|".join(tags)
    regex = re.compile(r'(?=(.))(?:'+ tag_join + ')', flags=re.IGNORECASE)
    return regex.sub(r'<a href="/tag/\1/">\1</a>', text)

Example:

tag_me('some text def')

Returns:

'some text <a href="/tag/d/">d</a>'

Expected:

'some text <a href="/tag/def/">def</a>'

The issue seems to lie in regex.sub: it matches, but the replacement contains only the first character. Is there a problem with the way I'm capturing or using \1 on the last line?

Ogre

  • As always, it's worth noting that [regular expressions are not the best tool for the job of parsing HTML](http://stackoverflow.com/a/1732454/722121). – Gareth Latty Mar 08 '13 at 18:12
  • Yeah, duly noted. Originally I was looping over all tags but wondered if regex would be a performance booster...? – Ogre Mar 08 '13 at 18:14
  • Is performance an issue? Even if it is, I sincerely doubt a regex will be significantly faster than a good HTML parser, and it is far more likely to be fragile. – Gareth Latty Mar 08 '13 at 18:16
  • @Lattyware: Ogre is parsing text and generating HTML, not (necessarily) parsing HTML. – RichieHindle Mar 08 '13 at 18:20

4 Answers

Note that the sequence (?:...) in the question specifically turns off capture. See the re documentation (about a fifth of the way down the page), which (with emphasis added) says:

(?:...) A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

As noted in a previous answer, '(' + tag_join + ')' works. Use the suggested "|".join(re.escape(tag) for tag in tags) version if the tags may contain characters that have a special meaning in a regex.
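
To make the difference concrete, here is a minimal sketch that runs the question's pattern and a capturing-group pattern side by side. The tag alternation is hard-coded to the two example tags, and the variable names are only illustrative.

```python
import re

text = 'some text def'
template = r'<a href="/tag/\1/">\1</a>'

# Question's pattern: the lookahead (?=(.)) captures a single character as
# group 1, while the non-capturing group (?:...) consumes the whole tag.
broken = re.compile(r'(?=(.))(?:abc|def)', flags=re.IGNORECASE)

# Capturing group: group 1 is the whole matched tag.
fixed = re.compile(r'(abc|def)', flags=re.IGNORECASE)

broken_result = broken.sub(template, text)
fixed_result = fixed.sub(template, text)

print(broken_result)  # some text <a href="/tag/d/">d</a>
print(fixed_result)   # some text <a href="/tag/def/">def</a>
```

In both cases the match spans the full tag; only the content of group 1, and therefore what \1 expands to, differs.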

James Waldby - jwpat7

You're capturing the (.) part, which is only one character.

I'm not sure I follow your regular expression - the simplified version r'('+ tag_join + ')' works fine for your example.

Note that if there's a chance of anything other than alphanumeric characters in your tag names, you'll want to do this:

tag_join = "|".join(re.escape(tag) for tag in tags)
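
As a rough illustration of why the escaping matters, the sketch below uses two made-up tags that contain regex metacharacters. Passing the tag list as a parameter is an assumption made for the example, not something in the question's code.

```python
import re

def tag_me(text, tags):
    # Escape each tag so characters such as '+' and '.' match literally
    # instead of being interpreted as regex syntax.
    tag_join = "|".join(re.escape(tag) for tag in tags)
    regex = re.compile('(' + tag_join + ')', flags=re.IGNORECASE)
    return regex.sub(r'<a href="/tag/\1/">\1</a>', text)

# Hypothetical tags chosen only because they contain special characters.
result = tag_me('I like c++ and node.js', ['c++', 'node.js'])
print(result)
# I like <a href="/tag/c++/">c++</a> and <a href="/tag/node.js/">node.js</a>
```

Without re.escape, the '+' and '.' in these tags would be treated as a quantifier and a wildcard rather than as literal characters.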
RichieHindle

Simply do

import re

def tag_me(text):
    tags = ['abc', 'def']
    # 'abc|def'.join('()') wraps the alternation in a capturing group: '(abc|def)'
    reg = re.compile("|".join(tags).join('()'),
                     flags=re.IGNORECASE)
    return reg.sub(r'<a href="/tag/\1/">\1</a>', text)

print('            %s' % tag_me('some text def'))
print('wanted:     some text <a href="/tag/def/">def</a>')

The problem is that you wrote a non-capturing group (?:...), which forced you to put that awkward (?=(.)) lookahead in front to capture anything, and that lookahead captures only a single character.

eyquem

This should do it:

import re

def tag_me(text):
    tags = ['abc', 'def', ]
    tag_join = "|".join(tags)
    # A capturing group, so \1 refers to the whole matched tag
    pattern = r'('+tag_join+')'
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return regex.sub(r'<a href="/tag/\1/">\1</a>', text)
Julien Grenier