Python regex substituting in only the first character when compiling from a list

Question

I'm creating a Django filter that inserts 'a' tags into a given string for each tag in a list.

This is what I have so far:

def tag_me(text):
    tags = ['abc', 'def', ...]
    tag_join = "|".join(tags)
    regex = re.compile(r'(?=(.))(?:'+ tag_join + ')', flags=re.IGNORECASE)
    return regex.sub(r'<a href="/tag/\1/">\1</a>', text)

Example:

tag_me('some text def')

Returns:

'some text <a href="/tag/d/">d</a>'

Expected:

'some text <a href="/tag/def/">def</a>'

The issue seems to lie in regex.sub: the pattern matches, but only the first character ends up in the replacement. Is there a problem with the way I'm capturing or using \1 on the last line?

As always, it's worth noting that [regular expressions are not the best tool for the job of parsing HTML](http://stackoverflow.com/a/1732454/722121). — Gareth Latty, Mar 08 '13 at 18:12
Yeah, duly noted. Originally I was looping over all tags but wondered if regex would be a performance booster...? — Ogre, Mar 08 '13 at 18:14
Is performance an issue? Even if it is, I sincerely doubt a regex will be significantly faster than a good HTML parser, and it is far more likely to be fragile. — Gareth Latty, Mar 08 '13 at 18:16
@Lattyware: Ogre is parsing text and generating HTML, not (necessarily) parsing HTML. — RichieHindle, Mar 08 '13 at 18:20

score 3 · Accepted Answer · answered Mar 08 '13 at 18:47

Note that the (?:...) sequence in the question explicitly turns off capturing. See the re documentation (about a fifth of the way down the page), which says:

(?:...) A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

As noted in the previous answer, '('+ tag_join + ')' works. If the tags may contain characters that are special in regular expressions, build tag_join with the suggested "|".join(re.escape(tag) for tag in tags) version.
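To make the difference concrete, here is a minimal sketch comparing the pattern from the question with two alternatives. The variable names are made up for illustration. The third variant assumes you want to keep the non-capturing group; it relies on \g<0>, which re.sub replaces with the entire match.

```python
import re

tags = ['abc', 'def']
tag_join = "|".join(tags)
text = 'some text def'
link = r'<a href="/tag/\1/">\1</a>'

# Pattern from the question: the only capturing group is the (.) inside
# the lookahead, so \1 holds a single character.
lookahead = re.compile(r'(?=(.))(?:' + tag_join + ')', flags=re.IGNORECASE)
first = lookahead.sub(link, text)
print(first)   # some text <a href="/tag/d/">d</a>

# Capturing group around the alternation: \1 is now the whole tag.
capturing = re.compile('(' + tag_join + ')', flags=re.IGNORECASE)
second = capturing.sub(link, text)
print(second)  # some text <a href="/tag/def/">def</a>

# Non-capturing group kept; \g<0> in the replacement is the entire match.
non_capturing = re.compile('(?:' + tag_join + ')', flags=re.IGNORECASE)
third = non_capturing.sub(r'<a href="/tag/\g<0>/">\g<0></a>', text)
print(third)   # some text <a href="/tag/def/">def</a>
```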

Thanks! I missed what the start of the pattern does. Very thorough. — Ogre, Mar 08 '13 at 21:11

score 2 · Answer 2 · answered Mar 08 '13 at 18:18

You're capturing the (.) part, which is only one character.

I'm not sure I follow your regular expression - the simplified version r'('+ tag_join + ')' works fine for your example.

Note that if there's a chance of anything other than alphanumeric characters in your tag names, you'll want to do this:

tag_join = "|".join(re.escape(tag) for tag in tags)
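A small sketch of why the escaping matters, using a hypothetical tag 'a.b' that is not in the question. The helper build_pattern is invented for this illustration only. Without re.escape, the dot matches any character, so unrelated text gets tagged.

```python
import re

def build_pattern(tags, escape=True):
    # Hypothetical helper: wrap the alternation in one capturing group,
    # optionally escaping regex metacharacters in each tag.
    parts = (re.escape(t) if escape else t for t in tags)
    return re.compile('(' + '|'.join(parts) + ')', flags=re.IGNORECASE)

tags = ['a.b']          # made-up tag containing a metacharacter
text = 'axb and a.b'

unescaped = build_pattern(tags, escape=False).findall(text)
escaped = build_pattern(tags, escape=True).findall(text)

print(unescaped)  # ['axb', 'a.b']  (the dot matched the 'x' as well)
print(escaped)    # ['a.b']
```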

RichieHindle


@eyquem: Sorry, I don't understand your question...? – RichieHindle Mar 09 '13 at 07:38

score 2 · Answer 3 · answered Mar 08 '13 at 18:41

Simply do

import re

def tag_me(text):
    tags = ['abc', 'def']
    # "|".join(tags).join('()') wraps the alternation in parentheses: '(abc|def)'
    reg = re.compile("|".join(tags).join('()'),
                     flags=re.IGNORECASE)
    return reg.sub(r'<a href="/tag/\1/">\1</a>', text)

print('            %s' % tag_me('some text def'))
print('wanted:     some text <a href="/tag/def/">def</a>')

The problem arises because you wrote a non-capturing group (?:...), which then forced you to put that awkward (?=(.)) in front of it to capture anything at all, and it captures only one character.
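The "|".join(tags).join('()') expression in this answer may look odd at first. As a quick sketch of what it does: the second join uses the alternation as the separator between the two characters of the string '()', which is the same as concatenating the parentheses by hand. The variable names below are illustrative only.

```python
tags = ['abc', 'def']

alternation = "|".join(tags)       # 'abc|def'
wrapped = alternation.join('()')   # '(' + 'abc|def' + ')'

print(wrapped)                             # (abc|def)
print(wrapped == '(' + alternation + ')')  # True
```

The explicit '(' + alternation + ')' form is arguably easier to read, and both produce the same pattern string.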

score 1 · Answer 4 · answered Mar 08 '13 at 18:47

This should do it

import re

def tag_me(text):
    tags = ['abc', 'def']
    tag_join = "|".join(tags)
    # A capturing group around the whole alternation, so \1 is the full tag
    pattern = r'(' + tag_join + ')'
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return regex.sub(r'<a href="/tag/\1/">\1</a>', text)

Julien Grenier

