Python regex, avoid skipping brackets

Question

I want to replace every match of a regex with '*', but only if the match is outside of <>. The whole point is to not interfere with the HTML tags.

I use this to replace:

re.sub(r'SOMEREGEX(?=[^>]*(<|$))', '*', line)

However, I ran into this problem: if my regex is:

f.*k

Then this:

fzzzzzzzzz<HTMLTAG>zzzzzzzk

would become '*', which I don't want. How do I overcome this problem?

Constraints:

- All brackets are matched

- No nested brackets

- SOMEREGEX is provided by the user. I prefer not changing that.
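For reference, the behaviour described in the question can be reproduced with the minimal sketch below. The variable names (`user_regex`, `guarded`, `result`) are illustrative and not from the original post. The match spans the tag because `.` also matches `<` and `>`, and the lookahead only inspects what follows the end of the match.

```python
import re

# Example pattern from the question; in practice it is supplied by the user.
user_regex = r'f.*k'

# The lookahead from the question, appended to the user's pattern.
guarded = user_regex + r'(?=[^>]*(<|$))'

line = 'fzzzzzzzzz<HTMLTAG>zzzzzzzk'

# '.*' is greedy and runs straight through '<HTMLTAG>' to the final 'k';
# the lookahead then succeeds at end of string, so the whole line is replaced.
result = re.sub(guarded, '*', line)
print(result)
```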

[You can't parse html with regex](http://stackoverflow.com/a/1732454/350351) — Daenyth, Jun 16 '12 at 19:52

score 2 · Accepted Answer · edited Jun 16 '12 at 01:24

You could try replacing the . character - "any character at all" - with the character class [^<>], which matches any character except the angle brackets, <>. This would give the regex f[^<>]*k. This would match facebook but not face<b>book.
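A quick check of the character-class idea, as a sketch (the variable names are illustrative). The last case is an assumption-free but easy to overlook caveat: the bare pattern can still match text inside a tag, because nothing in it looks at the surrounding brackets.

```python
import re

# '.' replaced by '[^<>]' so a match can never cross an angle bracket.
pattern = re.compile(r'f[^<>]*k')

# No brackets between 'f' and 'k': the whole word is replaced.
plain = pattern.sub('*', 'facebook')

# A tag sits between 'f' and 'k': nothing matches, the line is unchanged.
tagged = pattern.sub('*', 'face<b>book')

# Caveat: text inside a tag can still match this bare pattern.
inside_tag = pattern.sub('*', '<fork>')

print(plain, tagged, inside_tag)
```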

There are still things that can go wrong with this, though. Have you considered using a proper HTML parser instead of regular expressions? BeautifulSoup is easy, tasty and fun.
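If the user-supplied regex must stay untouched, one alternative (not from the original answer, and not a real HTML parser) is to split the line into tags and text, then apply the regex to the text pieces only. The sketch below uses only the standard `re` module and relies on the question's constraints that brackets are matched and never nested. The function name `mask_outside_tags` and its parameters are hypothetical.

```python
import re

# One capturing group, so re.split keeps the tags in the result list.
TAG = re.compile(r'(<[^<>]*>)')


def mask_outside_tags(pattern, line, replacement='*'):
    """Apply the user's pattern to text between tags only (hypothetical helper)."""
    parts = TAG.split(line)
    out = []
    for i, part in enumerate(parts):
        # With a single capturing group, odd indexes are the tags themselves.
        if i % 2 == 1 or not part:
            out.append(part)  # tags and empty text pieces pass through as-is
        else:
            out.append(re.sub(pattern, replacement, part))
    return ''.join(out)


print(mask_outside_tags(r'f.*k', 'fzzzzzzzzz<HTMLTAG>zzzzzzzk'))
print(mask_outside_tags(r'f.*k', 'a fork <b>fk</b> end'))
```

Because each text piece is searched separately, a match can never span a tag, and anchors such as `^` and `$` in the user's pattern would apply to each piece rather than to the whole line.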

edited Jun 16 '12 at 01:24

Hugh Bothwell

answered Jun 15 '12 at 23:08

Li-aung Yip

The thing is, I don't have any control over the given regex. I just want to protect all my tags, and they can be anywhere in the middle of the sentence – Squall Leohart Jun 15 '12 at 23:27

Can you please clarify your constraints? (Edit your question to include any additional information that's relevant.) – Li-aung Yip Jun 15 '12 at 23:30

score 0 · Answer 2 · answered Jun 15 '12 at 23:37

Search between the end and start angle brackets:

re.sub(r'(^|>)f[^<]*k(<|$)', r'\1*\2', line)

The \1 and \2 are required to replace the angle brackets that the pattern may have removed from line.
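A small sketch of how this substitution behaves (the wrapper name `star_between_tags` is hypothetical). Note that, as written, the pattern only matches when the text between two brackets, or between a bracket and the start or end of the line, consists entirely of the match.

```python
import re


def star_between_tags(line):
    # Group 1 is '' at the start of the line or the closing '>' of a tag;
    # group 2 is the opening '<' of the next tag or '' at the end of the line.
    return re.sub(r'(^|>)f[^<]*k(<|$)', r'\1*\2', line)


print(star_between_tags('<b>fork</b>'))                  # brackets are put back
print(star_between_tags('fzzzzzzzzz<HTMLTAG>zzzzzzzk'))  # tag is left alone
print(star_between_tags('a fork'))                       # not adjacent to '^' or '>'
```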

answered Jun 15 '12 at 23:37

gvl
