PHP/Perl Regular expression help!

Question

I have a string:

$string = 'This is my big <span class="big-string">string</span>';

I cannot figure out how to write a regular expression that will replace the 'b' in 'big' without replacing the 'b' in 'big-string'. I need to replace all occurrences of a substring except when that substring appears inside an HTML tag.

Any help is appreciated!

Edit

Maybe some more info will help. I'm working on an autocomplete feature that highlights whatever you're searching for in the current result set. Currently, if you have typed 'aut' in the search dialog, the results look like this: <b>aut</b>omotive

The problem appears when I search for 'auto b'. First I replace all occurrences of 'auto' with '<b>auto</b>', then I replace all occurrences of 'b' with '<b>b</b>'. Unfortunately this second sweep also hits the 'b' in the tags inserted by the first sweep, so it changes '<b>auto</b>' to '<<b>b</b>>auto</<b>b</b>>'.
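
For anyone trying to reproduce this, here is a minimal sketch of the two-sweep problem, followed by one possible workaround that highlights every term in a single pass so that inserted tags are never rescanned. The label 'automotive body' and the variable names are made up for illustration, and the single-pass idea assumes the result label is plain text with no markup of its own.

```php
<?php
// Hypothetical result label, for illustration only.
$label = 'automotive body';

// Two sweeps: the second sweep also rewrites the "b" inside the
// <b> and </b> tags that the first sweep inserted.
$twoPass = str_replace('auto', '<b>auto</b>', $label);
$twoPass = str_replace('b', '<b>b</b>', $twoPass);
echo $twoPass, "\n";
// <<b>b</b>>auto</<b>b</b>>motive <b>b</b>ody

// One sweep: build a single alternation from all search terms,
// so text inserted by the replacement is never scanned again.
$terms  = ['auto', 'b'];
$quoted = array_map(function ($term) {
    return preg_quote($term, '/');
}, $terms);
$pattern = '/' . implode('|', $quoted) . '/i';

// $0 in the replacement stands for the whole matched text.
$onePass = preg_replace($pattern, '<b>$0</b>', $label);
echo $onePass, "\n";
// <b>auto</b>motive <b>b</b>ody
```

With an alternation, the first alternative that matches at a position wins, so listing longer terms first keeps a short term from pre-empting a longer one.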

Is your real string mainly HTML or do you have only small snippets? — Felix Kling, Jul 08 '11 at 22:30
Here is a hint...look at the surrounding chars of each "b". One has whitespace before it and the other is in quotes. — brady.vitrano, Jul 08 '11 at 22:31
I had a similar problem once. I replaced all `div`s with `span`s and ended up with several elements with the class "**spanider**". — Nicole, Jul 08 '11 at 22:35
using 'b' is just an example. I might need to replace 'a' but I don't want to mess up 'class' — orourkedd, Jul 08 '11 at 22:55
@user1750 if your search string isn't specific enough to rule out a match at the start of an attribute, then you will probably need to parse your HTML. [This answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) famously explains why. Some rough regexes can be run against HTML, but they will be just that: rough. Detecting whether a string is inside a tag or not is very hard unless your search is already very specific. — Nicole, Jul 08 '11 at 23:07

score 3 · Accepted Answer · answered Jul 08 '11 at 23:22

Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, walk over the text nodes and do a simple str_replace. You'll thank me around debugging time.
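
As a rough sketch of what that could look like with PHP's bundled DOM extension: the helper name `replaceInTextNodes` and the wrapper `div` are assumptions made for this example, not something from the answer, and the input is assumed to be a small ASCII HTML fragment.

```php
<?php
// Hypothetical helper: apply str_replace to text nodes only.
function replaceInTextNodes(string $html, string $search, string $replace): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    // Wrap the fragment so its contents can be found again after parsing.
    // Non-ASCII input would also need an encoding hint, which is omitted here.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $wrap  = $xpath->query('//div[@id="wrap"]')->item(0);

    // Text nodes never contain tag names or attribute values.
    foreach ($xpath->query('.//text()', $wrap) as $text) {
        $text->nodeValue = str_replace($search, $replace, $text->nodeValue);
    }

    // Serialize the wrapper's children back into an HTML fragment.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$string = 'This is my big <span class="big-string">string</span>';
echo replaceInTextNodes($string, 'b', 'B'), "\n";
// This is my Big <span class="big-string">string</span>
```

Because the replacement is written into a text node, markup in it (such as `<b>b</b>`) would come out escaped. Wrapping matches in a real element would need something like `DOMText::splitText` plus `DOMDocument::createElement` instead of a plain `str_replace`.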

Sander Marechal

+1: apparently we have to tell every single person this individually when they discover regular expressions – Brad Mace Jul 09 '11 at 03:33

score 0 · Answer 2 · answered Jul 08 '11 at 22:33

Is there a guarantee that 'big' won't be immediately preceded by "? If so, then s/([^"])b/$1foo/ should replace the b in question with foo.

what does the $1 mean? I've seen it around in a few places – orourkedd Jul 08 '11 at 22:52
no guarantee. I basically need a regular expression to ignore anything between the tag characters < > – orourkedd Jul 08 '11 at 22:54
In that case, you shouldn't be using a regex. Also, $1, $2, etc. hold the text captured by the first, second, etc. pair of parentheses in the pattern. Take a look at `perldoc perlrequick`. – Jul 11 '11 at 13:59
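
To make the `$1` point concrete, here is a small sketch of the same substitution in PHP, where `preg_replace` accepts `$1` in the replacement string. The second input string is hypothetical and only there to show why the "not preceded by a quote" assumption is fragile.

```php
<?php
$string = 'This is my big <span class="big-string">string</span>';

// ([^"]) captures the one character in front of the "b";
// $1 in the replacement puts that character back.
// The limit of 1 mirrors Perl's s/// without the /g flag.
$first = preg_replace('/([^"])b/', '$1foo', $string, 1);
echo $first, "\n";
// This is my foig <span class="big-string">string</span>

// Hypothetical input where the "b" in the attribute is not preceded by a quote:
$tricky = '<span class="string-big">big</span>';
$second = preg_replace('/([^"])b/', '$1foo', $tricky);
echo $second, "\n";
// <span class="string-foig">foig</span>
```

The second call rewrites the class name as well as the visible text, which is the failure mode the question is trying to avoid.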

score 0 · Answer 3 · answered Jul 12 '11 at 02:12

If you insist upon using a regex, this one will do a pretty decent job:

$re = '/# (Crudely) match a sub-string NOT in an HTML tag.
    big        # The sub-string to be matched.
    (?=        # Assert we are not inside an HTML tag.
      [^<>]*   # Consume all non-<> up to...
      (?:<\/?\w+  # either an HTML start or end tag,
      | $      # or the end of string.
      )        # End group of valid alternatives.
    )          # End "not-in-html-tag" lookahead assertion.
    /ix';

Caveats: This regex has very real limitations. The HTML must not have any angle brackets in the tag attributes. This regex also finds the target substring inside other parts of the HTML file such as comments, scripts and stylesheets, and this may not be desirable.
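
A quick usage sketch, using a compact one-line form of the same lookahead idea with `preg_replace`. This form accepts either a start tag or an end tag after the text; the replacement 'BIG' and the second input are made up for illustration.

```php
<?php
// The match must be followed by non-bracket characters and then
// either a tag opener or the end of the string.
$re = '/big(?=[^<>]*(?:<\/?\w+|$))/i';

$string = 'This is my big <span class="big-string">string</span>';
$plain  = preg_replace($re, 'BIG', $string);
echo $plain, "\n";
// This is my BIG <span class="big-string">string</span>

// Hypothetical input with the target inside an element and at the end.
$nested = preg_replace($re, 'BIG', '<p>big deal</p> big');
echo $nested, "\n";
// <p>BIG deal</p> BIG
```

Inside a tag, the lookahead runs into the closing `>` before it can reach another `<` or the end of the string, so the assertion fails and the attribute value is left alone.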
