5

*Note: The output of the Array() is a PHP print_r()*

I have this HTML tag:

<tr>
    <td width="40" align="left"><div class="icSkill" id="skill4"></div></td>
    <td colspan="2">SOME_VALUE_I_WANT&nbsp;</td>
</tr>

I really want to extract this with RegEx and don't want to use HTML parsers in this case.

I use this regex (with the s flag, so that . also matches the file's newlines):

\<tr\>\<td\swidth="40"\salign="left"\>\<div\s+class="icSkill"\s+id="skill(\d+)".*\<\/tr\>

The problem is that the regex doesn't stop at the first closing TR tag, which is what I want. I know it probably has something to do with assertions, but I don't know how to use them.

Array
(
    [0] => <tr><td width="40" align="left"><div class="icSkill" id="skill4"></div></td><td colspan="2">SOME_VALUE_I_WANT&nbsp;
</td></tr><tr><td rowspan="2" align="left"><div class="icGuard" id="guard9"></div></td></tr>
    [1] => 4
)

The basic examples like /[^<]*/ won't work in this case. Is there also a way to tell regex something like:

/[^A_STRING]*/ (in words: keep matching until you find A_STRING)
OR, A BETTER EXAMPLE:
/[^A_STRING_FIRST_TIME]*/ (in words: keep matching until you find A_STRING for the FIRST_TIME)
Martin Ender
  • Why do you not want to use an HTML parser in this case? – Asad Saeeduddin Dec 12 '12 at 14:46
  • Where is your code? We can't see the modifiers you are using. Most likely you are missing the `U` modifier. Besides, you should get used to using HTML parsers (e.g. DOMDocument) – Alex Dec 12 '12 at 14:47
  • Maybe I should've asked the question differently, sorry. I just want to know if I can do /[^a]/ where "a" would be a string. I don't want /[^abc]/, because it treats each of the characters as a separate exclusion. –  Dec 12 '12 at 14:47
  • @Alex: I'm using the global and dotall flags. –  Dec 12 '12 at 14:48
  • Also, why are you escaping angle brackets? – Asad Saeeduddin Dec 12 '12 at 14:49
  • @Allendar PHP doesn't have a global flag. Globalness is determined by the function you use. – Martin Ender Dec 12 '12 at 14:49
  • @m.buettner: I use preg_match_all(). –  Dec 12 '12 at 14:51
  • @Allendar then using this function is enough to make the search global. no need to use `g` (because it simply doesn't exist in PHP) – Martin Ender Dec 12 '12 at 14:51
  • @RohitJain: this seems to work in this case, thank you. –  Dec 12 '12 at 14:51
  • @m.buettner: Yes sorry my bad. I only do /REGEX/s in the preg_match_all(). I was just noting it because it's set as active in the tool I test my Regex in :) –  Dec 12 '12 at 14:52

2 Answers

9

The problem is greediness. .* consumes as much as it can. You can make it ungreedy by appending ?:

~<tr><td\s+width="40"\s+align="left"><div\s+class="icSkill"\s+id="skill(\d+)".*?</tr>~s

Also, as you can see, there is really no need to do so much escaping. It only hinders legibility.
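To see the difference between the two quantifiers, here is a minimal PHP sketch. The input string is an assumption, modelled on the single-line markup in the question's print_r() output, and the variable names ($html, $prefix, $greedy, $lazy) are only for illustration.

```php
<?php
// Sample input (assumed): the two rows from the question's output, on one line.
$html = '<tr><td width="40" align="left"><div class="icSkill" id="skill4"></div></td>'
      . '<td colspan="2">SOME_VALUE_I_WANT&nbsp;</td></tr>'
      . '<tr><td rowspan="2" align="left"><div class="icGuard" id="guard9"></div></td></tr>';

// Shared start of both patterns; ~ is the delimiter, so / needs no escaping.
$prefix = '<tr><td\s+width="40"\s+align="left"><div\s+class="icSkill"\s+id="skill(\d+)"';

// Greedy: .* runs to the end of the input, then backtracks to the LAST </tr>.
preg_match('~' . $prefix . '.*</tr>~s', $html, $greedy);

// Lazy: .*? grows one character at a time and stops at the FIRST </tr>.
preg_match('~' . $prefix . '.*?</tr>~s', $html, $lazy);

echo substr_count($greedy[0], '</tr>'), "\n"; // number of rows swallowed by the greedy match
echo substr_count($lazy[0], '</tr>'), "\n";   // number of rows in the lazy match
echo $lazy[1], "\n";                          // the captured skill id
```

The only change between the two calls is the ? after .*, which is why the local variant is easy to apply to a single quantifier.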

An alternative way to make repetition ungreedy is to use the modifier U, which inverts greediness for the whole pattern: every quantifier becomes ungreedy by default, and appending ? makes it greedy again. I prefer the local variant (using ?), though.
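A small PHP sketch of what the U modifier does, using a made-up input string ($text and the result variable names are assumptions for illustration). Note that U swaps the meaning of a trailing ?, so .*? becomes the greedy form under it.

```php
<?php
$text = '<b>one</b><b>two</b>';

// Without U: .* is greedy, .*? is lazy.
preg_match('~<b>.*</b>~', $text, $greedy);
preg_match('~<b>.*?</b>~', $text, $lazy);

// With U the defaults are swapped: .* is lazy, .*? is greedy.
preg_match('~<b>.*</b>~U', $text, $uPlain);
preg_match('~<b>.*?</b>~U', $text, $uQuestion);

var_dump($uPlain[0] === $lazy[0]);      // plain .* under U behaves like .*? without U
var_dump($uQuestion[0] === $greedy[0]); // .*? under U behaves like .* without U
```

Because U silently changes how every quantifier in the pattern reads, the per-quantifier ? tends to be easier to follow.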

In any case, there is another option that mimics what [^A_STRING]* was meant to do (the character class itself doesn't work, because it matches any run of characters that are not A, _, S, T, R, I, N or G). You can use a negative lookahead at every position of the repetition:

(?:(?!A_STRING).)*

(substitute this for .* or .*?). It should be equivalent in most cases, but execution time might differ. Plus, it's a little harder to decipher.
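The following PHP sketch compares the character class with the lookahead construct. The word STOP stands in for A_STRING, and the sample text is invented for the comparison; both are assumptions, not part of the question.

```php
<?php
$text = 'Peter Smith STOP ignore this STOP and this';

// [^STOP]* excludes the single characters S, T, O and P,
// so it already gives up at the "P" of "Peter".
preg_match('~^[^STOP]*~', $text, $charClass);

// (?:(?!STOP).)* consumes one character at a time, but only while the text
// at the current position does not start with the whole string "STOP".
preg_match('~^(?:(?!STOP).)*~s', $text, $tempered);

// The lazy quantifier reaches the same point when it is followed by the stop string.
preg_match('~^(.*?)STOP~s', $text, $lazy);

var_dump($charClass[0]); // what the character class matched
var_dump($tempered[0]);  // everything before the first "STOP"
var_dump($lazy[1]);      // same prefix, captured by the lazy version
```

One practical difference is that the lookahead version does not need the stop string to follow it in the pattern, so it also matches when the string never occurs.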

Martin Ender
  • Thanks, really great! I mostly wanted to know "string"-matching for the future too. Concerning the over exaggerated escaping; I'm trying to keep to the illegal characters on the cheat-sheet I use from AddedBytes.com. It states Metacharacters that must be escaped: ^[.${*(\+)|?<> Yet somehow PHP sometimes seems to choke if I don't escape the forward slash (/). Once again my thanks ^^ –  Dec 12 '12 at 15:01
  • @Allendar `/` only needs to be escaped if you use `/` as your delimiter. Hence, never escape `/`. Instead, look for a delimiter character that is not part of your regex. For the others, I agree with all of these, but `<>` are only meta-characters if used like `(?...` or `(?P...`. So you can usually leave them unescaped, too. Regular expressions are usually tough to read anyway, so I would go for reducing clutter by escaping as little as possible. (Also, in character classes you only need to escape `^`, `]`, `\` and `-`, just FYI) – Martin Ender Dec 12 '12 at 15:05
  • Thanks M, this really helps me a lot! I did some regex in the past, but every time I pick it up again I feel like I have to relearn it, haha. –  Dec 12 '12 at 15:07
  • @Allendar the page I've linked twice is a very good read to actually properly get your head around regular expressions, so you will have to relearn less of it next time you need to use them: [www.regular-expressions.info](http://www.regular-expressions.info/tutorial.html) – Martin Ender Dec 12 '12 at 15:08
  • That lazy match saved my life. I started to learn lookahead and lookbehind matches when all my regex needed was a lazy match `*?` instead of a greedy match `*` – xploreraj May 16 '16 at 20:26
1

This is a tough one. Usually you'd have a class identifier in there which would make it easier.

So let's make sure that I understand what you want: You need to capture whatever is within the last <td> tag, just before we close the table row. In that case, you need a negative lookahead:

<td(?!.*?<td).*?>(.*?)<\/td>

This, together with the s modifier, will capture SOME_VALUE_I_WANT&nbsp;, provided it is in the last <td> element in the table row.

The only element in this regex which is not straightforward is the negative lookahead <td(?!.*?<td), which matches only a <td> tag that is not followed by another <td> anywhere later in the input.

Also, when you use the star operator, you usually want to make it non-greedy, as follows: (.*?). This means it stops at the first possible match.
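Here is a short PHP sketch of this pattern in use. The inputs are assumptions modelled on the question's markup, and ~ is used as the delimiter so the / in </td> needs no escaping. The second call illustrates a limitation: because the lookahead scans the rest of the whole input, with several rows the pattern picks the last <td> overall, not the last one in the first row.

```php
<?php
$pattern = '~<td(?!.*?<td).*?>(.*?)</td>~s';

// One row: the last <td> is the one holding the wanted value.
$row = '<tr><td width="40" align="left"><div class="icSkill" id="skill4"></div></td>'
     . '<td colspan="2">SOME_VALUE_I_WANT&nbsp;</td></tr>';
preg_match($pattern, $row, $single);
echo $single[1], "\n";

// Two rows: the lookahead rejects every <td> that has another <td> after it,
// so the match moves to the last cell of the last row.
$two = $row . '<tr><td rowspan="2" align="left"><div class="icGuard" id="guard9"></div></td></tr>';
preg_match($pattern, $two, $multi);
echo $multi[1], "\n";
```

If the input holds more than one row, it is probably safer to isolate each row first (for example with a lazy match up to </tr>) and apply this pattern to that row only.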

NitayArt
  • Thanks for the in-depth description Nitay. This makes things very clear ^^ –  Dec 12 '12 at 15:47
  • Won't that get the very last `<td>` in the input, regardless of which `<tr>` it is in? – Martin Ender Dec 12 '12 at 17:29
  • @m.buettner Yes. It's not exactly clear from the OP what are the exact characteristics of the pattern he is trying to match. That's the way I understood it. Anyway, parsing HTML with regexes is probably not something we should do very often, they're too obtuse for this purpose. – NitayArt Dec 12 '12 at 20:42
  • agreed. there is a reason for [this post](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) ;) – Martin Ender Dec 12 '12 at 20:54