0

What's wrong with my regex?

"/Blabla\(2\)&nbsp;:.*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uis"

....

<tr>
<td class="aaa">Blabla(1)&nbsp;:</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1</td><td class="generic">word2 </td><td class="generic">word3</td></tr>
<tr><td class="generic">word4</td><td class="generic">word5 </td><td class="generic">word6</td></tr>
</tbody></table>
</td>
</tr>

<tr>
<td class="aaa">Blabla(2)&nbsp;:</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1b</td><td class="generic">word2b </td><td class="generic">word3b</td></tr>
<tr><td class="generic">word4b</td><td class="generic">word5b </td><td class="generic">word6b</td></tr>
</tbody></table>
</td>
</tr>

I want to get the content of the first TD of each TR in the block that begins with Blabla(2).

So the expected result is word1b and word4b, but only the first is returned.

Thank you for your help. Please don't tell me to use a DOM navigator; that isn't possible in my case.

Andy Lester
wewereweb
  • 3
    Which language are you using? And, most importantly, how are you using it? – Rohit Jain Oct 02 '13 at 17:08
  • 1
    **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Oct 02 '13 at 17:09
  • @Barmar The `U` flag turns `*` into ungreedy and `*?` into greedy – Jerry Oct 02 '13 at 17:15

2 Answers

1

That's an interesting regex; I learned about the ungreedy flag from it, nice!

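As for your problem: your pattern starts with the literal Blabla\(2\)&nbsp;:, which occurs only once in the input, so it can only ever match once. You can use \G, which matches at the position where the previous match ended, so that each later match continues from the row before it. The (?<!^) in front keeps \G from matching at the very start of the input. This needs the g flag and assumes a PCRE engine (if this is PHP, call preg_match_all instead, since PHP patterns have no g modifier):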

/(?:Blabla\(2\)&nbsp;:|(?<!^)\G).*<tr><td class=\"generic\">(.*)<\/td>.+<\/tr>/Uisg

regex101 demo

Or a little shorter with different delimiters:

'~(?:Blabla\(2\)&nbsp;:|(?<!^)\G).*<tr><td class="generic">(.*)</td>.+</tr>~Uisg'
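If it helps to see what \G is doing, here is a rough sketch of the same chaining in Python. This is only an illustration, not the PCRE pattern itself: Python's `re` has neither `\G` nor the `U` flag, so I write the lazy quantifiers out explicitly and emulate `\G` with `Pattern.match(string, pos)`, which only matches at exactly `pos`. The names `SAMPLE`, `START`, `ROW` and `first_cells` are made up for this example, and `SAMPLE` is just the markup from the question.

```python
import re

# Markup copied from the question (assumed to be representative of the real page).
SAMPLE = """\
<tr>
<td class="aaa">Blabla(1)&nbsp;:</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1</td><td class="generic">word2 </td><td class="generic">word3</td></tr>
<tr><td class="generic">word4</td><td class="generic">word5 </td><td class="generic">word6</td></tr>
</tbody></table>
</td>
</tr>

<tr>
<td class="aaa">Blabla(2)&nbsp;:</td>
<td>
<table class="bbb"><tbody>
<tr class="ccc"><th>title1</th><th>title2</th><th>title3</th></tr>
<tr><td class="generic">word1b</td><td class="generic">word2b </td><td class="generic">word3b</td></tr>
<tr><td class="generic">word4b</td><td class="generic">word5b </td><td class="generic">word6b</td></tr>
</tbody></table>
</td>
</tr>
"""

# The literal prefix: it occurs once, so it can only start the chain.
START = re.compile(r"Blabla\(2\)&nbsp;:")

# One row: skip lazily to the next row opener, capture the first cell,
# then consume the rest of that row. re.S lets "." cross newlines.
ROW = re.compile(r'.*?<tr><td class="generic">(.*?)</td>.+?</tr>', re.S | re.I)


def first_cells(html):
    """Return the first cell of each row found after the Blabla(2) label."""
    m = START.search(html)
    if m is None:
        return []
    pos = m.end()
    cells = []
    while True:
        # match(html, pos) anchors at pos, like \G anchors at the previous match end.
        m = ROW.match(html, pos)
        if m is None:
            break
        cells.append(m.group(1))
        pos = m.end()
    return cells


print(first_cells(SAMPLE))  # ['word1b', 'word4b']
```

One caveat that applies to this sketch and, as far as I can tell, to the \G pattern as well: because `.` can cross newlines, the chain would carry on into a later block if one followed Blabla(2), so you may need to cut the input at the end of the block first.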
Jerry
0

Thanks to @Jerry, I learned some new tricks today:

(Blabla\(2\)&nbsp;:.*?|\G)<tr><td class=\"generic\">\K([^<]+).+?<\/tr>\r\n
Darka