
I have a string like:

<tr><td>abc</td><td style="any" class="marked">dfg</td><td>hij</td></tr>

and I am trying to extract the marked td tag with this regexp:

/<td.*class="marked.*<\/td>/si

but I get this instead:

<td>abc</td><td style="any" class="marked">dfg</td><td>hij</td>

How should I change my regexp to get this string?

<td style="any" class="marked">dfg</td>
Raptor
Vito
    You should see [this answer](http://stackoverflow.com/a/1732454/1864610) –  Jan 20 '14 at 16:26

2 Answers


.* is greedy and will match as much as possible.

.*? is lazy and will match as little as possible.

tl;dr: use .*? instead.
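
To make the difference concrete, here is a minimal sketch in Python (an assumption on my part; the question does not say which language is in use). The s and i flags map to re.S and re.I. It runs the original greedy pattern and a lazy variant against the string from the question.

```python
import re

html = '<tr><td>abc</td><td style="any" class="marked">dfg</td><td>hij</td></tr>'

# The s and i flags from the question correspond to re.S and re.I here.
flags = re.S | re.I

# Greedy: each .* grabs as much as it can, so the match runs to the last </td>.
greedy = re.search(r'<td.*class="marked.*</td>', html, flags).group(0)

# Lazy: each .*? grabs as little as it can, so the match stops at the
# first </td> that follows class="marked.
lazy = re.search(r'<td.*?class="marked.*?</td>', html, flags).group(0)

print(greedy)
print(lazy)
```

Note that the lazy version only trims the tail. The match still begins at the first <td in the string, because the regex engine takes the leftmost position where a match is possible, so the leading <td>abc</td> is still included.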

That said, regex is not an HTML parser, but we've been through this many times before.

h2ooooooo

You have two issues:

  1. Your expression doesn't guarantee that class="marked" is associated with the same tag as the <td at the start of the expression.
  2. The .*<\/td> at the end is greedy and will match all the way to the last closing </td>.

This pattern will address both these issues:

/<td[^>]+class="marked">.*?<\/td>/si
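
As a quick check, here is the same pattern applied to the string from the question. This is a sketch in Python (assumed, since the question does not name a language), with the s and i flags written as re.S and re.I.

```python
import re

html = '<tr><td>abc</td><td style="any" class="marked">dfg</td><td>hij</td></tr>'

# [^>]+ cannot cross a tag boundary, so class="marked" must sit inside the
# same opening tag as <td; the lazy .*? then stops at the nearest </td>.
pattern = re.compile(r'<td[^>]+class="marked">.*?</td>', re.S | re.I)

match = pattern.search(html)
print(match.group(0))
```

The first <td> is skipped because [^>]+ needs at least one character before the closing >, and a bare <td> has none.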
jmar777
  • As mentioned by @h2ooooooo in his answer, parsing HTML with regex is generally a bad idea. It's only acceptable when you can make certain guarantees about the markup. For example, this pattern will fail with `class="foo marked"`, or for a host of other reasons if the HTML isn't very similar to what was in your example. – jmar777 Jan 20 '14 at 16:30
  • jmar777: yes, I know that regexp is not very good for HTML tasks. Thanks for the answer. – Vito Jan 20 '14 at 16:49
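
For anyone who cannot make those guarantees about the markup, a parser-based version is not much longer. The following is only a sketch, assuming Python and its standard-library html.parser module; the class name MarkedCellParser is made up for illustration. It collects the text of every td whose class list contains marked, and it also shows the regex pattern missing the `class="foo marked"` case mentioned in the comment above.

```python
import re
from html.parser import HTMLParser


class MarkedCellParser(HTMLParser):
    """Hypothetical helper: collects the text of <td> cells with class 'marked'."""

    def __init__(self):
        super().__init__()
        self.cells = []       # text of each marked cell found so far
        self._inside = False  # True while we are inside a marked <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # The class attribute may hold several space-separated names.
            classes = (dict(attrs).get("class") or "").split()
            if "marked" in classes:
                self._inside = True
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.cells[-1] += data


html = '<tr><td>abc</td><td style="any" class="foo marked">dfg</td><td>hij</td></tr>'

# The regex expects class="marked" exactly, so it finds nothing here.
regex_hit = re.search(r'<td[^>]+class="marked">.*?</td>', html, re.S | re.I)

parser = MarkedCellParser()
parser.feed(html)
parser.close()

print(regex_hit)
print(parser.cells)
```

This sketch does not handle nested tables inside a marked cell, and it returns the cell text rather than the original markup, so treat it as a starting point only.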