There really isn’t a possible regex solution that works for an arbitrary number of table data and puts each cell into a separate back reference. That’s because with backreferences, you need to have a distinct open paren for each backref you want to create, and you don’t know how many cells you have.
There’s nothing wrong with using looping of one or another sort to pull out the data. For example, on the last one, in Perl it would be this, given that $tr
already contains the row you need:
@td = ( $tr =~ m{<td.*?>(.*?)</td>}sg );
Now $td[0]
will contain the first <td>
, $td[1]
will contain the second one, etc. If you wanted a two-dimensional array, you might wrap that in a loop like this to populate a new @cells
variable:
our $table; # assume has full table in it
my @cells;
while(my($tr) =~ $table = m{<tr.*?>(.*?)</tr>}sg) {
push @cells, [ $tr =~ m{<td.*?>(.*?)</td>}sg ];
}
Now you can do two-dimensional addressing, allowing for $cells[0][0]
, etc. The outer explicit loop processes the table a row at a time, and the inner implicit loop pulls out all the cells.
That will work on the canned sample data you showed. If that’s good enough for you, then great. Use it and move on.
What Could Ever Be Wrong With That?
However, there are actually quite a few assumptions in your patterns about the contents of your data, ones I don’t know that you’re aware of. For one thing, notice how I’ve used /s
so that it doesn’t get stuck on newlines.
But the main problem is that minimal matches aren’t always quite what you want here. At least, not in the general case. Sometimes they aren’t as minimal as you think, matching more than you want, and sometimes they just don’t match enough.
For example, a pattern like <i>(.*?)</i>
will get more than you want if the string is:
<i>foo<i>bar</i>ness</i>
Because you will end up matching the string <i>foo<i>bar</i>
.
The other common problem (and not counting the uncommon ones) is that a pattern like <tag.*?>
may match too little, such as with
<img alt=">more" src="somewhere">
Now if you use a simplistic <img.*?>
on that, you would only capture <img alt=">
, which is of course wrong.
I think the last major remaining problem is that you have to altogether ignore certain things in parsing. The simplest demo of this embedded comments (also <script>
, <style>, and
CDATA`), since you could have something like
<i> some <!-- secret</i> --> stuff </i>
which will throw off something like <i>(.*?)</i>
.
There are ways around all these, of course. Once you’ve done so, and it is really quite a bit of effort, you’ll find that you have built yourself a real parser, completely with a lot of auxiliary logic, not just one pattern.
Even then you are only processing well-formed input strings. Error recovery and failing softly is an entirely different art.