
I have the following perl code:

# $content is the text of a webpage
while ($content =~ /rgRow.*?<td>(.*?)<\/td><td.*?>(.*?)<\/td><td.*?>(.*?)<\/td><td.*?>.*?<\/td><td.*?>(.*?)<\/td><td.*?><nobr>(.*?)<\/nobr><\/td>/sg) {
   # do stuff
}

I have worked out that the code is hanging at this regex call. It gets about 2-3 iterations into the while loop and then it just hangs. I have left it for about 30 mins and it has not proceeded.

What could be the problem?

The purpose of the code is to go through some HTML and extract some data out of it.

Here is the HTML that I am setting $content to:

<tbody>
        <tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__0">
            <td>CONSIDERATION OF REPORTS SUBMITTED BY STATES PARTIES UNDER ARTICLE 9 OF THE CONVENTION : SECOND PERIODIC REPORT OF STATES PARTIES DUE IN 1974 / MOROCCO</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.65/Add.1</td><td><nobr>21 Feb 1974</nobr></td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl04_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=CERD%2fC%2fR.65%2fAdd.1&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">CERD/C/R.65/Add.1</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__1">
            <td>CONSIDERATION OF REPORTS SUBMITTED BY STATES PARTIES UNDER ARTICLE 9 OF THE CONVENTION : INITIAL REPORTS OF STATES PARTIES WHICH ARE DUE IN 1972 / MOROCCO</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.33/Add.1</td><td><nobr>17 Jan 1972</nobr></td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl06_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=CERD%2fC%2fR.33%2fAdd.1&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">CERD/C/R.33/Add.1</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__2">
            <td>Annex I to ALGERIA's Report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl08_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13691&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13691_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13691</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__3">
            <td>Annex II to ALGERIA's report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl10_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13692&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13692_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13692</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__4">
            <td>Annex III to ALGERIA's report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl12_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13693&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13693_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13693</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__5">
            <td>CERD-C-NZ-18-20_Annexes</td><td>Annex to State party report</td><td>CERD</td><td>New Zealand</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl14_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fNZL%2f13731&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_NZL_13731_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/NZL/13731</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__6">
            <td>CERD.C.RUS.20-22_Annex1</td><td>Annex to State party report</td><td>CERD</td><td>Russian Federation</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl16_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fRUS%2f13732&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">R</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_RUS_13732_R.doc</td><td style="display:none;">INT/CERD/ADR/RUS/13732</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__7">
            <td>Annex to State party report</td><td>Annex to State party report</td><td>CERD</td><td>Poland</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl18_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fPOL%2f15432&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_POL_15432_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/POL/15432</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__8">
            <td>Annexe X</td><td>Annex to State party report</td><td>CERD</td><td>Belgium</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl20_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fBEL%2f15561&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">F</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_BEL_15561_F.pdf</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/BEL/15561</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__9">
            <td>Annexe XI</td><td>Annex to State party report</td><td>CERD</td><td>Belgium</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl22_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fBEL%2f15562&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">F</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_BEL_15562_F.pdf</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/BEL/15562</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
</tr>
</tbody>

I am trying the following line to see how it goes instead:

while ($content =~ m/rgRow.+?<td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td>/gs)

The original code was not mine.

CJ7
  • Please show the HTML you are trying to parse. Anyways, regex is not the right tool to parse HTML, why don't you use a HTML parser? – Arunesh Singh Feb 26 '16 at 05:27
  • [Required reading for anyone trying to parse XML/HTML with regex](http://stackoverflow.com/a/1732454/18157). Summary: Don't parse HTML/XML with regex, use an appropriate parser. – Jim Garrison Feb 26 '16 at 05:31
  • Agree with the above, but if you need to do it this one time, how about breaking that nasty line up with `qr`? It'd be far easier to look at. – zdim Feb 26 '16 at 05:40
  • I'd prefer some answers/comments about how maybe the regex above would result in 'catastrophic backtracking'. – CJ7 Feb 26 '16 at 05:58
  • As for me, I can't see exactly what it's doing. And as for my comment, it was meant to be constructive: if that was broken up via `qr` it would be much easier to see its structure and then maybe to notice where it could be spinning. But, with best intentions (please take no offense): I've had good experience with `HTML::TableExtract;`. – zdim Feb 26 '16 at 06:04
  • Enable `use re 'debug';` and you will see what the regex engine is doing. – Sobrique Feb 26 '16 at 10:34
  • @Sobrique what will the output be? – CJ7 Feb 26 '16 at 11:31
  • It'll print what the regex engine is doing. It may be very verbose, but it'll help you see how many 'steps' are involved in the process (and thus if backtracking is happening) – Sobrique Feb 26 '16 at 11:44
  • @AruneshSingh I have given the HTML. See my edit to the question. – CJ7 Feb 26 '16 at 22:34
  • @JimGarrison Why would you refer to that question with that bizarre and unhelpful answer? The HTML parsing modules use regex. Why can't I? – CJ7 Feb 26 '16 at 22:36
  • Which HTML parsing modules? They might use regex at the individual lexical _token_ level (many lexers use regex at the core) but not at the tag level. HTML and XML are not regular languages, and trying to use regex is asking for trouble in the long run. It's like using a chainsaw to carve a turkey. The referenced answer embodies the sentiment shared by all experienced developers who have been asked "how do I use regex to parse HTML". The only acceptable answer is "Don't use regex, use a real HTML or XML parser". – Jim Garrison Feb 26 '16 at 22:47
  • It doesn't hang for me. It orderly exits `while` after 2 (two) iterations. (I think I see the problem, though. Will get to it later.) – zdim Feb 27 '16 at 05:17
  • @zdim I tried it at home and at work and it hangs after 2 iterations. It's interesting that it's exiting for you. – CJ7 Feb 27 '16 at 10:44
  • @JimGarrison That answer doesn't embody anything. It is just someone trolling. Everyone is treating it like the emperor wearing no clothes. – CJ7 Mar 07 '16 at 03:58
  • I don't know why regexes are so mystical. Yes, they can be powerful when used to parse regular languages. I guess every person that learns regexes for the first time imbues them with magical powers and they seem to be the solution for all parsing problems. If you've had to clean up somebody else's code that misused regexes you'd identify with that answer. It's not trolling, it's a heartfelt plea for people to stop worshipping at the regex altar and learn how and when they are appropriate. It's also very funny. – Jim Garrison Mar 07 '16 at 05:44
  • @JimGarrison Ok, give me an example of a perl HTML parser that you would recommend. I will look at the source to see if it uses regex. – CJ7 Mar 07 '16 at 11:25
  • @CJ7 I am curious, how did this work? I could never get it to hang but this may still go around whatever the problem is. – zdim Mar 09 '16 at 06:29

2 Answers


I take this problem as a matter of debugging old code. (Still, see the end for a parser example.)

The reported problem is that the regex hangs. For me it exits after a few matches, on the first line. My first suspect is a stray newline; the /s modifier only makes . match a newline. Another suspect is the rgRow phrase that is matched explicitly: it is part of the class attribute of each <tr> tag, so it can also be matched under .*, which may be a conflict. Finally, the regex explicitly seeks each cell while the /g modifier is used as well. For reference, this is the regex, broken up with /x; the original code uses it with the /sg modifiers.

my $patt = qr/rgRow.*?
    <td>   (.*?)<\/td>
    <td.*?>(.*?)<\/td>
    <td.*?>(.*?)<\/td>
    <td.*?> .*? <\/td>
    <td.*?>(.*?)<\/td>
    <td.*?> <nobr>(.*?)<\/nobr> <\/td>
/xs;   # /x allows the layout; /s must be set here, since a qr object keeps its own flags

Picking through the source char by char is not pleasant, and it doesn't work in general. We can do the following instead: remove newlines, then capture the contents of the <td> tags into an array. That is precisely what the original regex sets out to do. (I change the regex delimiter to avoid editor coloring.)

use warnings;
use strict;

my $msg = 'pulled_from_url';   # placeholder for the page text
(my $msg_nonl = $msg) =~ s%\n%%g;

# The leading m is required when the delimiter is not /
my @raw_cells = $msg_nonl =~ m|<td.*?>(.*?)</td>|g;

# While we are at it: strip <nobr> and &nbsp;, drop empty elements
# (the s%%% inside map edits the elements of @raw_cells in place as well)
my @cells = grep { !/^\s*$/ } map { s%</?nobr>|&nbsp;%%g; $_ } @raw_cells;
# Drop the cells that hold links ("View document") as well
my @content = grep { !/<a.*?\/a>/ } @cells;
print "Total of " . scalar(@raw_cells) . " cells. ";
print "Cleaned up, down to " . scalar(@content) . " cells.\n";
print "$_\n" for @content;

This prints the cells' content, edited here for space:

Total of 280 cells. Cleaned up, down to 82 cells.
CONSIDERATION OF REPORTS SUBMITTED BY ... DUE IN 1974 / MOROCCO
State party's report
...
21 Feb 1974
...
True
CONSIDERATION OF REPORTS SUBMITTED BY ... DUE IN 1972 / MOROCCO
State party's report
...
17 Jan 1972
...
True

By inspecting the HTML we can see that the content is pulled correctly.

I do not mean to judge the poster's motives or, rather, restrictions. However, I can't help but compare the above guesswork and careful source reading with the following.

use HTML::TableExtract;   
my $te = HTML::TableExtract->new( keep_html => 1 );
$te->parse( "<table> " . $msg . "</table>" );
# We have one table, use top-level 'rows()' shorthand method
foreach my $row ($te->rows) {
    print join(',', @$row), "\n";
}

This reports the same 280 cells (when counting is added), and prints the same lines as one of the steps above. I only needed to glance at the source to see that it was missing <table> tags. HTML::TableExtract is a subclass of HTML::Parser.

zdim

Your regex requires the sixth column to contain <nobr>...</nobr> tags, which only happens in the first two rows. It hangs after that because non-greedy quantifiers can only do so much. When no match is possible, they're just as susceptible to catastrophic backtracking as the greedy variety.

Instead of relying on .*? all the time, try to be specific about what you don't want to match. In this case, that's simple: the TD's you're matching never contain other tags, so you can use [^<>]* to capture their contents. In fact, you should use that everywhere you're currently using .*?.
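
To see the difference on a small scale, here is a minimal sketch on a made-up three-cell fragment (the sample string and the variable names are assumptions for illustration, not taken from the page). A lazy .*? will stretch across cell boundaries until it reaches <nobr>, while [^<>]* cannot leave the cell it started in.

```perl
use strict;
use warnings;

# Hypothetical three-cell fragment modelled on the question's rows
my $html = '<td>Algeria</td><td>&nbsp;</td><td><nobr>17 Jan 1972</nobr></td>';

# Lazy dot: starts at the first <td> and grows until it reaches <nobr>,
# swallowing two closing </td> tags on the way
my ($lazy) = $html =~ m{<td>(.*?)<nobr>}s;

# Negated class: cannot cross a < or >, so only the cell that really
# starts with <nobr> can match, and the capture before it is empty
my ($strict) = $html =~ m{<td>([^<>]*)<nobr>};

print "lazy:   '$lazy'\n";     # 'Algeria</td><td>&nbsp;</td><td>'
print "strict: '$strict'\n";   # ''
```

On a full page, every extra position the lazy dot is allowed to try multiplies with every other .*? in the pattern, which is where the time goes when no match exists.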

In the regex below, I also made the NOBR tags optional, plus I extended it to match the whole opening TR tag, more for readability's sake than anything else.

while ($content =~ 
  m!<tr\s+class="rgRow[^<>]*>\s*
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>[^<>]*</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>(?:<nobr>)?([^<>]*)(?:</nobr>)?</td>
  !sxg) {
    # do stuff
}
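
As a quick check, the same pattern can be run against a cut-down sample. This is only a sketch: the two rows below are abbreviated, hypothetical stand-ins for the page (shortened titles and ids, hidden cells dropped), one with <nobr> in the sixth cell and one without, and @rows is a name introduced here.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated sample: row 0 has <nobr>, row 1 does not
my $content = <<'HTML';
<tr class="rgRow InnerItemStyle" id="r0">
    <td>Report A</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.65/Add.1</td><td><nobr>21 Feb 1974</nobr></td><td>
    </td>
</tr><tr class="rgRow InnerAlernatingItemStyle" id="r1">
    <td>Annex I</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td>
    </td>
</tr>
HTML

my @rows;
while ($content =~
  m!<tr\s+class="rgRow[^<>]*>\s*
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>[^<>]*</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>(?:<nobr>)?([^<>]*)(?:</nobr>)?</td>
  !sxg) {
    # Five captures per row; the fourth cell (country) is matched but not captured
    push @rows, [$1, $2, $3, $4, $5];
}

printf "%d rows\n", scalar @rows;
print join(' | ', @$_), "\n" for @rows;
```

Both rows match, and the loop ends as soon as the sample is exhausted; the row without <nobr> simply yields &nbsp; in its last two captures.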
Alan Moore