
I am new to Perl and regular expressions. I think I understand the idea and how to use a regex, but I got stuck on a problem while writing a script. I have the content of a page and I am trying to read some information from it.

my @rows = split(/<tr(\s)bgcolor=.{8}/,$content);

foreach my $row (@rows) {
    if ( $row =~ /<td\s+nowrap\s+align=.*\s?(bgcolor=.*\s+)?>\w*\s?<\/td>/ig ) {
        print $1;
        print $file_opt $row."\n";

        # there will be more code later on
    }
}

This gives me a warning that $1 is uninitialized. I understand that this happens when the pattern does not match the string. But I have the regex inside an if, so if it enters the if, it does match, right? As you can see, I printed the rows to a file. Each one looks like this:

<td nowrap align="right">DOLNOŚLĄSKIE</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">4</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >1</td><td nowrap align="right">3</td><td nowrap align="right" bgcolor=#D0E0D0 >6</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >2</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >19</td><td nowrap align="right">0</td></tr>

And none of the unnecessary content from $content ends up in the file. So does this pattern match or not?

Stephen Ostermiller
kamila
  • Use a parser instead. – hwnd Jun 17 '14 at 15:30
  • Don't use a regex for this. Use an HTML parser instead. – Amal Murali Jun 17 '14 at 15:31
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/perl for examples of how to use existing Perl modules that have already been written, tested and debugged. – Andy Lester Jun 17 '14 at 15:31
  • Required reading: http://stackoverflow.com/a/1732454/18157 – Jim Garrison Jun 17 '14 at 15:59
  • Thanks for the advice. I guess my colleague, from whom I got the sample code, is not as good as everybody assumes :D – kamila Jun 17 '14 at 17:56

2 Answers


From the code in your post, it looks like you are trying to capture the bgcolor attribute for each table cell in a given row. Not all of the cells have a bgcolor set, but some of them do. Here's how you can extract that information using HTML::TreeBuilder:

use HTML::TreeBuilder 5 -weak;

my $html = q{<td nowrap align="right">DOLNOŚLĄSKIE</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">4</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >1</td><td nowrap align="right">3</td><td nowrap align="right" bgcolor=#D0E0D0 >6</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >2</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >19</td><td nowrap align="right">0</td></tr>};

my $t = HTML::TreeBuilder->new_from_content($html);

foreach my $col ( $t->look_down('_tag','tr')->content_list ) {
  print $col->attr('bgcolor'), "\n" if defined $col->attr('bgcolor');
}

I'm sure you need to retrieve more than that, but it's all we are able to determine given the vague description and incomplete code of your question.

But the point is solid; don't parse HTML with regexes, parse HTML with an HTML parser. It's a slightly steeper learning curve at the beginning, but the result will be more robust, easier to maintain, and the skill you learn will be applicable to any HTML document, not just this particular one.

HTML::TreeBuilder comes with some good documentation, but you've got to read a good portion of it to make sense of the whole thing.

There's another HTML parsing module, Mojo::DOM, which comes with the Mojolicious framework. Personally, I find it easier to use, but sometimes when I post examples people seem to jump to the conclusion that they have to load some heavyweight web framework to use it (which isn't entirely true, but I'm tired of swimming upstream. ;). You might want to have a look at it and see if it better fits your taste. Here's an example:

use Mojo::DOM;

my $html = q{<td nowrap align="right">DOLNOŚLĄSKIE</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">4</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >1</td><td nowrap align="right">3</td><td nowrap align="right" bgcolor=#D0E0D0 >6</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >2</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >19</td><td nowrap align="right">0</td></tr>};

for my $td ( Mojo::DOM->new($html)->find('td[bgcolor]')->each ) {
  print $td->attr('bgcolor'), "\n";
}

Both of those code examples will produce the following output:

#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0

...which probably isn't terribly useful, but is exactly what the code you posted seems to want to capture. At least it's a starting point that you should be able to adapt to your own needs.
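As one possible adaptation, here is a minimal sketch that collects each cell's text together with its bgcolor (if any). The shortened row and the wrapping table and tr tags are made up for illustration, and the @cells structure is just one way to hold the result, not something from the question:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical shortened row, wrapped in <table><tr> so the fragment is complete.
my $html = q{<table><tr><td nowrap align="right">A</td><td nowrap align="right" bgcolor=#D0E0D0 >19</td></tr></table>};

# One hash per cell: its text and its bgcolor (undef when the attribute is absent).
my @cells;
for my $td ( Mojo::DOM->new($html)->find('td')->each ) {
    push @cells, { text => $td->text, bgcolor => $td->attr('bgcolor') };
}

# Print each cell, substituting a placeholder for cells without a bgcolor.
printf "%-5s %s\n", $_->{text}, $_->{bgcolor} // '(none)' for @cells;
```

Selecting plain td instead of td[bgcolor] keeps the cells without a colour in the list, so the position of each value in the row is preserved.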

I believe the documentation for Mojo::DOM is more approachable, which might just make the difference, especially if you're new to Perl. My recommendation would be to start there, and build your solution around that module. In the long run you'll be much better off than tearing your hair out using regexes to extract data from HTML.

The Mojolicious distribution installs in under a minute on most systems, and includes the Mojo::DOM module, which on its own is quite light-weight. It's a good option.

DavidO
  • Thank you for such an exhaustive answer. I will definitely follow your advice and read more about the packages you've mentioned :) – kamila Jun 17 '14 at 17:57
  • @kamila For a nice 8 minute introductory video to `Mojo::DOM`, check out [`Mojocast Episode 5`](http://mojocasts.com/e5) – Miller Jun 17 '14 at 22:57

Do not handcraft regexes to parse HTML, yadda yadda. Now to your actual question:

"But i have regex under if - so if it enters the if, it does match, right?"

In your regex you have a ? quantifier behind your capture group. That means the regex can match (and on your example does match) with the capture group taking part either once or not at all. If the match the engine finds involves the capture group zero times, then nothing is captured and $1 is left undefined, even though the if condition is true. Get rid of that question mark to make sure your regex only matches when it actually captured something.
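A minimal sketch of the difference, using a made-up two-cell row rather than your real data (the nowrap attribute is dropped from both the row and the pattern to keep it short):

```perl
use strict;
use warnings;

# Hypothetical shortened row: the last cell has no bgcolor attribute.
my $row = '<td align="right" bgcolor=#D0E0D0 >19</td><td align="right">0</td>';

# Optional group, as in the question: the match succeeds, but the group
# may take part zero times, which leaves its capture undefined.
my ($optional) = $row =~ /<td\s+align=.*\s?(bgcolor=.*\s+)?>\w*\s?<\/td>/i;

# Mandatory group: the pattern can only match if bgcolor=... was captured.
my ($required) = $row =~ /<td\s+align=.*\s?(bgcolor=.*\s+)>\w*\s?<\/td>/i;

print "optional group: [", $optional // 'undef', "]\n";
print "required group: [", $required // 'undef', "]\n";
```

On this input the first line shows undef and the second shows the bgcolor attribute of the first cell. If you keep the group optional, test it with defined before printing.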

With the question mark removed, the regex still matches your example and does capture something.

One might assume that a greedy quantifier makes the group capture something whenever it can (the regex does, after all, capture once the ? is removed). But there are several greedy quantifiers in the pattern, and an earlier one, the .* after align=, gets to be greedy first: it swallows the bgcolor attributes, and the optional group is left matching nothing.
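To see which quantifier wins, here is a small sketch on a made-up two-cell row. The extra parentheses around the first .* are my addition, only there to expose what it consumed (and nowrap is left out to keep it short):

```perl
use strict;
use warnings;

# Hypothetical shortened row: the last cell has no bgcolor attribute.
my $row = '<td align="right" bgcolor=#D0E0D0 >19</td><td align="right">0</td>';

# Group 1 wraps the leading greedy .* ; group 2 is the optional bgcolor group.
my ( $greedy, $bgcolor ) =
    $row =~ /<td\s+align=(.*)\s?(bgcolor=.*\s+)?>\w*\s?<\/td>/i;

print "leading .* consumed: $greedy\n";
print "bgcolor group:       ", $bgcolor // 'undef', "\n";
```

The leading .* runs all the way to the last cell, taking the bgcolor attribute with it, so the optional group never gets a chance to match.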

DeVadder