5

I want to replace:

'''<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>'''

With:

='''<font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font>'''=

Now my existing code is:

$html =~ s/\n(.+)<font size=\".+?\">(.+)<\/font>(.+)\n/\n=$1$2$3=\n/gm

However this ends up with this as the result:

=''' SUMMER/WINTER CONFIGURATION FILES</font>'''=

I can see what is happening: it matches from <font size="... all the way up to the end of <font color="blue">, which is not what I want. I want it to stop at the first " rather than the last. I thought that was what adding the ? would do, but I've tried .+, .+?, .* and .*? with the same result each time.

Anyone got any ideas what I am doing wrong?

rollsch

3 Answers

8

Write .+? in all places to make each match non-greedy.

$html =~ s/\n(.+?)<font size=\".+?\">(.+?)<\/font>(.+?)\n/\n=$1$2$3=\n/gm
                ^                ^      ^            ^
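
To illustrate the difference between the two quantifiers, here is a minimal sketch that runs a greedy and a non-greedy capture against the sample line from the question. The variable names ($line, $greedy, $lazy) are only for illustration and are not from the original post.

```perl
use strict; use warnings;

# Sample line from the question (no surrounding newlines needed here).
my $line = q{'''<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>'''};

# Greedy: (.+) grabs as much as possible, so it ends at the LAST </font>.
my ($greedy) = $line =~ m{<font size=".+?">(.+)</font>};

# Non-greedy: (.+?) grabs as little as possible, so it ends at the FIRST </font>.
my ($lazy) = $line =~ m{<font size=".+?">(.+?)</font>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture keeps the inner closing tag, while the non-greedy one stops just before it.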

Also try to avoid using regular expressions to parse HTML. Use an HTML parser if possible.

Mark Byers
  • Already tried this as per my comment and it didn't work. I haven't used HTML parsers before, any suggestions? – rollsch Dec 21 '10 at 04:16
7

You could change .+ to [^"]+ (instead of "match anything", it means "match anything that isn't a double quote").
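
As a rough sketch of how that could look when applied to the substitution from the question (assuming the input line is wrapped in newlines, as the original pattern expects, and using s{}{} delimiters to avoid escaping the slashes):

```perl
use strict; use warnings;

# Sample input from the question, with the leading and trailing newline the pattern requires.
my $html = qq{\n'''<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>'''\n};

# [^"]+ cannot cross a double quote, so the size attribute match
# must end at its own closing quote instead of a later one.
$html =~ s{\n(.+?)<font size="[^"]+">(.+)</font>(.+?)\n}{\n=$1$2$3=\n}g;

print $html;
```

On this sample the outer font tag is removed and the line is wrapped in = signs, which is the result asked for in the question. Other inputs (for example a line without a size attribute) will simply not match.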

Jon
  • Tried that and it doesn't match anything at all, here is what I used: $html =~ s/\n(.+?)(.+)<\/font>(.+?)\n/\n===$1$2$3===\n/m; – rollsch Dec 21 '10 at 04:27
  • Hmm it worked on the string I posted in the example, but it fails to match at all on this example, ideas?:''' SUMMER/WINTER CONFIGURATION FILES''' – rollsch Dec 21 '10 at 04:29
4

As Mark said, just use CPAN for this.

#!/usr/bin/env perl

use strict; use warnings;
use HTML::TreeBuilder;

my $s = q{<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>};

my $tree = HTML::TreeBuilder->new;
$tree->parse( $s );
$tree->eof;    # signal end of input so the parser flushes any buffered text
print $tree->find_by_attribute( color => 'blue' )->as_HTML;

# => <font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font>

That said, a plain regex also works for your specific case:

#!/usr/bin/env perl

use strict; use warnings;

my $s = q{<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>};

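print $s =~ m{
                 < .+? >     # outer opening tag (non-greedy)
                 (.+)        # greedy: everything up to the last closing tag
                 </ .+? >    # outer closing tag
             }mx;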

# => <font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font>
Pedro Silva