
Possible Duplicate:
How can I extract URL and link text from HTML in Perl?

I am trying to extract a substring from a string. There could be more than one match for that name in the string.

<LI>
<A
 HREF="65378161_12011_Q.pdf"> 
65378161_12011_Q.pdf

</A>

From the above string I want to get the file name "65378161_12011_Q.pdf".

if ($line =~ m/((.*)Q\.pdf)/i) {
    my $inside = $2;
    print " file name:$inside \n";
}

This is what I tried, but it does not get the right substring. Can someone help with this? I would really appreciate an answer to my question.

swati

3 Answers


See the following script:

#!/usr/bin/env perl

use strict;
use warnings;

my $string = "65378161_12011_Q.pdf";


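# (.*?) is non-greedy: it captures as little as possible before "Q.pdf".
# $1 is the whole file name, $2 is the part before "Q.pdf".
if ($string =~ m/((.*?)Q\.pdf)/i) {
    my $inside = $2;
    print " file name:$inside \n";
}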

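Your code just lacks the '?' character that tells the regex not to be greedy. It must directly follow the quantifier, as in (.*?); written after the group, as in (.*)?, it only makes the group optional.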
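To make the greedy versus non-greedy difference concrete, here is a minimal sketch. It assumes an input line holding two links; the second file name and the variable names are made up for illustration and are not from the question. It also shows one way to collect every name with the /g modifier, since the question mentions more than one possible match.

```perl
use strict;
use warnings;

# Hypothetical input: two links on one line (the second file name is made up).
my $line = '<A HREF="65378161_12011_Q.pdf">x</A> <A HREF="11111111_12012_Q.pdf">y</A>';

# Greedy: .* runs to the end of the line, then backtracks to the LAST Q.pdf"
my ($greedy) = $line =~ m/"(.*)Q\.pdf"/i;

# Non-greedy: .*? stops at the FIRST Q.pdf" after the opening quote.
my ($lazy) = $line =~ m/"(.*?)Q\.pdf"/i;

# /g in list context returns the capture from every match on the line.
my @names = $line =~ m/"([^"]*?Q\.pdf)"/ig;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "name:   $_\n" for @names;
```

With only one "Q.pdf" in the string both forms capture the same text, so the difference only shows up once a line contains several candidates.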

Another way is to match every character that is not a 'Q' before the 'Q' itself:

m/(^[^Q]+)?Q\.pdf/i
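As a rough sketch of how this negated class behaves, the snippet below runs the pattern on the file name from the question and on a second, made-up name that contains an earlier "q". The helper name prefix_of is hypothetical. Note that with /i the class [^Q] also excludes a lowercase "q".

```perl
use strict;
use warnings;

# [^Q]+ matches one or more characters other than "Q" (and "q", because of /i).
my $pattern = qr/(^[^Q]+)?Q\.pdf/i;

# Hypothetical helper: returns the captured prefix, or undef if nothing was captured.
sub prefix_of {
    my ($name) = @_;
    return $name =~ $pattern ? $1 : undef;
}

# The second name is made up to show a limitation of the negated class.
for my $name ('65378161_12011_Q.pdf', 'quarter_Q.pdf') {
    my $prefix = prefix_of($name);
    $prefix = '(no prefix captured)' unless defined $prefix;
    print "$name -> $prefix\n";
}
```

For the second name the match still succeeds, but the optional group is skipped, so nothing is captured. This approach therefore only suits names whose sole "Q" is the one right before ".pdf".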

Edit: Since you edited your post to change the spec: if you need to parse HTML, I recommend using a proper module:

Don't parse or modify html with regular expressions! See one of HTML::Parser's subclasses: HTML::TokeParser, HTML::TokeParser::Simple, HTML::TreeBuilder(::Xpath)?, HTML::TableExtract etc. If your response begins "that's overkill. i only want to..." you are wrong. http://en.wikipedia.org/wiki/Chomsky_hierarchy and here for why not to use regex on HTML

(This is the reminder about parsing HTML with regexes from the #perl channel on irc.freenode.org.)

Edit 2:

Here is a complete working example:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content('
<LI>
<A
 HREF="65378161_12011_Q.pdf"> 
65378161_12011_Q.pdf

</A>
');

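# Take the text of the first <a> element; $1 is the part before "Q.pdf".
my $link = $tree->look_down("_tag", "a");
print "$1\n" if $link && $link->as_text =~ m/(^[^Q]+)Q\.pdf/i;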
Gilles Quénot
  • Somehow this also did not work. The output I got after the regular expression match is just "file name:". – swati Apr 23 '12 at 20:03
  • I wish you'd delete the regex solution, or at least put it at the bottom of your answer. – daxim Apr 23 '12 at 20:25