How to get a matched substring from a string with a regular expression in Perl

Question

Possible Duplicate:
How can I extract URL and link text from HTML in Perl?

I am trying to extract a substring from a string. There could be more than one matching substring in the string.

<LI>
<A
 HREF="65378161_12011_Q.pdf"> 
65378161_12011_Q.pdf

</A>

From the above string I want to get the file name "65378161_12011_Q.pdf".

if ($line =~ m/((.*)Q\.pdf)/i) {
    my $inside = $2;
    print " file name:$inside \n";
}

This is what I tried, but it does not get the right substring. Can someone help with this? I would really appreciate an answer to my question.

score 0 · Answer 1 · edited May 23 '17 at 10:28

See the following script:

#!/usr/bin/env perl

use strict;
use warnings;

my $string = "65378161_12011_Q.pdf";


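if ($string =~ m/((.*?)Q\.pdf)/i) {
    # $1 is the whole match, $2 is only the part before "Q.pdf"
    my $inside = $2;
    print " file name:$inside \n";
}

Your code just lacks the '?' after '.*' that tells the regex not to be greedy. Note that it goes inside the group, '(.*?)'; written as '(.*)?' it would only make the group optional.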

Another way is to match every character that is not a 'Q' before the 'Q' itself:

m/(^[^Q]+)?Q\.pdf/i
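The difference between the greedy, non-greedy, and negated-class patterns only shows up when a line contains more than one "Q.pdf". The following is a minimal sketch, assuming a made-up input line with two links; the variable names and the sample file names are illustrative and not from the original post.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical input: one line holding two links.
my $line = 'HREF="a_Q.pdf"> HREF="b_Q.pdf">';

# Greedy: .* runs to the end of the line, then backtracks to the LAST Q.pdf.
my ($greedy) = $line =~ m/"(.*)Q\.pdf/i;

# Non-greedy: .*? stops at the FIRST Q.pdf. The ? sits inside the group.
my ($lazy) = $line =~ m/"(.*?)Q\.pdf/i;

# Negated class: [^Q]+ cannot cross a Q at all (with /i, not a q either).
my ($negated) = $line =~ m/"([^Q]+)Q\.pdf/i;

print "greedy:  $greedy\n";   # greedy:  a_Q.pdf"> HREF="b_
print "lazy:    $lazy\n";     # lazy:    a_
print "negated: $negated\n";  # negated: a_
```

On a string that contains only one "Q.pdf", all three patterns capture the same text, which is why the greedy version appears to work on the short test string.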

Edit: Because you edited your post with a different spec: if you need to parse HTML, I recommend using a proper module:

Don't parse or modify html with regular expressions! See one of HTML::Parser's subclasses: HTML::TokeParser, HTML::TokeParser::Simple, HTML::TreeBuilder(::Xpath)?, HTML::TableExtract etc. If your response begins "that's overkill. i only want to..." you are wrong. http://en.wikipedia.org/wiki/Chomsky_hierarchy and here for why not to use regex on HTML

(This is a reminder about using regex to parse HTML from #perl channel on irc.freenode.org)

Edit 2:

Here is a complete working example:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content('
<LI>
<A
 HREF="65378161_12011_Q.pdf"> 
65378161_12011_Q.pdf

</A>
');

$tree->look_down("_tag", "a")->as_text =~ m/(^[^Q]+)Q\.pdf/i && print "$1\n";

Thanks for your suggestion. I changed the question and tried your recommendation, but it did not work. — swati, Apr 23 '12 at 19:52
Somehow this also did not work. The output I got after the regular expression match is "file name:" with nothing after it.

score 0 · Answer 2 · answered Apr 23 '12 at 20:24

Use an HTML parser.

use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<LI>
<A
HREF="65378161_12011_Q.pdf">
65378161_12011_Q.pdf

</A>
HTML

$w->find('a')->attr('href');
# expression returns '65378161_12011_Q.pdf'
$w->find('a')->text;
# expression returns ' 65378161_12011_Q.pdf '

daxim


That's a nifty looking module I haven't used yet :) – brian d foy Apr 23 '12 at 20:36

score -1 · Answer 3 · answered Apr 23 '12 at 19:53

Since . matches any character (except a newline), simply remove the inner parentheses around .* and read the whole match from $1.

#!/usr/bin/perl

my $line = "65378161_12011_Q.pdf";

if ($line =~ m/(.*Q\.pdf)/i )
{
  my $inside = $1;
  print "filename = $inside\n";
}

Produces the correct output.

Hope it helps.
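For the multi-line HTML input in the question, where there may be several links, a match with the /g modifier in list context returns every capture at once. This is only a rough sketch under stated assumptions: the second file name is made up to show multiple matches, and the pattern assumes each file name sits in a double-quoted HREF attribute. For real HTML, a parser module remains the more robust choice.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical input: the question's snippet plus a second,
# made-up link to show more than one match.
my $html = <<'HTML';
<LI>
<A
 HREF="65378161_12011_Q.pdf">
65378161_12011_Q.pdf

</A>
<LI>
<A
 HREF="12345678_12011_Q.pdf">
12345678_12011_Q.pdf

</A>
HTML

# In list context, /g returns the capture group of every match.
# [^"]* keeps each match inside a single quoted attribute value.
my @files = $html =~ m/HREF="([^"]*Q\.pdf)"/gi;

print "file name: $_\n" for @files;
```

Because the capture group wraps the whole attribute value, each element of @files is a complete file name, not just the part before "Q.pdf".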

Manny

MannyCalavera


Thanks for your suggestion. This also did not work, but I have changed the question again. Could you please check it one more time, as the input string is different now? – swati Apr 23 '12 at 19:56
