1

I need to extract a string from HTML code. I have a regexp. After I open a file (or after I make a GET request), I need to find the pattern.

So, I have some HTML and I want to find a string like this:

<input type="hidden" name="qid" ... anything is possible bla="blabla" ... value="8">

I want to find the string qid, then find the string value="435345" after it, and extract 435345.

Right now I am just trying to find this string (I have already written that part); the replacement will come later (I have not done it yet). But this code can't find the pattern. What is wrong?

open(URLS_OUT, $foundResults);
@lines = <URLS_OUT>;
$content = join('', @lines);

$content =~ /<qid\"\s*value=[^>][0-9]+/;
print 'Yes'.$1.'\n';

close(URLS_OUT);

or this code:

my $content = $response->content(); 

while ($content =~ /<qid\"\s*value=[^>][0-9]+/g)
    {
        print 'Yes'.$1.'\n';
    }

I have checked that the file is not empty and that it is opened correctly (I printed it out), but the program can't find the pattern. What is wrong? I have checked the regular expression using this site (and some others): http://gskinner.com/RegExr/ It shows that the regular expression is correct and finds what I need.

rubber boots
user565447

5 Answers

3

Update your regex like this:

/<qid\"\s*value=([^>][0-9]+)/

That is, add the "(" and ")" to capture the data in $1
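A minimal sketch of why the parentheses matter, assuming a sample line shaped like the one in the question. Note that the pattern as written above expects the literal text `<qid"`, which that sample line does not contain, so the sketch uses a pattern adapted to the sample. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Sample input modeled on the line in the question (an assumption).
my $html = '<input type="hidden" name="qid" bla="blabla" value="435345">';

# Without parentheses the match can succeed, but nothing is captured.
my $matched_without_group = $html =~ /name="qid"[^>]*value="[0-9]+"/ ? 1 : 0;

# With parentheses, a list-context match returns the captured digits
# (the same text that ends up in $1).
my ($qid_value) = $html =~ /name="qid"[^>]*value="([0-9]+)"/;

print "matched: $matched_without_group\n";
print "captured: ", (defined $qid_value ? $qid_value : 'nothing'), "\n";
```

With this sample, the first line printed should be "matched: 1" and the second "captured: 435345". Checking the return value of the match before using $1 avoids printing a stale or undefined capture.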

Aziz Shaikh
3

Your idea of how

$content =~ /<qid\"\s*value=[^>][0-9]+/;

works is wrong. Please study basic Regex usage in Perl.

BTW: you shouldn't parse HTML with a regex. There are a lot of examples on the web and on SO of how to do that correctly. Look it up!


For learning purposes, your regex would look like this (according to your comment):
my $content = q{
 <input type="hidden" id="qid" name="qid" bla="blabla" value="8">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="98">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="788">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="128">
 <input type="hidden" id="qid" name="qid" bla="blabla" value="8123">
};
my $regex = qr{ name=     # find the attribute 'name'
                "qid"     # with a content of "qid"
                .+?       # now search along until the next 'value'
                value=    # the following attribute 'value' 
                "(\d+)    # find the number and capture it
              }x;   ## allow the regex to be formatted   

while( $content =~ /$regex/g ) { # /g - search along
   print "Yes $1 \n"
}  

Once you have this working, please study how to read the content with an HTML parser.

Community
rubber boots
  • Thanks, I know that I should use HTML::Parser, but for me it's faster to use a regexp. I need to do the work as fast as I can; moreover, I have a lot of texts (HTML, ASCII text, XML). That is why I use a regexp. – user565447 Jul 04 '13 at 10:12
  • @user565447 - after 'discovering' your 'lost' tag from your original post, I updated my answer again (by `.+?`). Can you explain in detail what the regex does? – rubber boots Jul 04 '13 at 10:14
3

Use HTML::Parser to cope with messy real-world HTML.

#! /usr/bin/env perl

use strict;
use warnings;

use HTML::Parser;

sub start {
  my($attr,$attrseq) = @_;
  while (defined(my $name = shift @$attrseq)) {  # first ...="qid"
    last if $attr->{$name} eq "qid";
  }
  while (defined(my $name = shift @$attrseq)) {  # then value="<num>"
    if ($name eq "value" && $attr->{$name} =~ /\A[0-9]+\z/) {
      print "Yes", $attr->{$name}, "\n";
    }
  }
}

my $p = HTML::Parser->new(
  api_version => 3,
  start_h => [\&start, "attr, attrseq"],
);
$p->parse_file(*DATA);

__DATA__
<input type="hidden" name="qid" value="8">
<input type="hidden" name="qidx" value="000000">
<foo type="hidden" name="qid" value="9">
<foo type="hidden" name="qid" value="000000x">
<foo type="hidden" name="QID" value="000000">
<bar type="hidden" NAME="qid" value="10">
<baz type="hidden" name="qid" VALUE="11">
<quux type="hidden" NAME="qid" VALUE="12">

Output:

Yes8
Yes9
Yes10
Yes11
Yes12
Greg Bacon
2

For $1 to contain a value, you need to use a Capture Group. Try:

$content =~ /<qid\"\s*value=([^>][0-9]+)/;
RobEarl
0

For the sample you give, your regular expression should look something like this:

$content =~ m{
               \"       # match a double quote
               qid      # match the string: qid
               \"       # match a double quote
               [^>]*    # match anything but the closing >
               value    # match the string: value
               \=       # match an equal sign
               \"       # match a double quote
               (\d+)    # capture a string of digits
               \"       # match a double quote
             }msx;
shawnhcorey
  • The `"`-quote is not a regex special char. Escaping it doesn't help ;-). The equal `=` isn't either. – rubber boots Jul 04 '13 at 14:00
  • It doesn't hurt either. In Perl's regular expressions, special meta-characters are non-alphanumerics and backslash-escaped alphanumerics. All non-meta-characters are backslash-escaped non-alphanumerics and alphanumerics. If you get into the habit of backslashing non-alphanumerics when you mean them literally, then you will avoid errors. – shawnhcorey Jul 04 '13 at 14:48
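
A small sketch of the point made in the two comments above, assuming a hypothetical sample line: escaping `"` and `=` does not change what the pattern matches, and Perl's built-in quotemeta applies the "backslash every non-alphanumeric" habit automatically. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = '<input name="qid" value="8">';   # hypothetical sample line

# Escaped and unescaped quotes and equal signs match the same text.
my ($plain)   = $line =~ /value="(\d+)"/;
my ($escaped) = $line =~ /value\=\"(\d+)\"/;

# quotemeta backslashes every ASCII non-word character, so a literal
# string can be interpolated into a pattern safely.
my $literal = quotemeta('name="qid"');
my $found   = $line =~ /$literal/ ? 1 : 0;

print "$plain $escaped $found\n";
print "$literal\n";
```

Both captures should come out as 8, and the second line printed shows the backslashes that quotemeta added. Inside a pattern, `\Q...\E` has the same effect as calling quotemeta on the enclosed text.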