Use regular expression to extract img tag from HTML in Perl

Question

I need to extract a captcha image from a URL and recognise it with Tesseract. My code is:

#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
###Add code here!
#Grab img from HTML code
#if ($html =~ /<img. *?src. *?>/)
#{
#    $img1 = $1;
#}
#else 
#{
#    $img1 = "";
#}
$img2 = grep(/<img. *src=.*>/,$html);
if ($html =~ /\img[^>]* src=\"([^\"]*)\"[^>]*/)
{
    my $takeImg = $1;
    my @dirs = split('/', $takeImg);
    my $img = $dirs[2];
}
else
{
    print "Image not found\n";
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$img' > ocr_me.img\n";
system "GET '$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;

As you can see, I'm trying to extract the src attribute of the img tag. The solution from "use shell command tesseract in perl script to print a text output" ($img1) did not work for me. I also tried an adapted version of the solution from "How can I extract URL and link text from HTML in Perl?" ($img2).

If you need the HTML code of that page, here it is:

<html>
<head>
<title>Perl test</title>
</head>
<body style="font: 18px Arial;">
<nobr>somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" 
somenumbers<img src="/captcha/1533030599.png"/>
somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" </nobr><br/><br/><form method="post" action="?u=user&p=pass">User: <input name="u"/><br/>PW: <input name="p"/><br/><input type="hidden" name="file" value="1533030599.png"/>Text: <input name="text"></br><input type="submit"></form><br/>
</body>
</html>

I get the error that the image was not found. I think my problem is a wrong regular expression. I cannot install any modules such as HTTP::Parser or similar.

Chris Turner · Accepted Answer · 2018-07-31T15:52:48.250

Aside from the fact that using regular expressions on HTML isn't very reliable, your regular expression in the following code isn't going to work because it's missing a capture group, so $1 won't be assigned a value.

if ($html =~ /<img. *?src. *?>/)
{
    $img = $1;
}

If you want to extract part of the text using a regular expression, you need to put that part of the pattern inside parentheses (a capture group). For example:

$example = "hello world";
$example =~ /(hello) world/;

this will set $1 to "hello".
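As a minimal sketch of the same idea, the match can also be used in list context so the captured text lands in a named variable, with a check for failure. The variable names `$first` and `$matched` are only illustrative and are not from the question.

```perl
use strict;
use warnings;

# Same sample string as in the example above.
my $example = "hello world";

# The parentheses form capture group 1. In list context the match
# returns the captured text, or an empty list if the match fails.
my ($first) = $example =~ /(hello) world/;

# Without parentheses the match can still succeed, but nothing is
# captured, so there is no value to read from $1.
my $matched = $example =~ /hello world/ ? 1 : 0;

print defined $first ? "captured: $first\n" : "no match\n";
print "matched without capture: $matched\n";
```

Checking `defined $first` (or the result of the match itself) avoids reading a stale or empty `$1` when the pattern does not match.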

The regular expression itself doesn't make much sense either. Where you have ". *?", that matches any single character followed by zero or more spaces. Is that a typo for ".*?"? That would match any number of characters but, unlike the greedy ".*", it matches as few as possible, so it stops as soon as the next part of the regex can match.
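The difference between the three forms can be seen in a small sketch. The sample string in `$text` is hypothetical, chosen only so that the greedy and non-greedy results differ.

```perl
use strict;
use warnings;

# Hypothetical sample text containing two tags.
my $text = '<img src="a.png"><br>';

# Greedy: .* runs to the end of the string, then backs up to the LAST ">".
my ($greedy) = $text =~ /(<img.*>)/;

# Non-greedy: .*? grows one character at a time and stops at the FIRST ">".
my ($lazy) = $text =~ /(<img.*?>)/;

# ". *?" is one arbitrary character followed by optional spaces, so it
# cannot skip over ' src="a.png"' to reach the closing ">".
my $typo_matches = $text =~ /<img. *?>/ ? 1 : 0;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "typo:   $typo_matches\n";
```

Here the greedy pattern swallows both tags, the non-greedy one stops after the img tag, and the version with the stray space does not match at all.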

This regular expression is possibly closer to what you're looking for. It'll match the first img tag that has a src attribute starting with "/captcha/" and store the image path in $1.

$html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s;

Here is how it works. The "m%....%" is just a different way of writing "/.../" that allows you to put slashes in the regex without needing to escape them. "[^>]*" will match zero or more of any character except ">", so it can't run past the end of the tag. "(/captcha/[^"]*)" uses a capture group to grab everything inside the double quotes, which is the URL path. The "/s" modifier on the end makes "." match newline characters too. This pattern contains no ".", so the modifier isn't actually needed: the negated classes "[^>]" and "[^"]" already match "\n", so an img tag split over multiple lines will still be found.
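Run against an excerpt of the HTML from the question, the pattern can be tried out like this. This is only a sketch: the heredoc stands in for whatever the script actually downloads into `$html`, and `$file` is an illustrative name for the path with its directory stripped.

```perl
use strict;
use warnings;

# Excerpt of the page posted in the question. Only the middle line
# contains a real "<img" tag; the other two lack the opening "<".
my $html = <<'HTML';
<nobr>somenumbersimg src="/JJ822RCXHFC23OXONNHR.png"
somenumbers<img src="/captcha/1533030599.png"/>
somenumbersimg src="/JJ822RCXHFC23OXONNHR.png" </nobr>
HTML

my $img = '';
if ($html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s) {
    $img = $1;    # the path captured by the parentheses
}
die "<img> not found\n" if !$img;

# File name without its directory, using the same substitution
# the question applies to $img near the end of the script.
(my $file = $img) =~ s{^.*/}{};

print "path: $img\n";
print "file: $file\n";
```

The two decoy lines are skipped because the pattern insists on a literal "<img", and the capture holds only the path, not the surrounding quotes.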

edited Jul 31 '18 at 15:52

answered Jul 31 '18 at 14:46


The non-greedy version of `.*` is not `.?` but `.*?`. `.?` will match a single character that may not be there. – jja Jul 31 '18 at 15:47
@jja My answer is correct, but the *'s are being stripped out by the formatting and used to emphasise bits of it instead :/ I guess I'll need to work out how to fix that – Chris Turner Jul 31 '18 at 15:50
I thought something was off with the formatting. :) Try wrapping them in "`" so they don't get interpreted as formatting characters – jja Jul 31 '18 at 15:53
@jja I've opted for escaping them as that seems to work too – Chris Turner Jul 31 '18 at 15:54
@Chris Turner that was a very nice explanation, thanks. In the output I see that $img = 'GET /captcha/1533049803.png'. Tesseract doesn't recognise that $img. Maybe I need 'GET 1533049803.png' instead of 'GET /captcha/1533049803.png'. Can you give me a regular expression to do this? I tried to edit your regular expression to achieve this, but did not succeed – Dmitriy Kisil Jul 31 '18 at 16:04
@Oysiyl That is the exact opposite of your problem, you want to specify more of the URL, not less. You want to do (GET 'http ://perltest.adavice.com/captcha/1533049803.png') – Chris Turner Jul 31 '18 at 16:13
(except without a space after "http" as that is just me trying to stop it from treating it like an actual URL in my comment) – Chris Turner Jul 31 '18 at 16:15
@Chris Turner yes! I want to do this but don't understand how to add the text ('http://perltest.adavice.com') to my $img (/captcha/1533049803.png). I tried to concatenate $url and $img but this doesn't work as I expected – Dmitriy Kisil Jul 31 '18 at 16:18
You've got `system "GET '$img' > ocr_me.img";` so make it `system "GET '$home$img' > ocr_me.img";` – Chris Turner Jul 31 '18 at 16:25
@Chris Turner, yep, the program loads the image correctly. But Tesseract doesn't recognise anything (GET '' > ocr_result.txt. Captcha text not specified). In any case, the question has been answered. I will probably look for more information about Tesseract. If I can't resolve the error, I will ask a new question on the site. Thanks for your help – Dmitriy Kisil Jul 31 '18 at 17:07
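For reference, the URL fix from the comments can be sketched as below, together with one thing that may be worth checking for the empty OCR text: in the question's script, `$txt = 'cat ocr_result.txt';` uses single quotes, which only create a string and do not run a command. The helper `read_ocr_text` is hypothetical (not from the thread) and reads the file directly instead.

```perl
use strict;
use warnings;

my $home = "http://perltest.adavice.com";   # site root from the question
my $img  = "/captcha/1533049803.png";       # path as captured by the regex

# Concatenate the site root and the path to get an absolute URL.
my $img_url = $home . $img;

# Hypothetical helper: read the OCR output file and keep only the
# characters the question's clean-up substitution allows.
sub read_ocr_text {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    local $/;                          # slurp mode: read the whole file
    my $txt = <$fh>;
    close $fh;
    $txt = '' unless defined $txt;
    $txt =~ s/[^A-Za-z0-9\-_.]+//g;    # same character set as in the question
    return $txt;
}

print "$img_url\n";
```

Calling `read_ocr_text("ocr_result.txt")` after tesseract has run would then return the recognised text, assuming tesseract wrote its output to that file.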
