0

I need some help with string parsing in Perl. I have an HTTP server that responds with something like this:

<html>
<head><title></title></head><body>
T:17.10;H:32.10
</body></html>

I need to capture the two numbers (17.10 and 32.10 in the example) and put them in two variables that I will then use in some if...then...else logic.

I'm not an expert in string manipulation and regexes. At the moment I'm trying this:

my $url = 'http://192.168.25.9';
my $content = get $url;
die "Couldn't get $url" unless defined $content;
my @lines = split /\n/, $content;
$content2 = $lines[2];
$content2 =~ tr/T://d;
$content2 =~ tr/H://d;
my @lines2 = split /;/, $content2;
$tem = $lines2[0];
$hum = $lines2[1];

$tem =~ m{(\d+\.\d+)};
$hum =~ m{(\d+\.\d+)};

But when I print out the line I see something strange: missing characters, spaces in the line, etc. It seems there are some invisible characters causing confusion.

Could you suggest a better way to get the two numbers into two numeric variables?

Thanks Fabio

Fabio

3 Answers

6

A complete solution that avoids parsing HTML with a regex (ref: RegEx match open tags except XHTML self-contained tags):

use strict; use warnings;

# CPAN module (from libwww-perl) to fetch the HTML
use LWP::UserAgent;
# CPAN module to parse the HTML
use HTML::TreeBuilder;

# fetching part
my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => "http://192.168.25.9");
my $res = $ua->request($req);
die $res->status_line, "\n" unless $res->is_success;

# parsing part
my $tree = HTML::TreeBuilder->new();
# get text from HTML
my $out = $tree->parse($res->decoded_content)->format;
# extract the expected string from the text output
if ($out =~ /^\s*T:(\d{2}\.\d{2});H:(\d{2}\.\d{2}).*/) {
    print join "\n", $1, $2;
}

OUTPUT:

17.10
32.10
Gilles Quénot
  • I don't see any point in involving `HTML::TreeBuilder` at all -- certainly not just for the purpose of formatting the HTML. It would also be wise to use `decoded_content` instead of `content`, as you don't know whether the HTTP content is compressed. – Borodin Jan 13 '15 at 21:25
2

For this specific kind of response you can do the following:

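# The optional group (?:\.\d+)? captures the full decimal; with (\d+|\d+.\d+) the second capture stops at "32"
my ($t, $h) = map { /T:(\d+(?:\.\d+)?);H:(\d+(?:\.\d+)?)/ ? ($1, $2) : () } @req;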
print "$t, $h\n", $t * $h;

Output:

17.10, 32.10
548.91

where @req is an array holding the chomped lines of the received response.
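
As a rough sketch of how `@req` might be built, the snippet below splits a response body into lines and then applies the same `map`. The `$content` string is a hypothetical stand-in for the real response (with CRLF line endings assumed, since stray carriage returns are one plausible source of "invisible characters"), and the pattern uses an optional fractional part, which is an assumption about the value format.

```perl
use strict;
use warnings;

# Hypothetical sample body standing in for the real HTTP response.
my $content = "<html>\r\n<head><title></title></head><body>\r\n"
            . "T:17.10;H:32.10\r\n</body></html>\r\n";

# Split on LF or CRLF so that no carriage returns are left on the lines.
my @req = split /\r?\n/, $content;

# Keep the two captures from whichever line matches; other lines yield nothing.
my ($t, $h) = map { /T:(\d+(?:\.\d+)?);H:(\d+(?:\.\d+)?)/ ? ($1, $2) : () } @req;

print "$t, $h\n";
```

Splitting on `/\r?\n/` rather than `/\n/` is the main design choice here: it makes the result independent of which line ending the server sends.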

red0ct
  • This solution seems fine to me, and I don't understand the downvote. The other solutions go into a lot of unnecessary work to format or strip the HTML, which is entirely unnecessary unless you wanted to ensure that the required text is the sole content of the `<body>` element, which none of them do. Your regex is a little naive though, as I think there is a good chance that the numeric values may look like `123.4` or `2.8896`, or even `42`, and your pattern will match none of these. – Borodin Jan 13 '15 at 21:36
  • Thanks for the support, Borodin. Agree with your comments. Fixed regexp. Perhaps now it's more flexible. – red0ct Jan 13 '15 at 22:07
1

For your purpose, this is all you need:

my ($tem, $hum) = $content =~ /T:(\d{2}\.\d{2});H:(\d{2}\.\d{2})/;

If you need a more general parse (e.g. to support a temperature or humidity >= 100, single-digit values, etc.):

my ($tem, $hum) = $content =~ /T:(\d+(?:\.\d+)?);H:(\d+(?:\.\d+)?)/;
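
To show how the captured values could feed the if/else logic mentioned in the question, here is a minimal sketch built around the more general pattern. The helper name `parse_reading`, the sample string, and the thresholds are all hypothetical and only there for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (temperature, humidity) as numbers,
# or an empty list when no reading is found in the text.
sub parse_reading {
    my ($content) = @_;
    my ($tem, $hum) = $content =~ /T:(\d+(?:\.\d+)?);H:(\d+(?:\.\d+)?)/
        or return;
    return ($tem + 0, $hum + 0);    # adding 0 turns the strings into numbers
}

# Hypothetical sample body standing in for the real HTTP response.
my $sample = "<html>\n<head><title></title></head><body>\n"
           . "T:17.10;H:32.10\n</body></html>\n";

my ($tem, $hum) = parse_reading($sample)
    or die "No reading found\n";

# Made-up thresholds, used only to demonstrate numeric comparisons.
if    ($tem > 25) { print "warm\n" }
elsif ($tem < 10) { print "cold\n" }
else              { print "temperature ok\n" }

print "humidity high\n" if $hum > 60;
```

Because the regex is applied to the whole body, there is no need to split the content into lines or strip carriage returns first.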
Ben Grimm
  • I don't see any need to remove the HTML markup. The data is either there or it is not, although I guess there is an infinitesimal chance that a false match could be found in the value of one of the attributes. – Borodin Jan 13 '15 at 21:30
  • That's true, he's not really parsing HTML here. The answer just comes down to the one line: `my ($tem, $hum) = $content =~ /T:(\d{2}\.\d{2});H:(\d{2}\.\d{2})/;` – Ben Grimm Jan 13 '15 at 23:33