1

I am parsing a real estate web page, using HTML::TreeBuilder, and have the following code:

$values{"Pcity"} = $address->look_down("_tag" => "span", 
                   "itemprop" => "addressLocality")->as_text;
$values{"PState"} = $address->look_down("_tag" => "span", 
                   "itemprop" => "addressRegion")->as_text;

Some pages don't contain city or state, and the parser exits with an error:

Can't call method "as_text" on an undefined value

To fix it I used the following method:

$values{"Pcity"} = $address->look_down("_tag" => "span", 
                   "itemprop" => "addressLocality");
if(defined($values{"Pcity"}))
{
    $values{"Pcity"} = $values{"Pcity"}->as_text;
}
else
{
    $values{"Pcity"} = '';
}

It works, but now instead of 1 line I have 9. And as I have many places like this the code will become considerably bigger.

Is there any way to optimize?

user4035
  • 22,508
  • 11
  • 59
  • 94

2 Answers2

2

Assuming that $address never contains more than one <span> with either of the given values for the itemprop attribute, you could write this

for my $span ( $address->look_down(_tag => 'span') ) {
   my $itemprop    = $span->attr('itemprop');
   $values{Pcity}  = $span->as_text if $itemprop eq 'addressLocality';
   $values{PState} = $span->as_text if $itemprop eq 'addressRegion';
}

But accessing HTML trees is made much more simple by the use of HTML::TreeBuilder::XPath, which allows the structure to be indexed using XPath expressions instead of the clumsy look_down. A solution using it would look like this, with the proviso that findvalue returns an empty string '' for non-existent nodes, rather than undef; but that should be workable for you as it still evaluates to false.

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my $xp = HTML::TreeBuilder::XPath->new_from_file(*DATA);

my %values;

$values{Pcity}  = $xp->findvalue('//span[@itemprop="addressLocality"]');
$values{PState} = $xp->findvalue('//span[@itemprop="addressRegion"]');

use Data::Dump;
dd \%values;

__DATA__
<html>
<head>
  <title>Title</title>
</head>
<body>
  <span itemprop="addressLocality">My Locality</span>
  <span itemprop="addressRegion">My Region</span>
</body>
</html>

output

{ Pcity => "My Locality", PState => "My Region" }
Borodin
  • 126,100
  • 9
  • 70
  • 144
1

This is shorter:

$a = $address->look_down("_tag" => "span", "itemprop" => "addressLocality");
$values{"Pcity"} = $a ? $a->as_text : '';
Miguel Prz
  • 13,718
  • 29
  • 42
  • That could be written `$values{Pcity} = $a && $a->as_text` with the minimal difference that `$values{Pcity}` ends up as `undef` instead of the null string if the span element isn't found. But note that you should never use either `$a` or `$b` anywhere except within the block of a `sort` statement as those identifiers are reserved. – Borodin Sep 06 '14 at 23:06
  • in this example the use of $a hasn't side effects – Miguel Prz Sep 07 '14 at 07:06
  • `$a` and `$b` should be avoided as a matter of course, in the same way that the `strict` and `warnings` pragma should always be included, even if they make no difference to a given program. There is nothing to say that a modification may not add a call to `sort`, and introduce obscure bugs. In this instance you have no idea whether the OP's code already contains `sort`. Aside from that, `$a` is useless as an identifier as it is meaningless. – Borodin Sep 07 '14 at 12:57
  • as far as there is not state between method calls saved in $a, there is nothing unsafe in the use of $a in that way – Miguel Prz Sep 07 '14 at 13:24
  • Look, it's your program and you can code it any way you want. But the risk is there; it's common wisdom to avoid those variables; and if you want to take that risk for the sake of using a useless and inappropriate identifier then go ahead. Just, *please*, don't give it as an example on Stack Overflow to others who don't know any better and, if asked, would rather not take the risk. You should add a warning to your posts if you continue to do this. – Borodin Sep 07 '14 at 13:26