
Perl's WWW::Mechanize::Firefox has successfully retrieved the contents of the web page and stored them in the scalar variable $content.

my $url = 'http://finance.yahoo.com/quote/AAPL/financials?p=AAPL';
$mech->get($url);
my $content= $mech->content();

In examining $content, I'm interested in identifying and saving all the information between the span tags inside the table.

There are various classes that I have no interest in.

Attempt # 1 did not work.

my $tree = HTML::TreeBuilder->new_from_content($txtRawData);    
my @list = $mech->find('span');

foreach ( @list ) {
print $_->as_HTML();
}

Attempt # 2 did not work.

foreach my $tag ($tree->look_down(_tag => 'span')) {
    my $value = $tag->as_text;  
}

The HTML table of interest is:

<div class="Mt(10px)">
    <table class="Lh(1.7) W(100%) M(0)">
        <tbody>
            <tr class="Bdbw(1px) Bdbc($lightGray) Bdbs(s) H(36px)">
                <td class="Fw(b) Fz(15px)">
                    <span>Revenue</span>
                </td>

                <td class="C($gray) Ta(end)">
                    <span>9/24/2016</span>
                </td>

                <td class="C($gray) Ta(end)">
                    <span>9/26/2015</span>
                </td>

                <td class="C($gray) Ta(end)">
                    <span>9/27/2014</span>
                </td>
            </tr>

            <tr class="Bdbw(1px) Bdbc($lightGray) Bdbs(s) H(36px)">
                <td class="Fz(s) H(35px) Va(m)">
                    <span>Total Revenue</span>
                </td>

                <td class="Fz(s) Ta(end)">
                    <span>
                        <span>215,639,000</span>
                    </span>
                </td>

                <td class="Fz(s) Ta(end)">
                    <span>
                        <span>233,715,000</span>
                    </span>
                </td>

                <td class="Fz(s) Ta(end)">
                    <span>
                        <span>182,795,000</span>
                    </span>
                </td>
            </tr>

            <tr class="Bdbw(1px) Bdbc($lightGray) Bdbs(s) H(36px)">
                <td class="Fz(s) H(35px) Va(m)">
                    <span>Cost of Revenue</span>
                </td>

                <td class="Fz(s) Ta(end)">
                    <span>
                        <span>131,376,000</span>
                    </span>
                </td>

                <td class="Fz(s) Ta(end)">
                    <span>
                        <span>140,089,000</span>
                    </span>
                </td>

                <td class="Fz(s) Ta(end)">
                    <span>
                        <span>112,258,000</span>
                    </span>
                </td>
            </tr>

            <tr class="Bdbw(0px)! H(36px)">
                <td class="Fw(b) Fz(s) Pb(20px)">
                    <span>Gross Profit</span>
                </td>

                <td class="Fw(b) Fz(s) Ta(end) Pb(20px)">
                    <span>
                        <span>84,263,000</span>
                    </span>
                </td>

                <td class="Fw(b) Fz(s) Ta(end) Pb(20px)">
                    <span>
                        <span>93,626,000</span>
                    </span>
                </td>

                <td class="Fw(b) Fz(s) Ta(end) Pb(20px)">
                    <span>
                        <span>70,537,000</span>
                    </span>
                </td>
            </tr>
        </tbody>
    </table>
</div>

What is the best way to select (set focus upon) one specific table (there could be multiple tables inside the $content variable) and save the text between the span tags to an array, to be passed to the next procedure and inserted into a database table?

I also would like to highlight that:

  1. Sometimes, the text is inside two (double) sets of span tags.
  2. There is no table header row (or th tags).

1 Answer


Your first attempt works if you call find on $tree rather than on $mech. Combining it with as_text from your second attempt gives a pretty nice result.

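use strict;
use warnings;
use feature 'say';    # say is not enabled by default
use HTML::TreeBuilder;

# new_from_content accepts a list of strings, so the lines of DATA can be passed directly
my $tree = HTML::TreeBuilder->new_from_content(<DATA>);
my @list = $tree->find('span');

foreach ( @list ) {
    say $_->as_text();    # text only, without the surrounding span tags
}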
__DATA__
<div class="Mt(10px)">
    <table class="Lh(1.7) W(100%) M(0)">
...

This outputs a list of span contents. You should be able to clean those up and work with them.

Revenue
9/24/2016
9/26/2015
9/27/2014
...

Of course as an actual table (array-of-arrays) it would probably make more sense, but for that we'd have to know what it is you are trying to do.
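
As a rough sketch of what an array-of-arrays version could look like: the code below picks one table by its class attribute and collects the text of each td, row by row. It assumes the class string from your sample is stable enough to identify the table, which may not hold on the live page. The two-row HTML snippet is a shortened stand-in for $content, and @rows is just an illustrative name. Reading the text per td rather than per span also sidesteps the double span case, because a nested span is only read once.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Shortened stand-in for $content (assumption: the real page has the same shape)
my $html = <<'HTML';
<table class="Lh(1.7) W(100%) M(0)">
  <tr><td><span>Revenue</span></td><td><span>9/24/2016</span></td></tr>
  <tr><td><span>Total Revenue</span></td><td><span><span>215,639,000</span></span></td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first element matching all criteria.
# Here the class attribute has to match the given string exactly.
my $table = $tree->look_down( _tag => 'table', class => 'Lh(1.7) W(100%) M(0)' )
    or die "table not found";

# One array reference per row, holding the trimmed text of each cell
my @rows;
for my $tr ( $table->look_down( _tag => 'tr' ) ) {
    push @rows, [ map { $_->as_trimmed_text } $tr->look_down( _tag => 'td' ) ];
}

# Print the rows tab-separated; @rows could instead be handed to the database step
print join( "\t", @$_ ), "\n" for @rows;

$tree->delete;    # free the parse tree once the text has been copied out
```

If the class string turns out to vary, look_down also accepts a regular expression (qr/.../) as the attribute value.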

simbabque
  • It returns the output WITH the span tags still attached: `Revenue 9/24/2016 9/26/2015 9/27/2014 Total Revenue 215,639,000 215,639,000 233,715,000 233,715,000 182,795,000 182,795,000` foreach ( @list ) { my $value = $_->as_HTML; print "$value\n"; my $clean = $hs->parse($value); } – Brian Douglas Mar 14 '17 at 00:18
  • @BrianDouglas that's why I combined both your attempts and used the text one in my code. Read it again. ;-) I'll highlight it. – simbabque Mar 14 '17 at 07:20