How to loop the result from findnodes() with HTML::TreeBuilder::XPath

Question

I have a script that monitors some Facebook pages. Since the Facebook API removed public page access permission on 4-SEP-2019, I need to parse the content with XPath instead.

Each Facebook post is wrapped by div[contains(@class,"userContentWrapper")]. I would like to loop over the posts one by one to find the desired data.

I don't know why $message = $post->findvalue('//div[@data-testid="post_message"]//p'); returns the text of the <p> elements of every post instead of only the current one.

use LWP::UserAgent;
$ua       = new LWP::UserAgent;
$request  = new HTTP::Request;
$request->url('https://www.facebook.com/pg/FIFA/posts/');
$request->method('GET');
$request->header('User-Agent' => 'Mozilla/5.0 Chrome/71.0.3578.98 Safari/537.36');
$response = $ua->request($request);


open(HTM, ">zzz.htm");
print HTM $response->content;
close(HTM);


use HTML::TreeBuilder::XPath;
$tree = HTML::TreeBuilder::XPath->new_from_content($response->content);


$posts = $tree->findnodes('//div[contains(@class,"userContentWrapper")]');


for my $post (@{$posts})
{
    $id =  $post->findnodes('//div[@data-testid="story-subtitle"]/@id');
    $id =  $id->[0]->getValue;
    print "id = $id\n\n";

    $object_id =  $post->findnodes('//div[@data-testid="story-subtitle"]//a/@href');
    $object_id =  'https://www.facebook.com' . $object_id->[0]->getValue;
    print "object_id = $object_id\n\n";

    $message = $post->findvalue('//div[@data-testid="post_message"]//p');
#   $message = $message->[0]->getValue;
    print "$message\n\n";

    $ajaxify =  $post->findnodes('//div[@class="mtm"]//a/@ajaxify');
    $ajaxify =  $ajaxify->[0]->getValue;
    print "ajaxify = $ajaxify\n\n";

    $ploi = $post->findnodes('//div[@class="mtm"]//a/@data-ploi');
    $ploi = $ploi->[0]->getValue;
    print "ploi = $ploi\n\n";

#   $plsi = $post->findnodes('//div[@class="mtm"]//a/@data-plsi');
#   $plsi = $plsi->[0]->getValue;
#   print "plsi = $plsi\n\n";

    $href =  $post->findnodes('//div[@class="mtm"]//a/@href');
    $href =  'https://www.facebook.com' . $href->[0]->getValue;
    print "href = $href\n\n";

    print "---------------------------------------------------------\n\n";
}

When calling `$post->findnodes` the XPath expression should probably start with a dot, e.g.: `'.//div[@data-testid="story-subtitle"]//a/@href'`. The dot makes it a relative path. Starting with a slash means it starts at the top of the document. – Grant McLean, Sep 05 '19 at 20:35
Thanks, it works now. So `$post->findnodes` can still see all of the data, the same as `$tree`. – ต้อง เอกมัย, Sep 06 '19 at 03:37
@ต้อง เอกมัย, a leading `/` means start at the root, just like with file paths. `//div[...]/...` is short for `/descendant-or-self::node()/div[...]/...`. If you want to start with a child of the context node, simply use `div[...]/...`. If you want to start with a descendant of the context node, use `descendant::div[...]/...` or `.//div[...]/...`. – ikegami, Sep 06 '19 at 09:41
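
The difference described in these comments can be seen on a small document. The following is only a minimal sketch: the HTML is a hypothetical stand-in for the Facebook markup (two wrappers with one message each), and the array names `@absolute` and `@relative` are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical stand-in for the real page: two posts, one message each.
my $html = <<'HTML';
<html><body>
<div class="userContentWrapper"><div data-testid="post_message"><p>first post</p></div></div>
<div class="userContentWrapper"><div data-testid="post_message"><p>second post</p></div></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my (@absolute, @relative);
for my $post ($tree->findnodes('//div[contains(@class,"userContentWrapper")]')) {
    # A leading "//" restarts at the document root, so each post
    # matches the <p> elements of all posts.
    push @absolute, $post->findvalue('//div[@data-testid="post_message"]//p');

    # A leading ".//" only searches the descendants of $post.
    push @relative, $post->findvalue('.//div[@data-testid="post_message"]//p');
}

print "absolute: $_\n" for @absolute;
print "relative: $_\n" for @relative;

$tree->delete;
```

With the absolute path, both iterations should print the same combined text; with the relative path, each iteration should print only its own post's message.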

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

The post is unclear and seems to contain multiple questions. This needs to be fixed, but in the meantime, I'll address the following:

I would like to loop over the posts one by one to find the desired data.

From HTML::TreeBuilder::XPath,

findnodes ($path)

Returns a list of nodes found by $path. In scalar context returns a Tree::XPathEngine::NodeSet object.

From Tree::XPathEngine::NodeSet,

get_nodelist()

Returns a list of nodes. See Tree::XPathEngine::XMLParser for the format of the nodes.

So,

my @posts = $tree->findnodes('...');
for my $post (@posts) { ... }

or

my $posts = $tree->findnodes('...');
for my $post ($posts->get_nodelist()) { ... }
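
For reference, here is a self-contained sketch of both forms side by side. The markup and the `post` class are hypothetical and exist only to give the loops something to iterate over; `as_text` is the usual HTML::Element method for getting an element's text.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup, only for illustration.
my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<html><body><div class="post">a</div><div class="post">b</div></body></html>'
);

# List context: findnodes() returns the matching nodes directly.
my @posts = $tree->findnodes('//div[@class="post"]');
my @from_list = map { $_->as_text } @posts;

# Scalar context: findnodes() returns a NodeSet object;
# get_nodelist() unpacks it into a list of nodes.
my $posts = $tree->findnodes('//div[@class="post"]');
my @from_nodeset = map { $_->as_text } $posts->get_nodelist;

print "list context:   @from_list\n";
print "scalar context: @from_nodeset\n";
```

Both loops should visit the same nodes in document order, so which form to use is mostly a matter of taste.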

Any other questions should be posted as separate Questions.

answered Sep 05 '19 at 18:24

ikegami

I have checked that `print $post->as_HTML();` is valid; it changes on every iteration. But `$id` is the same on every iteration, and `$message` looks wrong too. – ต้อง เอกมัย Sep 05 '19 at 19:23
I don't understand. Is that a new question? – ikegami Sep 05 '19 at 21:08
