I have a script that monitors some Facebook pages. Since the Facebook API removed public page access permission on 4-SEP-2019, I now need to parse the page content with XPath.

Each Facebook post is wrapped in div[contains(@class,"userContentWrapper")]. I would like to loop over the posts one by one to find the desired data.

I don't know why $message = $post->findvalue('//div[@data-testid="post_message"]//p'); returns the text of the <p> elements of every post instead of only the current one.

use LWP::UserAgent;
$ua       = new LWP::UserAgent;
$request  = new HTTP::Request;
$request->url('https://www.facebook.com/pg/FIFA/posts/');
$request->method('GET');
$request->header('User-Agent' => 'Mozilla/5.0 Chrome/71.0.3578.98 Safari/537.36');
$response = $ua->request($request);


open(HTM, ">zzz.htm");
print HTM $response->content;
close(HTM);


use HTML::TreeBuilder::XPath;
$tree = HTML::TreeBuilder::XPath->new_from_content($response->content);


$posts = $tree->findnodes('//div[contains(@class,"userContentWrapper")]');


for my $post (@{$posts})
{
    $id =  $post->findnodes('//div[@data-testid="story-subtitle"]/@id');
    $id =  $id->[0]->getValue;
    print "id = $id\n\n";

    $object_id =  $post->findnodes('//div[@data-testid="story-subtitle"]//a/@href');
    $object_id =  'https://www.facebook.com' . $object_id->[0]->getValue;
    print "object_id = $object_id\n\n";

    $message = $post->findvalue('//div[@data-testid="post_message"]//p');
#   $message = $message->[0]->getValue;
    print "$message\n\n";

    $ajaxify =  $post->findnodes('//div[@class="mtm"]//a/@ajaxify');
    $ajaxify =  $ajaxify->[0]->getValue;
    print "ajaxify = $ajaxify\n\n";

    $ploi = $post->findnodes('//div[@class="mtm"]//a/@data-ploi');
    $ploi = $ploi->[0]->getValue;
    print "ploi = $ploi\n\n";

#   $plsi = $post->findnodes('//div[@class="mtm"]//a/@data-plsi');
#   $plsi = $plsi->[0]->getValue;
#   print "plsi = $plsi\n\n";

    $href =  $post->findnodes('//div[@class="mtm"]//a/@href');
    $href =  'https://www.facebook.com' . $href->[0]->getValue;
    print "href = $href\n\n";

    print "---------------------------------------------------------\n\n";
}
  • When calling `$post->findnodes`, the XPath expression should probably start with a dot, e.g. `'.//div[@data-testid="story-subtitle"]//a/@href'`. The dot makes it a relative path; a leading slash means the search starts at the top of the document. – Grant McLean Sep 05 '19 at 20:35
  • Thanks, it works now. So `$post->findnodes` can still see the whole document, just like `$tree`. – ต้อง เอกมัย Sep 06 '19 at 03:37
  • @ต้อง เอกมัย, A leading `/` means start at the root, just like with file paths. `//div[...]/...` is short for `/descendant-or-self::node()/div[...]/...`. If you want to start with a child of the context node, simply use `div[...]/...`. If you want to start with a descendant of the context node, use `descendant::div[...]/...` or `.//div[...]/...`. – ikegami Sep 06 '19 at 09:41
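
To make the difference between the two path forms concrete, here is a minimal sketch. The two-post HTML string is a hypothetical stand-in for the real page, and the `@absolute` / `@relative` arrays are illustrative names only; the selectors are the ones from the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical two-post document standing in for the real page markup.
my $html = <<'HTML';
<html><body>
<div class="userContentWrapper"><div data-testid="post_message"><p>first post</p></div></div>
<div class="userContentWrapper"><div data-testid="post_message"><p>second post</p></div></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my (@absolute, @relative);
for my $post ($tree->findnodes('//div[contains(@class,"userContentWrapper")]')) {
    # Leading "//": evaluated from the document root, whatever $post is.
    push @absolute, scalar $post->findvalue('//div[@data-testid="post_message"]//p');

    # Leading ".//": evaluated against the descendants of $post only.
    push @relative, scalar $post->findvalue('.//div[@data-testid="post_message"]//p');
}

print "absolute: $_\n" for @absolute;
print "relative: $_\n" for @relative;

$tree->delete;    # free the parse tree
```

With the absolute form, every iteration should return the text of both posts; with the relative form, each iteration should return only the text of its own post.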

1 Answer

The post is unclear and seems to contain multiple questions. This needs to be fixed, but in the meantime, I'll address the following:

I would like to loop over the posts one by one to find the desired data.


From HTML::TreeBuilder::XPath,

findnodes ($path)

Returns a list of nodes found by $path. In scalar context returns an Tree::XPathEngine::NodeSet object.

From Tree::XPathEngine::NodeSet,

get_nodelist()

Returns a list of nodes. See Tree::XPathEngine::XMLParser for the format of the nodes.

So,

my @posts = $tree->findnodes('...');
for my $post (@posts) { ... }

or

my $posts = $tree->findnodes('...');
for my $post ($posts->get_nodelist()) { ... }
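
As a runnable illustration of the two forms, here is a small sketch. The inline HTML and the `post` class are hypothetical stand-ins, not the real page markup; `as_text` is the usual HTML::Element method for reading an element's text.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical stand-in document; the real page markup is not reproduced here.
my $tree = HTML::TreeBuilder::XPath->new_from_content(
    '<html><body><div class="post">a</div><div class="post">b</div></body></html>'
);

# List context: findnodes returns the nodes themselves.
my @posts     = $tree->findnodes('//div[@class="post"]');
my @from_list = map { $_->as_text } @posts;

# Scalar context: findnodes returns a Tree::XPathEngine::NodeSet object,
# and get_nodelist() unpacks it into the same list of nodes.
my $posts        = $tree->findnodes('//div[@class="post"]');
my @from_nodeset = map { $_->as_text } $posts->get_nodelist();

print "list context:   @from_list\n";
print "scalar context: @from_nodeset\n";

$tree->delete;    # free the parse tree
```

Both loops should visit the same nodes in the same order, so the choice between them is mostly a matter of taste.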

Any other questions should be posted as separate Questions.

ikegami