Trying to retrieve the contents of a div from an external site with PHP and XPath

This is an excerpt from the page, showing the relevant code. Note: I tried several variations, including adding @ to the class and a at the end of my query. After that, I use saveHTML() to get the result. See my test:

My XPath: //*[@id="post-15991"]/div[4]/div[1]
The URL: https://wordpress.org/plugins/wp-job-manager/

Here is the code:

<?PHP
$url = 'https://wordpress.org/plugins/wp-job-manager/';
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom);
$elements = $xpath->query('//*[@id="post-15991"]/div[4]/div[1]');
$link = $dom->saveHTML($elements->item(0));
echo $link;
?>

Output: nothing is printed.

Background:

I got the XPath using Google Chrome. These are some of the pages I want to get data from:

https://wordpress.org/plugins/wp-job-manager/
https://wordpress.org/plugins/participants-database/
https://wordpress.org/plugins/amazon-link/
https://wordpress.org/plugins/simple-membership/
https://wordpress.org/plugins/scrapeazon/

Goal: I need the following data:

Version:
Last updated:
Active installations:
Tested up to:

See for example the following, from view-source:https://wordpress.org/plugins/wp-job-manager/

  • Version: 1.29.3
  • Last updated: 5 days ago
  • Active installations: 100,000+

    <li>Requires WordPress Version:<strong>4.3.1</strong></li>
    <li>Tested up to: <strong>4.9.2</strong></li>

    Background: I need the data from all my favorite plugins and want to have it in a database or a spreadsheet. There are approximately 70 pages to scrape.

    Here is the full XPath for the example:

    //*[@id="post-15991"]/div[4]/div[1]
    

    And for job-board-manager:

    //*[@id="post-519"]/div[4]/div[1]/ul/li[1]
    //*[@id="post-519"]/div[4]/div[1]/ul/li[2]
    //*[@id="post-519"]/div[4]/div[1]/ul/li[3]
    //*[@id="post-519"]/div[4]/div[1]/ul/li[7]
    

    I used the method described in "Is there a way to get the XPath in Google Chrome?":

    Right-click the item whose XPath you want and choose "Inspect".
    Right-click the highlighted element in the developer tools panel.
    Choose Copy > Copy XPath.
    
    zero

    1 Answer

    You are calling loadHTMLFile(), which expects a file path, and the @ in front of the call hides its warnings. With warnings turned on, you will see the following:

    E_WARNING : type 2 -- DOMDocument::loadHTMLFile(): Attribute class redefined in https://wordpress.org/plugins/wp-job-manager/, line: 190 -- at line 5

    E_WARNING : type 2 -- DOMDocument::loadHTMLFile(): Tag header invalid in https://wordpress.org/plugins/wp-job-manager/, line: 201 -- at line 5

    E_WARNING : type 2 -- DOMDocument::loadHTMLFile(): Tag nav invalid in https://wordpress.org/plugins/wp-job-manager/, line: 205 -- at line 5

    E_WARNING : type 2 -- DOMDocument::loadHTMLFile(): Tag main invalid in https://wordpress.org/plugins/wp-job-manager/, line: 224 -- at line 5
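
Messages like these are reported by libxml, the parser behind DOMDocument, and the "Tag ... invalid" ones typically concern HTML5 elements such as header, nav and main. A minimal sketch of collecting them for inspection instead of silencing them with @ might look like this; parseHtmlCollectingWarnings() is a hypothetical helper name and the sample markup is invented for illustration. Which messages appear, if any, depends on the installed libxml version.

```php
<?php
// Hypothetical helper: parse an HTML string and return the document
// together with any parser messages, without printing warnings.
function parseHtmlCollectingWarnings(string $html): array
{
    // Buffer libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Empty the buffer and restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$dom, $messages];
}

// Invented sample containing HTML5 elements.
$sample = '<html><body><header><nav>menu</nav></header>'
        . '<main><p>content</p></main></body></html>';

[$dom, $messages] = parseHtmlCollectingWarnings($sample);

foreach ($messages as $message) {
    echo $message, PHP_EOL;
}
```

The document is still built when such messages are reported, so they can be logged or ignored once they have been reviewed.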

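    Instead, fetch the page and pass its HTML to loadHTML(). Note that loadHTML() parses a string of markup; handing it the URL itself would only parse the URL as text.

    $url = 'https://wordpress.org/plugins/wp-job-manager/';
    // Download the markup first (requires allow_url_fopen to be enabled).
    $html = file_get_contents($url);
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $elements = $xpath->query('//*[@id="post-15991"]/div[4]/div[1]');
    // item(0) is null when nothing matches, so check before saving.
    if ($elements->length > 0) {
        echo $dom->saveHTML($elements->item(0));
    }
    

    And the result would be the markup of the matched div, or nothing if the XPath does not match the HTML the server sends.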
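
To get the individual fields the question asks for, one option is to select each li by its label text rather than by position, since the positional paths differ between plugins (li[7] in one of the question's examples). The following is a minimal sketch, not a tested scraper: extractPluginMeta() is a hypothetical helper, and the sample markup assumes every field is written as a label followed by <strong>value</strong>, as in the two li lines quoted from view-source. The live page may be structured differently.

```php
<?php
// Hypothetical helper: return the <strong> value of the first <li>
// whose text starts with each given label, or null when none matches.
function extractPluginMeta(string $html, array $labels): array
{
    // Keep libxml's HTML5 tag warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $meta = [];
    foreach ($labels as $label) {
        // Labels are assumed to be fixed strings without double quotes.
        // starts-with keeps "Version:" from matching
        // "Requires WordPress Version:".
        $query = sprintf(
            '//li[starts-with(normalize-space(.), "%s")]/strong',
            $label
        );
        $nodes = $xpath->query($query);
        $meta[$label] = $nodes->length > 0
            ? trim($nodes->item(0)->textContent)
            : null;
    }
    return $meta;
}

// Sample markup: the last two items are quoted in the question,
// the first three are assumed to follow the same pattern.
$sample = '<ul>'
    . '<li>Version: <strong>1.29.3</strong></li>'
    . '<li>Last updated: <strong>5 days ago</strong></li>'
    . '<li>Active installations: <strong>100,000+</strong></li>'
    . '<li>Requires WordPress Version:<strong>4.3.1</strong></li>'
    . '<li>Tested up to: <strong>4.9.2</strong></li>'
    . '</ul>';

$labels = ['Version:', 'Last updated:', 'Active installations:', 'Tested up to:'];
$meta = extractPluginMeta($sample, $labels);
print_r($meta);
```

For a live page, the string passed in would be the downloaded HTML, for example from file_get_contents($url), and the same call could be repeated in a loop over the list of plugin URLs before writing the rows to a database or spreadsheet.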
    
    Chin Leung
    • Hello and good day, and many thanks. Please see the original post and the goal that I have: I want the data `Last updated:`, `Active installations:` and `Tested up to:`. See for example view-source:https://wordpress.org/plugins/wp-job-manager/ with `Version: 1.29.3 Last updated: 5 days ago Active installations: 100,000+`. How do I retrieve those results? – zero Feb 13 '18 at 00:09