
I am attempting to scrape a website using the DOMXPath query method. I have successfully scraped the 20 profile URLs (one per News Anchor) from this page.

$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value = $n->nodeValue;
    $profileurl[] = $value;
}

I then used the resulting array as the list of URLs to scrape data from each News Anchor's bio page.

$elementCount = count($profileurl);

$imgurl = array();
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHtmlFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//img[@class='photo fn']/@src");

    foreach ($nodelist as $n) {
        $value = $n->nodeValue;
        $imgurl[] = $value;
    }
}

Each News Anchor profile page has six XPath expressions I need to scrape (the $imgurl array above is one of them). I then send this scraped data to MySQL.

So far, everything works great - except when I try to get the Twitter URL from each profile, because that element isn't present on every News Anchor profile page. As a result, MySQL receives five columns with 20 full rows and one column (twitterurl) with only 18 rows of data. Those 18 rows don't line up with the other data, because when the XPath matches nothing, that iteration is simply skipped.

How do I account for missing XPath matches? Looking for an answer, I found a statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being so, if there is no nodeValue, how can I programmatically recognize that an XPath matched nothing and fill that iteration with some default value before the loop moves on to the next iteration?

Here's the query for the Twitter URLs:

$twitterurl = array();
for ($z = 0; $z < $elementCount; $z++) {
    $html = new DOMDocument();
    @$html->loadHtmlFile($profileurl[$z]);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");

    foreach ($nodelist as $n) {
        $value = $n->nodeValue;
        $twitterurl[] = $value;
    }
}
  • Web scraping is data collection, not data mining (as in: statistical methods for advanced data analysis). There is an appropriate tag - [tag:web-scraping]. – Has QUIT--Anony-Mousse Oct 14 '14 at 20:49

2 Answers


Since the Twitter node appears zero or one times, replace the foreach block with

$twitterurl[] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;

That will keep the contents in sync. You will, however, have to make arrangements to handle NULL values in the query you use to insert them in the database.
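
Handling the NULLs on the database side is straightforward with a prepared statement, since PDO binds a PHP NULL as an SQL NULL. A minimal, self-contained sketch (using an in-memory SQLite database so it runs anywhere; the table and column names are made up, and with MySQL only the DSN would differ):

```php
<?php
// Hypothetical table/column names; sqlite::memory: stands in for MySQL here.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE anchors (profileurl TEXT, twitterurl TEXT)');

$profileurl = ['http://example.com/anchor-a', 'http://example.com/anchor-b'];
$twitterurl = ['http://www.twitter.com/AnchorA', null]; // second anchor: no Twitter

$stmt = $pdo->prepare('INSERT INTO anchors (profileurl, twitterurl) VALUES (?, ?)');
foreach ($profileurl as $z => $url) {
    // A PHP NULL is stored as SQL NULL, so the row count stays at 20
    // and later rows are not shifted up into the gap.
    $stmt->execute([$url, $twitterurl[$z]]);
}
```

A row inserted with NULL stays aligned with the other columns instead of shifting the remaining rows.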

Mike
  • Thank you for your suggestion. I changed the code as you advised, but nothing appears to have changed. I did however find a workaround for now. I moved `$twitterurl[] = $value;` outside of the foreach loop and it returns 20 results in the array now. – point71echo Oct 14 '14 at 23:02
  • That line was instead of the foreach block, which it sounds like you have done. – Mike Oct 15 '14 at 07:38
  • Your last comment cleared up my misunderstanding on your first response. I tried the code you suggested again by replacing the foreach block completely and it worked better than my previous solution. Thank you for your guidance. – point71echo Oct 15 '14 at 17:23
  • You're welcome. Beware: your previous solution would have created the correct number of entries but would have used the previous value if a new one wasn't found. This mistake is so common, there should be a name for it. – Mike Oct 15 '14 at 19:42

I think you have multiple issues in the way you scrape the data. I will try to outline them in my answer, in the hope that it also clarifies your central question:

I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?

First of all collecting the URLs of each profile (detail) page is a good idea. You can even benefit more from it by putting this into the overall context of your scraping job:

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

This is the structure you have with the data you like to obtain. You already managed to obtain all profile pages URLs:

$url   = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath    = new DOMXPath($html);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

As you know that the next step is to load and query the 20+ profile pages, one of the first things you could do is extract the part of your code that creates a DOMXPath from a URL into a function of its own. This also makes better error handling easy:

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}

Merely extracting (moving) that code into the xpath_from_url function already compresses the main processing:

$xpath    = xpath_from_url($url);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

But it also allows another change to the code: you can now process the URLs directly in the structure of your main extraction routine:

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    // ... extract the six (inkl. optional) values from a profile
}

As you can see, this code skips creating the array of profile URLs, because the collection of all profile URLs is already given by the first XPath operation.

Now the part that extracts the up-to-six fields from each detail page is still missing. With this new way of iterating over the profile URLs, that is easy to manage: just create one XPath expression per field and fetch the data. If you use DOMXPath::evaluate instead of DOMXPath::query, you can get string values directly. The string-value of a non-existing node is an empty string. Note that this does not really test whether the node exists; in case you need NULL instead of "" (empty string), it needs to be done differently (I can show that too, but that's not the point right now). The following example extracts each anchor's name and role:

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    // ... extract the other four (inkl. optional) values from a profile
}

I chose to output the values directly (rather than collecting them into an array or similar structure) so that it's easy to follow what happens:

#01: Marc Bailey (Morning Anchor)
#02: Heather Myers (Morning Anchor)
#03: Jim Patton (10pm Anchor)
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
...
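
As an aside, here is a small self-contained demo of the "" vs. NULL distinction mentioned above; the inline HTML snippet is made up for illustration, but the class names match the profile pages:

```php
<?php
// Missing node: evaluate() yields "", query() lets you map it to NULL.
$html = new DOMDocument();
$html->loadHTML(
    '<div><p class="bio-twitter url"><a href="http://www.twitter.com/Example">t</a></p></div>'
);
$xpath = new DOMXPath($html);

// evaluate() with string() never fails; a missing node simply yields "":
$twitter  = $xpath->evaluate('string(//*[@class="bio-twitter url"]/a/@href)');
$facebook = $xpath->evaluate('string(//*[@class="bio-facebook url"]/a/@href)');
var_dump($twitter);  // string(30) "http://www.twitter.com/Example"
var_dump($facebook); // string(0) ""

// query() exposes the node count, so "missing" can become NULL instead:
$nodes    = $xpath->query('//*[@class="bio-facebook url"]/a/@href');
$facebook = $nodes->length ? $nodes->item(0)->nodeValue : null;
var_dump($facebook); // NULL
```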

Fetching the details about email, facebook and twitter works the same:

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf(
        "  email...: %s\n",
        $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")')
    );
    printf(
        "  facebook: %s\n",
        $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)')
    );
    printf(
        "  twitter.: %s\n",
        $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)')
    );
}

This already outputs the data as you need it (I've left the images out because they can't be displayed well in text mode):

#01: Marc Bailey (Morning Anchor)
  email...: m.bailey@sandiego6.com
  facebook: https://www.facebook.com/marc.baileySD6
  twitter.: http://www.twitter.com/MarcBaileySD6
#02: Heather Myers (Morning Anchor)
  email...: heather.myers@sandiego6.com
  facebook: https://www.facebook.com/heather.myersSD6
  twitter.: http://www.twitter.com/HeatherMyersSD6
#03: Jim Patton (10pm Anchor)
  email...: jim.patton@sandiego6.com
  facebook: https://www.facebook.com/Jim.PattonSD6
  twitter.: http://www.twitter.com/JimPattonSD6
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
  email...: Neda.Iranpour@sandiego6.com
  facebook: https://www.facebook.com/lightenupwithneda
  twitter.: http://www.twitter.com/@LightenUpWNeda
...

So these few lines of code, with a single foreach loop, already represent the original structure fairly well:

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

All you have to do is follow, with your code, the overall structure of how the data is available. Then at the end, once you see that all the data can be obtained as wished, you do the store operation in the database: one insert per profile, that is, one row per profile. You don't have to keep the whole data set in memory; you can just insert the data for each row (perhaps with a check whether it already exists).
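
That step might be sketched like this (the table and column names are hypothetical, and $pdo is assumed to be a connected PDO instance; documentURI, as discussed in the comments below, serves as the key):

```php
<?php
// One INSERT per profile; table/column names are made up for the sketch.
function insert_profile(PDO $pdo, DOMXPath $profile)
{
    $stmt = $pdo->prepare(
        'INSERT INTO anchors (profileurl, name, role, img, email, facebook, twitter)
         VALUES (?, ?, ?, ?, ?, ?, ?)'
    );
    $stmt->execute([
        // The URL the page was loaded from - usable as the key
        $profile->document->documentURI,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])'),
        $profile->evaluate('string(//img[@class="photo fn"]/@src)'),
        $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'),
        $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'),
        $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'),
    ]);
}
```

Calling insert_profile($pdo, $profile) inside the foreach loop above, in place of the printf() calls, stores one row per profile; missing fields simply arrive as empty strings.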

Hope that helps.


Appendix: Code in full

<?php
/**
 * Scraping detail pages based on index page
 */

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1, $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf("  email...: %s\n", $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'));
    printf("  facebook: %s\n", $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'));
    printf("  twitter.: %s\n", $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'));
}
hakre
  • I tried the code out and it works great. Thank you for your extremely helpful and thorough suggestion. As soon as I have enough reputation points, I'll be sure to give you a well-deserved + 1 for your input. Is it possible to include each profileUrl in the output? I want to use that as my Primary Key for the SQL table. – point71echo Oct 19 '14 at 23:55
  • @point71echo: You should be able to obtain it from `$profile->document->documentURI`, see http://php.net/domxpath for all fields of **DOMXPath**, it also cross-links the **DOMDocument** docs which are showing the `documentURI` field. – hakre Oct 20 '14 at 07:39
  • That worked perfectly as well. Thank you for all of your help. – point71echo Oct 20 '14 at 17:10