Scraping output in nested PHP foreach loops

Question

I am trying to scrape a list of links in this format using DOM:

<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>
<h2 class="h2"><a href="and-another">List item 3</a></h2>

I need to have both the contents of the <h2> (e.g. "List item 1") and the accompanying href (e.g. "this-is-a-link") as variables in PHP.

I can scrape each one separately using a foreach loop, but once I try to print both at once by nesting foreach loops, each <h2> repeats itself several times.

Am I on the right track, or is there a better way to go about this?

Edit

I should say that I'm scraping a variety of sites. Some have the format above, but for others the <a> is elsewhere, e.g. in the containing div.

Here is my code:

    function jobscrape($name, $url, $jobpage_url_root, $job_title_location, $job_title_url_location, $job_text) {

    echo "<h3>".$name."</h3>";

    // CREATE NEW DOM DOCUMENT BASED ON JOBLIST URL
    $html = file_get_contents($url);
    $doc = new DOMDocument();
    libxml_use_internal_errors(TRUE);

    // CHECK IF ANY HTML IS RETURNED (I.E. IF ABOVE HAS WORKED)
    if(!empty($html)) {

        // LOAD HTML INTO DOM DOCUMENT, CREATE NEW XPATH AND SET VARIABLE FOR THE JOB TITLE LOCATION
        $doc->loadHTML($html);
        libxml_clear_errors(); // remove errors for yucky html
        $xpath = new DOMXPath($doc);

// LOOP THROUGH JOBS LIST
        $row = $xpath->query("$job_title_location");
        // CHECK IF THERE ARE ANY ROWS MATCHING THE ABOVE LOCATION
        if ($row->length > 0) {
            // PULL THOSE ROWS INTO AN ARRAY
            foreach ($row as $jobpage_titles) {
                // SET THE JOBPAGE TITLE VARIABLE
                $jobpage_title = $jobpage_titles->nodeValue;
                // echo $jobpage_title."<br>";

// LOOP THROUGH JOBS PAGE URLS
                $row2 = $xpath->query("$job_title_url_location");
                // CHECK IF THERE ARE ANY ROWS MATCHING THE ABOVE LOCATION
                if ($row2->length > 0) {
                    //echo $jobpage_title." - hello<br>";
                    // PULL THOSE ROWS INTO AN ARRAY
                    foreach ($row2 as $jobpage_urls) {
                        // TRY TO PRINT VARIABLE FROM BEFORE
                        $href = $jobpage_url_root.$jobpage_urls->attributes->getNamedItem('href')->value;
                        echo "<a href='".$href."'>".$jobpage_title."</a><br>";
                    }
                }
            }
        }
    }
}

My output is each list element printed once for every URL, e.g.:

<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 1</a></h2>
<h2 class="h2"><a href="and-another">List item 1</a></h2>

<h2 class="h2"><a href="this-is-a-link">List item 2</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>
<h2 class="h2"><a href="and-another">List item 2</a></h2>

<h2 class="h2"><a href="this-is-a-link">List item 3</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 3</a></h2>
<h2 class="h2"><a href="and-another">List item 3</a></h2>

Just on a bigger scale because I'm scraping more than three things.

please show what you've tried so far, in which you get the unexpected output. also paste said unexpected output as is. unless today is your lucky day, nobody is going to write code from scratch for you. — skrilled, Feb 21 '14 at 00:31

score 0 · Accepted Answer · answered Feb 21 '14 at 00:33

You probably don't need to nest foreach loops in this case. Since you're getting the href attribute of an element and the text node of the same element, it could be done in the same iteration through the loop with no nesting.
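A minimal sketch of that single-loop approach, assuming the markup from the question. The XPath expression and the $jobs array are illustrative choices, not part of the original code:

```php
<?php
// Sample markup copied from the question; a real scraper would fetch it instead.
$html = '<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>'
      . '<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>'
      . '<h2 class="h2"><a href="and-another">List item 3</a></h2>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from imperfect HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// One query, one loop: each <a> node carries both the href and the title text.
$jobs = [];
foreach ($xpath->query('//h2[@class="h2"]/a') as $link) {
    $jobs[] = [
        'title' => trim($link->nodeValue),
        'href'  => $link->getAttribute('href'),
    ];
}

// Output happens in a separate pass over the collected pairs.
foreach ($jobs as $job) {
    echo $job['href'] . ' = ' . $job['title'] . "\n";
}
```

Because the title and the href come from the same node, they cannot get out of step with each other, which is what happens when two independent node lists are nested.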

Robbie McAlister

Sorry, I should have said - I'm scraping a variety of sites. Some have the format above but for others the <a> is elsewhere, e.g. in the containing div. – Sebastian Feb 21 '14 at 00:34
Oh sorry. In any case, if the <a> is a child element of the <div>, you can still avoid the nested looping. You'll only need to loop through the outer elements and then find the nodes you're looking for inside with regex as mentioned above or another DOM navigation method. – Robbie McAlister Feb 21 '14 at 00:40
Does that still allow for scalability? For example, if I wanted to include a title, the title's link *and* a description? – Sebastian Feb 21 '14 at 00:43
Yes. There is nothing unscalable about grabbing those specific items in each iteration. The nested loops are less scalable because as the page grows you'll greatly multiply the number of passes required for your desired output. – Robbie McAlister Feb 21 '14 at 01:23
After reading your edited code above, your outer loop should store the nodes based on the identified URLs in a temporary array without attempting to loop through them there. Then in another loop, walk over that array to output what you need. You will have one loop to identify the items and then another loop to process your output. – Robbie McAlister Feb 21 '14 at 01:29
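A rough sketch of how the comments above could look in code, for the case where the link sits in the containing element rather than inside the <h2>. The markup, the "job" class name, and the $items array are hypothetical; the one real mechanism relied on is the second argument of DOMXPath::query(), which makes a query relative to a context node:

```php
<?php
// Hypothetical markup: the link lives in the container, not in the <h2>.
$html = '<div class="job"><a href="job-1">More</a><h2 class="h2">List item 1</h2><p>First description</p></div>'
      . '<div class="job"><a href="job-2">More</a><h2 class="h2">List item 2</h2><p>Second description</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from imperfect HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First loop: walk the outer containers once and collect what each one holds.
$items = [];
foreach ($xpath->query('//div[@class="job"]') as $container) {
    // Passing $container as context keeps each lookup inside this one entry.
    $titleNode = $xpath->query('.//h2', $container)->item(0);
    $linkNode  = $xpath->query('.//a', $container)->item(0);
    $descNode  = $xpath->query('.//p', $container)->item(0);

    $items[] = [
        'title'       => $titleNode ? trim($titleNode->nodeValue) : null,
        'href'        => $linkNode ? $linkNode->getAttribute('href') : null,
        'description' => $descNode ? trim($descNode->nodeValue) : null,
    ];
}

// Second loop: process the collected items for output.
foreach ($items as $item) {
    echo $item['href'] . ' = ' . $item['title'] . ' (' . $item['description'] . ")\n";
}
```

Adding another field, such as the description here, costs one more relative lookup per container rather than another level of nesting.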

score 0 · Answer 2 · answered Feb 21 '14 at 00:36

Have you considered using regex for scraping links?

// Quote-bounded and non-greedy groups keep each match inside a single <h2>.
preg_match_all('#<h2 class="h2"><a href="([^"]*)">(.*?)</a></h2>#', $string, $matches);
foreach ($matches[1] as $key => $value) {
    echo $value . " = " . $matches[2][$key] . "<br>";
}

Curtis W

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – skrilled Feb 21 '14 at 00:42
While I agree that HTML can't be parsed by regex, if you only need a specific snippet of the code, regex can serve the purpose. – dab Feb 21 '14 at 00:47

score 0 · Answer 3 · answered Feb 21 '14 at 00:46

You can use regular expressions for something like this. Loop through each line and put it into $string. Then you can do something like this:

<?php

$string = '<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>';

preg_match('/^<h2 class="h2">(<a href="[-A-Z0-9_.]+">)([-A-Z0-9 ._]+)<\/a><\/h2>$/i', $string, $matches);

print "<pre>"; print_r($matches); print "</pre>";

That will output:

Array
(
    [0] => <h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>
    [1] => <a href="this-is-a-link">
    [2] => List item 1
)

The items you want would be stored in $matches[1] and $matches[2].
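Note that $matches[1] above holds the whole opening <a> tag. If only the href value is wanted, one possible variation (a sketch, not part of the original answer) is to move the first capture group so it wraps just the attribute value:

```php
<?php
$string = '<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>';

// Same pattern as above, but the first group now wraps only the attribute value.
preg_match('/^<h2 class="h2"><a href="([-A-Z0-9_.]+)">([-A-Z0-9 ._]+)<\/a><\/h2>$/i', $string, $matches);

$href  = $matches[1]; // the link target
$title = $matches[2]; // the link text
```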

I couldn't tell if you wanted the ` – Quixrick Feb 21 '14 at 00:48
