I am trying to scrape a list of links in this format using DOM:

<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>
<h2 class="h2"><a href="and-another">List item 3</a></h2>

I need to have both the contents of the <h2> (e.g. "List item 1") and the accompanying href (e.g. "this-is-a-link") as variables in PHP.

I can scrape each one separately using a foreach loop but once I try to print both at once by nesting foreach loops, I get each <h2> repeating itself several times.

Am I on the right track, or is there a better way to go about this?

Edit

I should say that I'm scraping a variety of sites. Some have the format above, but for others the <a> is elsewhere, e.g. in the containing div.

Here is my code:

    function jobscrape($name, $url, $jobpage_url_root, $job_title_location, $job_title_url_location, $job_text) {

        echo "<h3>".$name."</h3>";

        // CREATE NEW DOM DOCUMENT BASED ON JOBLIST URL
        $html = file_get_contents($url);
        $doc = new DOMDocument();
        libxml_use_internal_errors(TRUE);

        // CHECK IF ANY HTML IS RETURNED (I.E. IF ABOVE HAS WORKED)
        if(!empty($html)) {

            // LOAD HTML INTO DOM DOCUMENT, CREATE NEW XPATH AND SET VARIABLE FOR THE JOB TITLE LOCATION
            $doc->loadHTML($html);
            libxml_clear_errors(); // remove errors for yucky html
            $xpath = new DOMXPath($doc);

            // LOOP THROUGH JOBS LIST
            $row = $xpath->query("$job_title_location");
            // CHECK IF THERE ARE ANY ROWS MATCHING THE ABOVE LOCATION
            if ($row->length > 0) {
                // PULL THOSE ROWS INTO AN ARRAY
                foreach ($row as $jobpage_titles) {
                    // SET THE JOBPAGE TITLE VARIABLE
                    $jobpage_title = $jobpage_titles->nodeValue;
                    // echo $jobpage_title."<br>";

                    // LOOP THROUGH JOBS PAGE URLS
                    $row2 = $xpath->query("$job_title_url_location");
                    // CHECK IF THERE ARE ANY ROWS MATCHING THE ABOVE LOCATION
                    if ($row2->length > 0) {
                        //echo $jobpage_title." - hello<br>";
                        // PULL THOSE ROWS INTO AN ARRAY
                        foreach ($row2 as $jobpage_urls) {
                            // TRY TO PRINT VARIABLE FROM BEFORE
                            $href = $jobpage_url_root.$jobpage_urls->attributes->getNamedItem('href')->value;
                            echo "<a href='".$href."'>".$jobpage_title."</a><br>";
                        }
                    }
                }
            }
        }
    }

My output is each list element printed once for every URL, e.g.:

<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 1</a></h2>
<h2 class="h2"><a href="and-another">List item 1</a></h2>

<h2 class="h2"><a href="this-is-a-link">List item 2</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>
<h2 class="h2"><a href="and-another">List item 2</a></h2>

<h2 class="h2"><a href="this-is-a-link">List item 3</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 3</a></h2>
<h2 class="h2"><a href="and-another">List item 3</a></h2>

Just on a bigger scale because I'm scraping more than three things.

Sebastian
    please show what you've tried so far, in which you get the unexpected output. also paste said unexpected output as is. unless today is your lucky day, nobody is going to write code from scratch for you. – skrilled Feb 21 '14 at 00:31

3 Answers

0

You probably don't need to nest foreach loops in this case. The inner query runs against the whole document on every pass of the outer loop, so each title gets paired with every link on the page. Since the href attribute and the text node belong to the same element, you can read both in the same iteration of a single loop.
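As a rough sketch of the single-loop idea, assuming markup like the sample in the question: the helper name `extractLinks`, its parameters, and the `http://example.com/` URL root are hypothetical, made up for illustration. It selects one node per list item and then looks up the link relative to that node by passing the node as the second argument to `DOMXPath::query()`.

```php
<?php
// Sample markup from the question; a real scraper would fetch it instead.
$html = <<<HTML
<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>
<h2 class="h2"><a href="and-another">List item 3</a></h2>
HTML;

/**
 * Hypothetical helper: returns title/href pairs, one per matched row.
 * $rowQuery selects one node per list item; $linkQuery is evaluated
 * relative to that node, so each title is paired only with its own link.
 */
function extractLinks(string $html, string $rowQuery, string $linkQuery, string $urlRoot = ''): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate messy real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];

    foreach ($xpath->query($rowQuery) as $row) {
        // The second argument makes the query relative to the current row.
        $anchor = $xpath->query($linkQuery, $row)->item(0);
        if ($anchor === null) {
            continue; // no link found for this row
        }
        $links[] = [
            'title' => trim($row->textContent),
            'href'  => $urlRoot . $anchor->getAttribute('href'),
        ];
    }

    return $links;
}

// Assumed URL root, for illustration only.
$links = extractLinks($html, '//h2[@class="h2"]', './/a', 'http://example.com/');

foreach ($links as $link) {
    echo "<a href='" . htmlspecialchars($link['href'], ENT_QUOTES) . "'>"
        . htmlspecialchars($link['title']) . "</a><br>\n";
}
```

For sites where the `<a>` sits in the containing div rather than inside the `<h2>`, a relative expression such as `ancestor::div[1]//a` could be passed as `$linkQuery` instead, though the right expression depends on each site's markup.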

0

Have you considered using regex for scraping links?

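    // Non-greedy (.*?) stops at the first closing quote/tag instead of the last one
    preg_match_all('#<h2 class="h2"><a href="(.*?)">(.*?)</a></h2>#', $string, $matches);
    foreach ($matches[1] as $key => $value) {
        // $matches[1] holds the hrefs, $matches[2] the link texts
        echo $value . " = " . $matches[2][$key] . "<br>";
    }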
Curtis W
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – skrilled Feb 21 '14 at 00:42
  • While I agree that HTML can't be parsed by regex, if you only need a specific snippet of the code, regex can serve the purpose. – dab Feb 21 '14 at 00:47
0

You can use regular expressions for something like this. Loop through each line, put it into $string, and then do something like this:

<?php

$string = '<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>';

preg_match('/^<h2 class="h2">(<a href="[-A-Z0-9_.]+">)([-A-Z0-9 ._]+)<\/a><\/h2>$/i', $string, $matches);

print "<pre>"; print_r($matches); print "</pre>";

That will output:

Array
(
    [0] => <h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>
    [1] => <a href="this-is-a-link">
    [2] => List item 1
)

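The items you want are stored in $matches[1] and $matches[2]. Note that $matches[1] holds the whole opening <a> tag, not just the href value; to capture only the URL, move the parentheses inside the quotes: href="([-A-Z0-9_.]+)".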
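A minimal sketch of a variant that captures only the href value and handles several headings in one string, assuming the markup matches the sample in the question exactly (any change in attributes or whitespace would break the pattern). The two-line `$string` here is made up for illustration.

```php
<?php
// Two headings from the question's sample, joined into one string.
$string = '<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>' . "\n"
        . '<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>';

// [^"]* cannot run past the closing quote, and [^<]* stops at the next tag.
// PREG_SET_ORDER groups the captures per match: $m[1] is the href, $m[2] the text.
preg_match_all(
    '#<h2 class="h2"><a href="([^"]*)">([^<]*)</a></h2>#i',
    $string,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $m) {
    echo $m[1] . ' = ' . $m[2] . "\n";
}
```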

Quixrick