I am trying to scrape a list of links in this format using DOM:
<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>
<h2 class="h2"><a href="and-another">List item 3</a></h2>
I need both the contents of the <h2> (e.g. "List item 1") and the accompanying href (e.g. "this-is-a-link") as variables in PHP.
I can scrape each one separately using a foreach loop, but once I try to print both at once by nesting foreach loops, each <h2> is repeated several times.
Am I on the right track, or is there a better way to go about this?
Edit
I should say that I'm scraping a variety of sites. Some have the format above, but for others the <a> is elsewhere, e.g. in the containing div.
Here is my code:
function jobscrape($name, $url, $jobpage_url_root, $job_title_location, $job_title_url_location, $job_text) {
    echo "<h3>".$name."</h3>";
    // CREATE NEW DOM DOCUMENT BASED ON JOBLIST URL
    $html = file_get_contents($url);
    $doc = new DOMDocument();
    libxml_use_internal_errors(TRUE);
    // CHECK IF ANY HTML IS RETURNED (I.E. IF ABOVE HAS WORKED)
    if(!empty($html)) {
        // LOAD HTML INTO DOM DOCUMENT, CREATE NEW XPATH AND SET VARIABLE FOR THE JOB TITLE LOCATION
        $doc->loadHTML($html);
        libxml_clear_errors(); // remove errors for yucky html
        $xpath = new DOMXPath($doc);
        // LOOP THROUGH JOBS LIST
        $row = $xpath->query("$job_title_location");
        // CHECK IF THERE ARE ANY ROWS MATCHING THE ABOVE LOCATION
        if ($row->length > 0) {
            // PULL THOSE ROWS INTO AN ARRAY
            foreach ($row as $jobpage_titles) {
                // SET THE JOBPAGE TITLE VARIABLE
                $jobpage_title = $jobpage_titles->nodeValue;
                // echo $jobpage_title."<br>";
                // LOOP THROUGH JOBS PAGE URLS
                $row2 = $xpath->query("$job_title_url_location");
                // CHECK IF THERE ARE ANY ROWS MATCHING THE ABOVE LOCATION
                if ($row2->length > 0) {
                    //echo $jobpage_title." - hello<br>";
                    // PULL THOSE ROWS INTO AN ARRAY
                    foreach ($row2 as $jobpage_urls) {
                        // TRY TO PRINT VARIABLE FROM BEFORE
                        $href = $jobpage_url_root.$jobpage_urls->attributes->getNamedItem('href')->value;
                        echo "<a href='".$href."'>".$jobpage_title."</a><br>";
                    }
                }
            }
        }
    }
}
My output is each list element printed once for every URL, e.g.:
<h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 1</a></h2>
<h2 class="h2"><a href="and-another">List item 1</a></h2>
<h2 class="h2"><a href="this-is-a-link">List item 2</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>
<h2 class="h2"><a href="and-another">List item 2</a></h2>
<h2 class="h2"><a href="this-is-a-link">List item 3</a></h2>
<h2 class="h2"><a href="this-is-another-link">List item 3</a></h2>
<h2 class="h2"><a href="and-another">List item 3</a></h2>
Just on a bigger scale, because I'm scraping more than three items.
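The repetition seems to come from the inner query running against the whole document on every pass of the outer loop, so each title gets paired with every URL. One possible alternative is a single loop that looks up the link relative to each title node, using the optional context-node argument of DOMXPath::query. This is only a minimal sketch under assumptions: the function name pair_titles_with_links, its parameters, and the sample markup and XPath expressions are hypothetical and not part of the code above, and it assumes the relative expression selects an element node that carries the href.

```php
<?php
// Hypothetical helper (not from the original code): pairs each title node
// with a link located relative to that node, rather than re-querying the
// whole document inside the loop.
function pair_titles_with_links(string $html, string $titleQuery, string $relativeLinkQuery, string $urlRoot = ''): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // suppress warnings from messy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pairs = [];

    foreach ($xpath->query($titleQuery) as $titleNode) {
        // The second argument makes the expression relative to $titleNode.
        $links = $xpath->query($relativeLinkQuery, $titleNode);
        if ($links === false || $links->length === 0) {
            continue; // no link found for this title, so skip it
        }
        $link = $links->item(0); // assumed to be an element with an href
        $pairs[] = [
            'title' => trim($titleNode->nodeValue),
            'href'  => $urlRoot . $link->getAttribute('href'),
        ];
    }

    return $pairs;
}

// Sample markup copied from the format shown in the question.
$sample = '<div><h2 class="h2"><a href="this-is-a-link">List item 1</a></h2>'
        . '<h2 class="h2"><a href="this-is-another-link">List item 2</a></h2>'
        . '<h2 class="h2"><a href="and-another">List item 3</a></h2></div>';

$pairs = pair_titles_with_links($sample, '//h2[@class="h2"]', './a');

foreach ($pairs as $pair) {
    echo "<a href='" . $pair['href'] . "'>" . $pair['title'] . "</a><br>\n";
}
```

For sites where the <a> is not inside the <h2>, the relative expression could presumably walk up first, e.g. something like ancestor::div[1]//a, though the exact expression would depend on each site's markup.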