Regex to pull data off website

Question

I am looking for a job, and I am working on a script that will run from cron once a day. It pulls text and links from a website. I am helpless when it comes to regex patterns.

Here is an example of what data I am pulling from:

<div class="cat-list-item job-list-item">

<h3 class="expressway full-width"><a href="/about/careers/network_engineer_voip_telephony">Network Engineer - VoIP Telephony</a></h3>

<div class="career-summary">

    <p>
        Provide daily support, proactive maintenance and independent troubleshooting, and identify capacity/performance issues to ensure
    </p>

</div>

<p class="locations-heading"><b>Locations</b></p>

<ul class="locations-list normal">


    <li>
        Elizabethtown Headquarters
    </li>

</ul>

<div class="list-bottom">
    <a class="learn-more replace" href="/about/careers/network_engineer_voip_telephony">Learn More</a>
</div>

Here is what I have so far:

<?php
$url = "http://bluegrasscellular.com/about/careers/";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<h3 class=\"expressway full-width\"><a\s[^>]*href=\"\/about\/careers\/(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[2] = link address
        // $match[3] = link text
        echo "<a href='http://bluegrasscellular.com/about/careers/{$match[2]}'>{$match[3]}</a><br>";
    }
}
?>

All that does, however, is pull the text and href off the link. I also want to grab the following:

Provide daily support, proactive maintenance and independent troubleshooting, and identify capacity/performance issues to ensure
Elizabethtown Headquarters

Eventually I want to store these in a database and be notified of any new positions. I have no clue how to go about this. Any help is greatly appreciated.

Use [`DomDocument`](http://us1.php.net/manual/en/class.domdocument.php) — Jon, May 03 '13 at 20:13
And BTW... I do realize the text looks skewed... I left it like that because that is how it is on the site. I figured it would need to be exact for regex. — alexander7567, May 03 '13 at 20:32
+1 for supplying a code attempt and examples for your problem. — halfer, May 03 '13 at 20:42
**Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. — Andy Lester, May 03 '13 at 20:48
@AndyLester I second your comment. I always refer HTML Parsing with REGEX to [this lovely post on SO.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Walls, May 03 '13 at 20:54
@Walls: I wish you wouldn't refer to that post. Although you and I may find it funny, that post is meaningless to people who don't understand why it is a bad idea to parse HTML with a regex. It doesn't tell them what the problem is and it doesn't tell them the right way to do it. That's why I created http://htmlparsing.com/, specifically so we *can* give them the right answer easily. — Andy Lester, May 03 '13 at 20:57
I never bothered to look at that post till now... Let's just say I saved it for future laughs lol. (It appears they are thinking about deleting it, so thanks to Evernote, I saved it!) But I will definitely look into the DOMDocument class. — alexander7567, May 06 '13 at 15:02

Expedito · Accepted Answer · 2013-05-06T22:30:34.440

Use the DOMDocument class. Start with the following:

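$doc = new DOMDocument();
// Load the HTML string into the document object. $html holds the page source
// (for example, $input from the question); the @ hides warnings about malformed markup.
if ( ! @$doc->loadHTML($html)){
    return FALSE;
}
// Create an XPath object using the document object as the parameter
$xpath = new DOMXPath($doc);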
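
The `@` in front of `loadHTML` silences every warning, including ones you may want to see while debugging. As a hedged alternative, the sketch below asks libxml to collect parse warnings internally so they can be inspected and then discarded. The `$html` string is a hypothetical stand-in for the downloaded page, not the real site content.

```php
<?php
// Hypothetical stand-in for the downloaded page source.
$html = '<div class="career-summary"><p>Provide daily support</p></div>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Keep a copy of what the parser complained about, then reset the buffer
// and restore the previous error-handling mode.
$warnings = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

if (!$loaded) {
    exit("Could not parse the HTML\n");
}

$xpath = new DOMXPath($doc);
echo count($warnings), " parser warning(s)\n";
```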

And then you need to write a query for each element you want to extract. To get the text in the "career-summary" div, you could use the following XPath query:

$query = "//div[@class='career-summary']";
// XPath queries return a DOMNodeList
$res = $xpath->query($query);
$text = trim($res->item(0)->nodeValue);

I haven't tested it, but that's the general idea. The following query should get the text from the specified list element:

$query = "//ul[@class='locations-list normal']";

For this sort of thing, it's well worth your while to learn XPath queries. They're much better suited than regular expressions to working with HTML or XML.
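
Putting the two queries together, here is a minimal sketch that runs them against a trimmed copy of the markup from the question rather than the live page. The shortened summary text and the closing `</div>` on the outer element are assumptions made so the snippet is self-contained.

```php
<?php
// Trimmed copy of the question's markup, used instead of a live download.
$html = <<<'HTML'
<div class="cat-list-item job-list-item">
<h3 class="expressway full-width"><a href="/about/careers/network_engineer_voip_telephony">Network Engineer - VoIP Telephony</a></h3>
<div class="career-summary">
    <p>
        Provide daily support, proactive maintenance and independent troubleshooting
    </p>
</div>
<ul class="locations-list normal">
    <li>
        Elizabethtown Headquarters
    </li>
</ul>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// item(0) is null when nothing matches, so check the length first.
$summaryNodes = $xpath->query("//div[@class='career-summary']");
$summary = $summaryNodes->length > 0 ? trim($summaryNodes->item(0)->nodeValue) : '';

$locationNodes = $xpath->query("//ul[@class='locations-list normal']");
$location = $locationNodes->length > 0 ? trim($locationNodes->item(0)->nodeValue) : '';

echo $summary, "\n", $location, "\n";
```

Note that `@class='locations-list normal'` compares the whole attribute string, so the query stops matching if the site adds or reorders class names. `contains(@class, 'locations-list')` is a looser alternative.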

EDIT:

For accessing multiple items, you might have to change your query. For example, if there are multiple list items, you can change the query as follows:

$query = "//ul[@class='locations-list normal']/li";

The "/li" selects the list items within the "ul" tag that has the specified class. Once you have your results, you can loop through them with a foreach loop:

$out = array();
foreach ($res as $node){
    $out[] = $node->nodeValue;
}
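
To handle many job listings on one page, one option is to select each listing's container first and then run queries relative to it. The sketch below assumes each listing is wrapped in its own `job-list-item` div, as the question's sample suggests. The second listing is invented for illustration, and the array keys are hypothetical names.

```php
<?php
// Two listings modelled on the question's markup; the second is invented.
$html = <<<'HTML'
<div class="cat-list-item job-list-item">
<h3 class="expressway full-width"><a href="/about/careers/network_engineer_voip_telephony">Network Engineer - VoIP Telephony</a></h3>
<div class="career-summary"><p> Provide daily support </p></div>
<ul class="locations-list normal"><li> Elizabethtown Headquarters </li></ul>
</div>
<div class="cat-list-item job-list-item">
<h3 class="expressway full-width"><a href="/about/careers/example_position">Example Position</a></h3>
<div class="career-summary"><p> Example summary </p></div>
<ul class="locations-list normal"><li> Example City </li><li> Example Town </li></ul>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$jobs = array();
foreach ($xpath->query("//div[contains(@class, 'job-list-item')]") as $item) {
    // A leading "." makes each query relative to the current listing,
    // passed as the context node in the second argument.
    $link = $xpath->query(".//h3/a", $item)->item(0);
    $summary = $xpath->query(".//div[@class='career-summary']", $item)->item(0);

    $locations = array();
    foreach ($xpath->query(".//ul[@class='locations-list normal']/li", $item) as $li) {
        $locations[] = trim($li->nodeValue);
    }

    $jobs[] = array(
        'title'     => $link ? trim($link->nodeValue) : '',
        'href'      => $link ? $link->getAttribute('href') : '',
        'summary'   => $summary ? trim($summary->nodeValue) : '',
        'locations' => $locations,
    );
}

print_r($jobs);
```

Each entry keeps its `href`, which could serve as a key when storing listings in a database and checking for positions that were not seen before.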

@alexander7567 - I tested the code and it worked as expected. — Expedito, May 03 '13 at 20:58
it looks like this will do what I want it to then. but my next question is how would I be able to loop through them all? there is at least 20 different job listings at a time. — alexander7567, May 06 '13 at 21:45
ok so pretty much just a foreach loop. thanks! Time to get started studying! — alexander7567, May 07 '13 at 00:00
