Is using PHP's explode() for HTML scraping considered a bad practice?

Question

I have been coding for a while now but just can't seem to get my head around regular expressions.

This brings me to my question: is it bad practice to use PHP's explode() to break up a string of HTML and select bits of text? I need to scrape a page for various bits of information, and because of my horrific regex knowledge (in a full software engineering degree I had to write maybe one) I decided on explode().

I have provided my code below so someone more seasoned than me can tell me if it's essential that I use regex for this or not!

public function split_between($start, $end, $blob)
{
    $strip = explode($start,$blob);
    $strip2 = explode($end,$strip[1]);
    return $strip2[0];
}

public function get_abstract($pubmed_id)
{
    $scrapehtml = file_get_contents("http://www.ncbi.nlm.nih.gov/m/pubmed/".$pubmed_id);
    $data['title'] = $this->split_between('<h2>','</h2>',$scrapehtml);
    $data['authors'] = $this->split_between('<div class="auth">','</div>',$scrapehtml);
    $data['journal'] = $this->split_between('<p class="j">','</p>',$scrapehtml);
    $data['aff'] = $this->split_between('<p class="aff">','</p>',$scrapehtml);
    $data['abstract'] = str_replace('<p class="no_t_m">','',str_replace('</p>','',$this->split_between('<h3 class="no_b_m">Abstract','</div>',$scrapehtml)));
    $strip = explode('<div class="ids">', $scrapehtml);
    $strip2 = explode('</div>', $strip[1]);
    $ids[] = $strip2[0];
    if (isset($strip[2]) && strpos($strip[2], "PMCID") !== false)
    {
        $step = explode('</div>', $strip[2]);
        $ids[] = $step[0];
    }
    $id_count = 0;
    foreach ($ids as &$value) {
        $value = str_replace("<h3>", "", $value);
        $data['ids'][$id_count]['id'] = str_replace("</h3>", "", str_replace('<span>','',str_replace('</span>','',$value)));
        $id_count++;
    }

    $jsonAbstract = json_encode($data);

    echo $this->indent($jsonAbstract);
}

I would say yes, it's not a great approach. Try [How to parse and process HTML with PHP?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php). — Jared Farrish, Feb 19 '12 at 23:06
There are DOM parsers which don't use regexp, this is even worse. — Dejan Marjanović, Feb 19 '12 at 23:07
[Who can resist?](http://stackoverflow.com/a/1732454/451969) — Jared Farrish, Feb 19 '12 at 23:11
@JaredFarrish You should form an answer from your comments :-) — Dave Watts, Feb 19 '12 at 23:16

score 3 · Accepted Answer · answered Feb 19 '12 at 23:13

I highly recommend you try out the PHP Simple HTML DOM Parser library. It handles invalid HTML and has been designed to solve the same problem you're working on.

A simple example from the documentation is as follows:

// Assumes the library's simple_html_dom.php file has already been included
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
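
For comparison, roughly the same extraction can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of the library above. This is only a sketch: the sample markup is hypothetical, modeled on the tags and class names in the question, and `first_text()` is a helper name made up for illustration. The real PubMed page may be structured differently.

```php
<?php
// Hypothetical markup modeled on the selectors used in the question;
// the real PubMed page may differ.
$sampleHtml = <<<HTML
<html><body>
<h2>Example title</h2>
<div class="auth">A. Author, B. Author</div>
<p class="j">Example Journal. 2012</p>
</body></html>
HTML;

// Hypothetical helper: trimmed text of the first node matching an XPath query, or null.
function first_text(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$doc = new DOMDocument();
// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$data = [
    'title'   => first_text($xpath, '//h2'),
    'authors' => first_text($xpath, '//div[@class="auth"]'),
    'journal' => first_text($xpath, '//p[@class="j"]'),
    'aff'     => first_text($xpath, '//p[@class="aff"]'), // absent in the sample, so null
];

echo json_encode($data), "\n";
```

A missing element simply yields null here, whereas the explode() version would hit an undefined array index.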

Thanks Thinkswan, I will take a look at this approach! — nmford, Feb 19 '12 at 23:46

score 1 · Answer 2 · answered Feb 19 '12 at 23:13

It's not essential to use regular expressions for anything, although it'll be useful to get comfortable with them and know when to use them.

It looks like you're scraping PubMed, which I'm guessing has fairly static mark-up. If what you have works and performs as you hope, I can't see any reason to switch over to regular expressions; they're not necessarily going to be any quicker in this example.
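
If you do stay with plain string functions, one thing that might be worth guarding against is a marker that is missing from the page. Here is a minimal sketch, assuming you would rather get null than a notice; `split_between_safe()` is a hypothetical name, not something from the question. Note it also returns null when the end marker is absent, whereas the original would return the rest of the string.

```php
<?php
// Sketch of a stricter split_between(): returns null when a marker is absent.
function split_between_safe(string $start, string $end, string $blob): ?string
{
    $from = strpos($blob, $start);
    if ($from === false) {
        return null; // start marker not found
    }
    $from += strlen($start);
    $to = strpos($blob, $end, $from);
    if ($to === false) {
        return null; // end marker not found after the start marker
    }
    return substr($blob, $from, $to - $from);
}

// Hypothetical sample markup for illustration only.
$html = '<h2>Example title</h2><p class="j">Example Journal</p>';
var_dump(split_between_safe('<h2>', '</h2>', $html));           // the title text
var_dump(split_between_safe('<p class="aff">', '</p>', $html)); // NULL: marker missing
```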

Answered by Dave Watts

Thanks Dave. Yes it is fairly static mark-up and my code does work as-is, however I'm developing an open source app so I really want to use the best practices possible. The majority seem to be recommending some form of a DOM parser which I will try. – nmford Feb 19 '12 at 23:48
I think that's probably the way to go. Inevitably you or someone else will want to scrape more information from these pages (or similar pages elsewhere) ;-). Using some kind of DOM parser will help keep things more maintainable and open for change. – Dave Watts Feb 20 '12 at 00:00

score -1 · Answer 3 · answered Feb 19 '12 at 23:07

Learn regular expressions and try to use a language that has libraries for this kind of task, like Perl or Python. It will save you a lot of time. They might seem daunting at first, but they are really easy for most tasks. Try reading this: http://perldoc.perl.org/perlre.html
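
To give a rough idea of what this looks like without leaving PHP, here is a small sketch using preg_match(). The sample markup is hypothetical, and the pattern assumes a single, non-nested `<h2>` with no attributes, so it only suits simple, predictable markup.

```php
<?php
// Hypothetical markup; the pattern assumes a single, non-nested <h2> element.
$html = '<h2>Example title</h2><p class="j">Example Journal</p>';

// (.*?) is a lazy capture group: it matches as little as possible up to the first </h2>.
// The "s" modifier lets "." match newlines; "i" makes the tag match case-insensitive.
if (preg_match('~<h2>(.*?)</h2>~si', $html, $matches) === 1) {
    $title = trim($matches[1]);
} else {
    $title = null; // no match
}

var_dump($title);
```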

Answered by AlfredoVR

This is a PHP question; suggesting "switch to *language x*" is probably a poor answer. `:)` Especially when there are appropriate PHP-related techniques. – Jared Farrish Feb 19 '12 at 23:09
He said that he can't understand regular expressions. Blame the language: they are neither an integral part of PHP nor fast. – AlfredoVR Feb 19 '12 at 23:11
I don't know, some believe that [regex'ing markup](http://stackoverflow.com/a/1732454/451969) is flawed to begin with. Language-agnostic. – Jared Farrish Feb 19 '12 at 23:13
You don't necessarily need to learn regular expressions, but I highly recommend you use an HTML parsing library. – Graham Swan Feb 19 '12 at 23:14
In any case, rolling your own parser is very hard. – AlfredoVR Feb 19 '12 at 23:16
If you had the option of deriving the information contained in a webpage using jQuery or regexes, would you pick regexes? – Jared Farrish Feb 19 '12 at 23:24
