Regular expression to extract the second table from an HTML page

Question

How do I build a regular expression to extract the content of a <table>? I want to scrape a website, but I need only the second table on the page, not the first. I am doing this:

preg_match('/<table[^>]+cellspacing="0"[^>]*>(.*?)<\/table>/s', $returnCurl, $features);

and the HTML is here

I want the "features" table only.

Second duplicate today, sigh... why is everyone trying to parse HTML with regular expressions where it's **impossible?** — The Paramagnetic Croissant, May 26 '14 at 21:36
@user3477950 Actually it's not `impossible`, but it should be avoided in most all cases. — hwnd, May 26 '14 at 21:40
@hwnd as explained in the linked post, it is impossible to write a proper, fully-functional, general HTML parser solely using regular expressions, since HTML is not a regular language. It may be possible to parse specific snippets of HTML (or a subset thereof) with regular expressions, though, but that is not the general case. — The Paramagnetic Croissant, May 26 '14 at 21:41
@MarcB In the general case, it **is** impossible. (I did *not* assert that there's no HTML that can possibly be processed using regexes. You seem to be confusing "there exists !X" with "there does not exist X".) — The Paramagnetic Croissant, May 26 '14 at 21:42

Lawrence Cherone · Answer 1 · 2014-05-26T22:14:53.453

I think that accept was premature. If you want to do it with DOMDocument, here is a generic DOM scraping class I built earlier; it's very basic. There's also Simple HTML DOM if you want more features, but the bottom line is: don't use regex to parse HTML!

<?php 
$site = 'http://www.grossiste-informatique.com/grossiste/detail_article_popup.php?code_article=POA/F200CA-KX019H';

$scraper = new DOMScraper();

//Set site and get source
$scraper->setSite($site)
        ->setSource();


echo '<table cellspacing="0" cellpadding="3" border="0" width="100%">',
        //match tables with cellpadding="3" and return only their inner content
        $scraper->getInnerHTML('table', 'cellpadding=3'), 
     '</table>';

/**
 * Generic DOM scraper using DOMDocument and cURL
 */
Class DOMScraper extends DOMDocument{
    public $site;
    private $source;
    private $dom;

    function __construct(){
        // Initialise the underlying DOMDocument before touching its properties
        parent::__construct();
        libxml_use_internal_errors(true);
        $this->preserveWhiteSpace = false;
        $this->strictErrorChecking = false;
        $this->formatOutput = true;
    }

    function setSite($site){
        $this->site = $site;
        return $this;
    }

    function setSource(){
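        // Throw instead of returning a string, so a chained call cannot run on an error message
        if(empty($this->site))throw new RuntimeException('Missing $this->site, use setSite() first');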
        $this->source = $this->get_data($this->site);
        return $this;
    }

    function getInnerHTML($tag, $id=null, $nodeValue = false){
        // Check the fetched source (not the site URL), matching the error message
        if(empty($this->source))return 'Error: Missing $this->source, use setSource() first';
        $this->loadHTML($this->source);
        $tmp = $this->getElementsByTagName($tag);
        $ret = null;
        foreach ($tmp as $v){
            if($id !== null){
                $attr = explode('=',$id);
                if($v->getAttribute($attr[0])==$attr[1]){
                    if($nodeValue == true){
                        $ret .= trim($v->nodeValue);
                    }else{
                        $ret .= $this->innerHTML($v);
                    }
                }
            }else{
                if($nodeValue == true){
                    $ret .= trim($v->nodeValue);
                }else{
                    $ret .= $this->innerHTML($v);
                }
            }
        }
        return $ret;
    }

    function innerHTML($dom){
        $ret = "";
        $nodes = $dom->childNodes;
        foreach($nodes as $v){
            $tmp = new DOMDocument();
            $tmp->appendChild($tmp->importNode($v, true));
            $ret .= trim($tmp->saveHTML());
        }
        return $ret;
    }

    function get_data($url){
        if(function_exists('curl_init')){
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 5);
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            $data = curl_exec($ch);
            curl_close($ch);
            return $data;
        }else{
            return file_get_contents($url);
        }
    }
}
?>

score -3 · Accepted Answer · edited May 23 '17 at 12:20


I'll be the first to link you to the relevant post

Use DOMDocument instead.
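
For what it's worth, here is a minimal sketch of that approach, not a definitive implementation. It assumes the page is already in a string (the $html sample below is made up; in the question it would be $returnCurl) and that "second table" means the second <table> element in document order. The helper name nthTableInnerHtml is hypothetical.

```php
<?php
// Made-up sample markup standing in for the fetched page ($returnCurl in the question).
$html = '<html><body>'
      . '<table cellspacing="0"><tr><td>first</td></tr></table>'
      . '<table cellspacing="0"><tr><td>features</td></tr></table>'
      . '</body></html>';

// Hypothetical helper: inner HTML of the table at zero-based position $index, or null if absent.
function nthTableInnerHtml(string $html, int $index): ?string
{
    // Collect libxml warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tables are listed in document order; item() returns null when out of range.
    $table = $doc->getElementsByTagName('table')->item($index);
    if ($table === null) {
        return null;
    }

    // Serialise each child node of the table to get its inner HTML.
    $inner = '';
    foreach ($table->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

$features = nthTableInnerHtml($html, 1); // index 1 = second table in document order
echo $features;
```

Note that getElementsByTagName also counts nested tables, so if the first table contains another table, index 1 would point at that nested one; in that case the index (or the selection logic) would need adjusting for the real page.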

Also, if you REALLY want (and you really should not want this), you can try this regex (untested):

preg_match('/<table[^>]+>.*?<table[^>]+>(.*?)<\/table>/is', $returnCurl, $features);
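
To illustrate the "and with that I mean break" part, here is a small check of that pattern against made-up markup (the $html sample is an assumption, not the page from the question). The second top-level table contains a nested table, which is enough to cut the capture short.

```php
<?php
// Made-up markup: the second top-level table ("b") contains a nested table ("c").
$html = '<table id="a"><tr><td>first</td></tr></table>'
      . '<table id="b"><tr>'
      . '<td><table id="c"><tr><td>inner</td></tr></table></td>'
      . '<td>features</td>'
      . '</tr></table>';

$pattern = '/<table[^>]+>.*?<table[^>]+>(.*?)<\/table>/is';
$matched = preg_match($pattern, $html, $m);

// The lazy (.*?) stops at the first </table>, which closes the nested table "c",
// so the capture ends before the "features" cell is reached.
$captured  = $matched === 1 ? $m[1] : null;
$truncated = $captured !== null && strpos($captured, 'features') === false;

var_dump($truncated); // bool(true)
```

The pattern also requires at least one character after "<table" (the [^>]+ part), so a bare <table> tag with no attributes would not be matched at all.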

answered May 26 '14 at 21:37 · GManz · last edited by Community

thx Hosh your regex is good ! – Astoria Andrew May 26 '14 at 21:43

Why say "use DOMDocument" but then revert to using regex? – Lawrence Cherone May 26 '14 at 21:43

@LozCheroneツ Because doing a proper DOMDocument implementation will take a lot longer than coming up with a quick regex that may work (and with that I mean break). Anyway I've given OP advice on what to do, if he wants, he can look into it, if he's too lazy to do so, I'm not going to waste my time for him. – GManz May 26 '14 at 21:45
