Why is the returned data empty when using cURL and regex?

Question

Please help me check this code. I think the regex I wrote has a problem, but I don't know how to fix it:

function get_data($url)
{
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$content = get_data('http://ibongda.vn/lich-thi-dau-bong-da.hs');
$regex = '/<div id="zone-schedule-group-by-season">(.*)<\/div>/';
preg_match($regex, $content, $matches);
$table = $matches[1];
print_r($table);

The bug isn't in your regex, it's in your design. Regex is not the correct tool to parse HTML. I suggest looking at one of the 'soup' family of HTML parsers; at a glance, http://simplehtmldom.sourceforge.net/ looks like a good option. — Brian Melton-Grace - MSFT, Oct 20 '14 at 02:33
I tried simplehtmldom, but it's very slow. My hosting has PHP 5.3, so I can't use the newest Goutte version. I don't know any other way :( — Nam Nguyen, Oct 20 '14 at 02:34
Using DOM is never slower than RegExp once the input is just DOM. — King King, Oct 20 '14 at 03:30

hwnd · Answer 1 · 2014-10-20T02:49:46.237

2

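I would advise against using a regular expression for this. You should use DOM for this task.

The problem with your regular expression is that it runs into newline sequences. By default the dot does not match a newline, so (.*) can only consume up to the end of the current line. The engine then backtracks through that line looking for </div>, does not find it, and the match fails. (Backtracking is what a regex engine does when an attempt fails: it gives back characters and tries the alternatives.) You need the s (dotall) modifier, which makes the dot match newlines as well. The pattern below also uses ~ as the delimiter, so the / in </div> no longer needs escaping, and a lazy .*? so the match stops at the first </div>.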

$regex = '~<div id="zone-schedule-group-by-season">(.*?)</div>~s';
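
To see the effect of the modifier in isolation, here is a minimal sketch that runs both variants of the pattern against a made-up fragment (the sample markup is an assumption and will not match the real page). Because .*? is lazy, the match ends at the first </div>, so a nested div inside the block would cut the capture short, which is one more reason to prefer DOM.

```php
<?php
// Hypothetical sample markup; the real page will differ.
$html = "<div id=\"zone-schedule-group-by-season\">\n"
      . "<table><tr><td>Match 1</td></tr></table>\n"
      . "</div>";

// Without the s modifier, "." does not match "\n", so the pattern
// can never get from the opening tag to </div> on a later line.
$withoutS = preg_match('~<div id="zone-schedule-group-by-season">(.*?)</div>~', $html, $m1);

// With the s modifier, "." also matches newlines.
$withS = preg_match('~<div id="zone-schedule-group-by-season">(.*?)</div>~s', $html, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)

// Always check that the group exists before using it.
echo isset($m2[1]) ? trim($m2[1]) : 'no match';
```

In the original code, $matches[1] is read without checking the return value of preg_match(), which is why the output is simply empty when the match fails.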

edited Oct 20 '14 at 02:49

answered Oct 20 '14 at 02:35


I'll go with DOM, many thanks :) I got it :) – Nam Nguyen Oct 20 '14 at 03:24

Kevin · Accepted Answer · 2014-10-20T03:01:15.860

I suggest not using regex to parse this. You can use an HTML parser instead, DOMDocument with XPath in particular.

function get_data($url)
{
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$content = get_data('http://ibongda.vn/lich-thi-dau-bong-da.hs');
$dom = new DOMDocument();
libxml_use_internal_errors(true); // handle errors yourself
$dom->loadHTML($content);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$table_rows = $xpath->query('//div[@id="zone-schedule-group-by-season"]/table/tbody/tr[@class!="bg-gd" and @class!="table-title"]'); // these are the rows of that table

foreach($table_rows as $row) { // loop over each tr
    foreach($row->childNodes as $child) { // child nodes: the td cells plus any whitespace text nodes
        if(trim($child->nodeValue) != '') { // skip empty cells and whitespace
            echo trim($child->nodeValue) . '<br/>';
        }
    }
    echo '<hr/>';
}
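
Since the code above depends on a live page, here is a self-contained sketch of the same query run against made-up markup. The helper name extract_rows() and the sample HTML are assumptions for illustration only. It also guards against the case where curl_exec() returned false or an empty body, which loadHTML() cannot parse.

```php
<?php
// Hypothetical helper: returns the non-empty cell texts of each schedule row.
function extract_rows($content)
{
    // curl_exec() returns false on failure; an empty body is equally unusable.
    if ($content === false || trim($content) === '') {
        return array();
    }

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($content);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = '//div[@id="zone-schedule-group-by-season"]/table/tbody'
           . '/tr[@class!="bg-gd" and @class!="table-title"]';

    $rows = array();
    foreach ($xpath->query($query) as $tr) {
        $cells = array();
        foreach ($tr->childNodes as $child) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $cells[] = $text;
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Made-up markup that mimics the structure the XPath query expects.
$sample = '<div id="zone-schedule-group-by-season"><table><tbody>'
        . '<tr class="table-title"><td>Premier League</td></tr>'
        . '<tr class="row"><td>20:00</td><td>Team A</td><td>Team B</td></tr>'
        . '</tbody></table></div>';

print_r(extract_rows($sample)); // one row: 20:00, Team A, Team B
print_r(extract_rows(false));   // empty array
```

Two details of the query are worth knowing. The predicate @class!="..." is only true for rows that have a class attribute, so rows with no class at all are skipped. And the path requires a literal tbody element in the source, because the parser does not add one the way browsers do.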

I suggest you link to one of the billion duplicates ;-) in preference to answering — , Oct 20 '14 at 02:32
How do I get the HTML element from $table? I echo $table->item(0)->nodeValue, but I only get text. — Nam Nguyen, Oct 20 '14 at 03:00
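
On the last comment: nodeValue returns only the text content of a node. One way to get the markup is to serialize the node with saveHTML(), which accepts a node argument as of PHP 5.3.6 (on older versions, saveXML($node) is an alternative). A minimal sketch on made-up markup, assuming $table is the DOMNodeList returned by an XPath query:

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
// Hypothetical sample markup; the real page will differ.
$dom->loadHTML('<div id="zone-schedule-group-by-season"><table><tr><td>20:00</td></tr></table></div>');
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$table = $xpath->query('//div[@id="zone-schedule-group-by-season"]/table');

$node = $table->item(0);
echo $node->nodeValue, "\n";      // text only
echo $dom->saveHTML($node), "\n"; // markup of the element (needs PHP 5.3.6+)
```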
