PHP Preg Matching Capture Groups

Question

I can't seem to get the hang of regular expressions in PHP, specifically the group-capturing part.

I have a string that looks like this:

<table cellpadding="0" cellspacing="0" border="0" width="100%" class="List">

  <tr class='row_type_1'>
    <td class="time">
                      3:45 pm
    </td>
    <td class="name">
                      Kira
    </td>
  </tr>

  <tr class='row_type_2'>
    <td class="time">
                      4:00 pm
    </td>
    <td class="name">
                      Near
    </td>
  </tr>

</table>

And I want my array to look like this:

Array
(
   [0] => Array
   (
      [0] => 3:45 pm
      [1] => Kira
   )
   [1] => Array
   (
      [0] => 4:00 pm
      [1] => Near
   )
)

I want to use only preg_match, not explode, array_keys, or loops. It took me a while to figure out that I needed the /s modifier so that .* would match across line breaks; I'm really eager to see the pattern and the capture syntax.

Edit: The pattern would just need something like (row_type_1|row_type_2) to capture the only two types of row in the table that I want data from. For example, if row_type_2 were followed by row_type_3 and then by row_type_1, the row_type_3 row would be ignored and the array would only add the data from row_type_1, like what I have below.

Array
(
   [0] => Array
   (
      [0] => 3:45 pm
      [1] => Kira
   )
   [1] => Array
   (
      [0] => 4:00 pm
      [1] => Near
   )
   [2] => Array
   (
      [0] => 5:00 pm
      [1] => L
   )
)

Never process HTML with regular expressions, use a DOM parser instead. — erenon, Apr 18 '13 at 19:51
@SatbirKira: Because you won't get it right. And on the slightest change to your markup, your regex will be broken. Use an HTML parser. — Madara's Ghost, Apr 18 '13 at 20:02

score 1 · Accepted Answer · answered Apr 18 '13 at 19:58


I would use XPath and DOM to retrieve the information from the HTML. Using regexes for this can get messy as the HTML or the query gets more complex (as you are currently seeing). DOM and XPath are the standards for this, so why not use them?

Consider this code example:

// load the HTML into a DOM tree
$doc = new DOMDocument();
$doc->loadHTML($html);

// create an XPath selector
$selector = new DOMXPath($doc);

// grab results
$result = array();
// select every tr whose class starts with 'row_type_'
foreach ($selector->query('//tr[starts-with(@class, "row_type_")]') as $tr) {
    $record = array();
    // select the values of the inner td nodes, relative to the current tr
    foreach ($selector->query('td[@class="time"]', $tr) as $td) {
        $record[0] = trim($td->nodeValue);
    }
    foreach ($selector->query('td[@class="name"]', $tr) as $td) {
        $record[1] = trim($td->nodeValue);
    }
    $result[] = $record;
}

var_dump($result);
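
The query above selects every row whose class starts with "row_type_", which would also pick up the row_type_3 rows that the question's edit wants skipped. A minimal sketch of one way to narrow it, assuming each tr carries exactly one class name; the sample markup, including the row_type_3 row and its values, is made up for illustration:

```php
<?php
// Sample markup modeled on the question, plus a hypothetical row_type_3 row to skip.
$html = <<<'HTML'
<table class="List">
  <tr class='row_type_1'><td class="time"> 3:45 pm </td><td class="name"> Kira </td></tr>
  <tr class='row_type_2'><td class="time"> 4:00 pm </td><td class="name"> Near </td></tr>
  <tr class='row_type_3'><td class="time"> 4:30 pm </td><td class="name"> Mello </td></tr>
  <tr class='row_type_1'><td class="time"> 5:00 pm </td><td class="name"> L </td></tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = array();
// Match only the two wanted classes instead of using starts-with().
$rows = $xpath->query('//tr[@class="row_type_1" or @class="row_type_2"]');
foreach ($rows as $tr) {
    // Look up each cell relative to the current row.
    $time = $xpath->query('td[@class="time"]', $tr)->item(0);
    $name = $xpath->query('td[@class="name"]', $tr)->item(0);
    if ($time !== null && $name !== null) {
        $result[] = array(trim($time->nodeValue), trim($name->nodeValue));
    }
}

print_r($result);
```

Rows missing either cell are skipped here; whether that is the right behavior depends on the page being scraped.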


hek2mgl


Thanks for leading me in the right direction. I'm going to try using a library called 'PHP Simple HTML DOM Parser'. – Satbir Kira Apr 18 '13 at 20:10
If you like it, go ahead. It's much better than regexes for this purpose. :) I would prefer DOMXPath as it's a PHP built-in and therefore 1.) available out of the box and 2.) faster – hek2mgl Apr 18 '13 at 20:11
I can't say DOMXPath looks like something I would be comfortable going back to fix if the website I'm scraping changes its HTML. I have the luxury of my own server space, so I'm okay with external libraries. Funny thing is, one of my first projects in my university's C++/bash/shell course was to scrape their website using egrep. Obviously, I should have known it was not practical and only for the purposes of example. – Satbir Kira Apr 18 '13 at 20:23
You can tell them about DOM. Maybe you'll get some extra points ;) – hek2mgl Apr 18 '13 at 20:26
Took the class a year ago. Cheers. – Satbir Kira Apr 18 '13 at 20:45

score 0 · Answer 2 · answered Apr 18 '13 at 20:08

You should not parse HTML using regular expressions, for a few reasons. The biggest is that it is hard to account for malformed HTML, and the pattern can become large and slow when you try.

I would suggest looking into PHP's DOM parser or a PHP HTML parser.
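
As a rough illustration of the point about malformed markup, here is a sketch using PHP's built-in DOMDocument on deliberately sloppy input (unquoted attributes, missing closing td and table tags). The sample string is invented for this example, and libxml_use_internal_errors() is used only to keep the parser's warnings out of the output:

```php
<?php
// Invented, deliberately sloppy markup: unquoted attributes, no </td>, no </table>.
$html = "<table>"
      . "<tr class=row_type_1><td class=time>3:45 pm<td class=name>Kira</tr>"
      . "<tr class=row_type_2><td class=time>4:00 pm<td class=name>Near</tr>";

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The parser repairs the tree, so the cells can be read like well-formed ones.
$names = array();
foreach ($doc->getElementsByTagName('td') as $td) {
    if ($td->getAttribute('class') === 'name') {
        $names[] = trim($td->nodeValue);
    }
}

print_r($names);
```

A regex written for the tidy version of this table would need extra alternatives for each of these variations, which is where the size and speed problems come from.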

score 0 · Answer 3 · answered Apr 18 '13 at 20:42

Try this:

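function extractData($str){
    // Capture the time and name cells of each row.
    // U makes the quantifiers lazy; s lets . match newlines.
    preg_match_all("~<tr class='row_type_\d'>\s*<td class=\"time\">(.*)</td>\s*<td class=\"name\">(.*)</td>\s*</tr>~Usim", $str, $match);
    $dataset = array();
    // Drop the full-match entry so only the two capture groups remain.
    array_shift($match);
    // Transpose from one array per capture group to one array per row.
    foreach($match as $rowIndex => $rows){
        foreach ($rows as $index => $data) {
            $dataset[$index][$rowIndex] = trim($data);
        }
    }
    return $dataset;
}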

$myData = extractData($str);
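
The nested loops in extractData() are needed because preg_match_all() groups its results by capture group by default. A small sketch of the difference between that default and the PREG_SET_ORDER flag, using a made-up "time=name" string in place of the table:

```php
<?php
// Made-up input: two "time=name" pairs standing in for table rows.
$str = '3:45 pm=Kira; 4:00 pm=Near';
// Group 1 captures the time, group 2 captures the name.
$pattern = '/(\d+:\d+ [ap]m)=(\w+)/';

// Default (PREG_PATTERN_ORDER): $byGroup[1] holds every time, $byGroup[2] every name.
preg_match_all($pattern, $str, $byGroup);

// PREG_SET_ORDER: one sub-array per match, holding [full match, time, name].
preg_match_all($pattern, $str, $byMatch, PREG_SET_ORDER);

print_r($byGroup);
print_r($byMatch);
```

With PREG_SET_ORDER each match already has the per-row shape the question asks for, apart from the full match at index 0.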

score 0 · Answer 4 · answered Apr 19 '13 at 00:56


Hell's road is here:

$pattern = '`<tr .*?"time">\s++(.+?)\s++</td>.*?"name">\s++(.+?)\s++</td>`s';
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
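// drop the full match from each set, keeping only the two capture groups
foreach ($matches as &$match) {
    array_shift($match);
}
unset($match); // break the reference left over from the loop
echo '<pre>', print_r($matches, true), '</pre>';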
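
The pattern above matches any tr, so it does not yet apply the row filter from the question's edit. One possible adaptation, sketched under the assumption that the class attribute is always written with single quotes as in the sample: a non-capturing group (?:...) holds the alternation so the time and name stay in groups 1 and 2. The row_type_3 row and its values are invented for the example:

```php
<?php
// Sample rows modeled on the question, plus a hypothetical row_type_3 row to skip.
$subject = <<<'HTML'
<tr class='row_type_1'><td class="time"> 3:45 pm </td><td class="name"> Kira </td></tr>
<tr class='row_type_2'><td class="time"> 4:00 pm </td><td class="name"> Near </td></tr>
<tr class='row_type_3'><td class="time"> 4:30 pm </td><td class="name"> Mello </td></tr>
<tr class='row_type_1'><td class="time"> 5:00 pm </td><td class="name"> L </td></tr>
HTML;

// (?:...) groups the alternation without creating a numbered capture group.
// The s modifier lets . match newlines, so cells may span several lines.
$pattern = '~<tr class=\'(?:row_type_1|row_type_2)\'>'
         . '.*?"time">\s*(.+?)\s*</td>'
         . '.*?"name">\s*(.+?)\s*</td>~s';

preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

// Keep only the two captured values from each match.
$rows = array();
foreach ($matches as $match) {
    $rows[] = array($match[1], $match[2]);
}

print_r($rows);
```

This still carries the fragility the comments warn about: a wanted row without a time cell would make the lazy .*? run on into the following row.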


Casimir et Hippolyte
