1

I can't seem to get the hang of regular expressions in PHP, specifically the group-capturing part.

I have a string that looks like this:

<table cellpadding="0" cellspacing="0" border="0" width="100%" class="List">

  <tr class='row_type_1'>
    <td class="time">
                      3:45 pm
    </td>
    <td class="name">
                      Kira
    </td>
  </tr>

  <tr class='row_type_2'>
    <td class="time">
                      4:00 pm
    </td>
    <td class="name">
                      Near
    </td>
  </tr>

</table>

And I want my array to look like this:

Array
(
   [0] => Array
   (
      [0] => 3:45 pm
      [1] => Kira
   )
   [1] => Array
   (
      [0] => 4:00 pm
      [1] => Near
   )
)

I want to use only preg_match, and not explode, array_keys or loops. It took me a while to figure out that I needed the /s modifier for .* to match across line breaks; I'm really eager to see the pattern and the capture syntax.

Edit: The pattern would just need something like (row_type_1|row_type_2) to capture the only two types of row in the table I want data from. For example, if row_type_2 were followed by a row_type_3 row and then by another row_type_1 row, the row_type_3 row would be ignored and only the data from the row_type_1 row would be added to the array, as shown below.

Array
(
   [0] => Array
   (
      [0] => 3:45 pm
      [1] => Kira
   )
   [1] => Array
   (
      [0] => 4:00 pm
      [1] => Near
   )
   [2] => Array
   (
      [0] => 5:00 pm
      [1] => L
   )
)
Satbir Kira

4 Answers

1

I would use XPath and DOM to retrieve the information from the HTML. Using regexes for this gets messy as the HTML or the query becomes more complex (as you are currently seeing). DOM and XPath are the standard tools for this, so why not use them?

Consider this example:

// load the HTML into a DOM tree ($html holds the markup from the question)
$doc = new DOMDocument();
$doc->loadHTML($html);

// create XPath selector
$selector  = new DOMXPath($doc);

// grab results
$result = array();
// select every tr whose class starts with 'row_type_'
// (note: this also matches row_type_3; tighten the predicate to exclude it)
foreach($selector->query('//tr[starts-with(@class, "row_type_")]') as $tr) {
    $record = array();
    // select the value of the inner td nodes
    foreach($selector->query('td[@class="time"]', $tr) as $td) {
        $record[0]= trim($td->nodeValue);
    }
    foreach($selector->query('td[@class="name"]', $tr) as $td) {
        $record[1]= trim($td->nodeValue);
    }
    $result []= $record;
}

var_dump($result);
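
The question's edit asks that only row_type_1 and row_type_2 rows be collected. A minimal sketch of how the same DOM/XPath approach could be narrowed to those two classes follows. The sample `$html` (including the skipped row_type_3 row) is made up for illustration, and `evaluate()` with `string(...)` is used here in place of the inner loops.

```php
<?php
// Hypothetical sample input: the question's table plus a row_type_3 row that should be skipped.
$html = <<<HTML
<table cellpadding="0" cellspacing="0" border="0" width="100%" class="List">
  <tr class='row_type_1'>
    <td class="time">
                      3:45 pm
    </td>
    <td class="name">
                      Kira
    </td>
  </tr>
  <tr class='row_type_3'>
    <td class="time">
                      4:30 pm
    </td>
    <td class="name">
                      Skipped
    </td>
  </tr>
  <tr class='row_type_2'>
    <td class="time">
                      4:00 pm
    </td>
    <td class="name">
                      Near
    </td>
  </tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = array();
// Match only the two wanted classes instead of every class starting with "row_type_".
foreach ($xpath->query('//tr[@class="row_type_1" or @class="row_type_2"]') as $tr) {
    // string(...) returns the text of the first matching td relative to the current tr.
    $result[] = array(
        trim($xpath->evaluate('string(td[@class="time"])', $tr)),
        trim($xpath->evaluate('string(td[@class="name"])', $tr)),
    );
}

print_r($result);
```

The exact-match predicate assumes each tr carries a single class name, as in the question's markup; rows with several classes would need a different test.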
hek2mgl
  • Thanks for leading me in the right direction. I'm going to try using a library called 'PHP Simple HTML DOM Parser'. – Satbir Kira Apr 18 '13 at 20:10
  • If you like it, go ahead. It's much better than regexes for this purpose. :) I would prefer DOMXPath, as it's a PHP built-in and will therefore be 1) available out of the box and 2) faster. – hek2mgl Apr 18 '13 at 20:11
  • I can't say DOMXPath looks like something I'd be comfortable going back to fix if the website I'm scraping changes its HTML. I have the luxury of my own server space, so I'm okay with external libraries. Funny thing is, one of my first projects in my university's C++/bash/shell course was to scrape their website using egrep. Obviously, I should have known it was not practical and only for the purposes of example. – Satbir Kira Apr 18 '13 at 20:23
  • You can tell them about DOM. Maybe you'll get some extra points ;) – hek2mgl Apr 18 '13 at 20:26
  • Took the class a year ago. Cheers. – Satbir Kira Apr 18 '13 at 20:45
0

You should not parse HTML with regular expressions, for a few reasons. The biggest is that it is hard to account for HTML that is not well formed, and a pattern that tries to do so can become large and slow.

I would suggest looking into PHP's DOM parser or another PHP HTML parser.

jgetner
0

Try this:

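function extractData($str){
    // [12] limits the match to row_type_1 and row_type_2;
    // U makes the quantifiers lazy, s lets . match newlines, i ignores case
    preg_match_all("~<tr class='row_type_[12]'>\s*<td class=\"time\">(.*)</td>\s*<td class=\"name\">(.*)</td>\s*</tr>~Usi", $str, $match);
    $dataset = array();
    // drop the full-match column, leaving one array per capture group
    array_shift($match);
    // transpose: $match[group][row] becomes $dataset[row][group]
    foreach($match as $rowIndex => $rows){
        foreach ($rows as $index => $data) {
            $dataset[$index][$rowIndex] = trim($data);
        }
    }
    return $dataset;
}

// $str holds the HTML from the question
$myData = extractData($str);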
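
Note that preg_match stops at the first match, so collecting every row needs preg_match_all. As a rough sketch of the (row_type_1|row_type_2) alternation mentioned in the question, the variant below uses PREG_SET_ORDER so that no transposition is needed. The sample `$str` and the variable names `$pattern` and `$rows` are made up for illustration.

```php
<?php
// Hypothetical sample input: the row_type_3 row should be ignored.
$str = <<<HTML
<table class="List">
  <tr class='row_type_1'>
    <td class="time">
                      3:45 pm
    </td>
    <td class="name">
                      Kira
    </td>
  </tr>
  <tr class='row_type_3'>
    <td class="time">
                      4:30 pm
    </td>
    <td class="name">
                      Skipped
    </td>
  </tr>
  <tr class='row_type_2'>
    <td class="time">
                      4:00 pm
    </td>
    <td class="name">
                      Near
    </td>
  </tr>
</table>
HTML;

// (?:...) is a non-capturing group, so only the two cell contents become groups 1 and 2.
// The s modifier lets . match newlines; the lazy (.*?) plus \s* trims surrounding whitespace.
$pattern = '~<tr class=\'(?:row_type_1|row_type_2)\'>\s*'
         . '<td class="time">\s*(.*?)\s*</td>\s*'
         . '<td class="name">\s*(.*?)\s*</td>\s*</tr>~s';

// PREG_SET_ORDER groups the results per row: [full match, time, name].
preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);

// Drop the full match from each row; array_slice reindexes the rest to 0 and 1.
$rows = array_map(function ($m) {
    return array_slice($m, 1);
}, $matches);

print_r($rows);
```

This still depends on the exact attribute quoting and cell order shown in the question, so it is as fragile as any regex run against HTML.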
Rafael Freitas
0

The road to hell is here:

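// $subject holds the HTML from the question
$pattern = '`<tr .*?"time">\s++(.+?)\s++</td>.*?"name">\s++(.+?)\s++</td>`s';
preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
// drop the full match so only the two captured values remain in each row
foreach ($matches as &$match) {
    array_shift($match);
}
unset($match); // break the reference left over from the foreach
?><pre><?php print_r($matches);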
Casimir et Hippolyte