
I'm parsing information from a very long HTML table. Right now the code does this with the DOM classes (DOMDocument, DOMElement, etc.). I wanted to run a performance test comparing the current method against extracting the information with a regex, but I can't get the expression right.

An HTML row of the table looks like this:

<tr><td>   JON SMITH     </td><td> 2000-09-29 </td></tr>

And the expression I've been attempting looks something like this:

/(?:<td>([a-zA-Z\s]*?)<\/td><td>([0-9-\s]*?)<\/td>)/

The problem with the expression above is that it returns the entire row rather than just the contents of each cell. Ideally the preg_match_all results would be name, date, name, date, and so on.

Is this a reasonable thing to do, or should I stick with the DOM technique? If it is reasonable, could someone lend a hand with the regex?

Thanks!

EDIT: In case anyone stumbles upon this in the future, the RegEx solution has WAY better performance than using the DOM classes; in my situation it's the difference between seconds and minutes.
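A rough way to reproduce that kind of comparison is sketched below. Everything in it is an assumption for illustration: the synthetic 2,000-row table, the helper names parseWithDom and parseWithRegex, and the use of microtime() as a coarse timer. It also assumes the dom extension is enabled. Real timings depend heavily on the actual markup and PHP version, so treat the numbers as indicative only.

```php
<?php
// Synthetic input: row count and contents are assumptions, not the real data.
$row  = '<tr><td>   JON SMITH     </td><td> 2000-09-29 </td></tr>';
$html = '<table>' . str_repeat($row, 2000) . '</table>';

// DOM approach: load the markup and read the text of both cells in every row.
function parseWithDom(string $html): array
{
    libxml_use_internal_errors(true); // keep libxml parse warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $out = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = $tr->getElementsByTagName('td');
        $out[] = [
            trim($cells->item(0)->textContent),
            trim($cells->item(1)->textContent),
        ];
    }
    return $out;
}

// Regex approach: capture both cells of each row in a single pass.
function parseWithRegex(string $html): array
{
    $pattern = '/<td>([a-zA-Z\s]*?)<\/td><td>([0-9\s-]*?)<\/td>/';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    $out = [];
    foreach ($matches as $m) {
        $out[] = [trim($m[1]), trim($m[2])];
    }
    return $out;
}

$t0 = microtime(true);
$domRows = parseWithDom($html);
$t1 = microtime(true);
$regexRows = parseWithRegex($html);
$t2 = microtime(true);

printf("DOM:   %.4f s\nRegex: %.4f s\n", $t1 - $t0, $t2 - $t1);
```

Checking that both functions return the same rows before comparing times guards against benchmarking a regex that silently skips data.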

Jordan N
  • Please refrain from parsing HTML with RegEx as it will [drive you į̷̷͚̤̤̖̱̦͍͗̒̈̅̄̎n̨͖͓̹͍͎͔͈̝̲͐ͪ͛̃̄͛ṣ̷̵̞̦ͤ̅̉̋ͪ͑͛ͥ͜a̷̘͖̮͔͎͛̇̏̒͆̆͘n͇͔̤̼͙̩͖̭ͤ͋̉͌͟eͥ͒͆ͧͨ̽͞҉̹͍̳̻͢](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use an [HTML parser](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) instead. – Madara's Ghost Aug 02 '12 at 13:39
  • Thanks for the tip. As mentioned I'm currently parsing using the PHP DOM classes; in a situation like this if RegEx offers better performance I'd be willing to try it. Normally I'd 100% agree with you and stick to a proper HTML parsing option. – Jordan N Aug 02 '12 at 13:48

2 Answers


My solution:

Step 1. Match each <table>...</table>:
/<table[^>]*+>([^<]*+(?:(?!<\/?+table)<[^<]*+)*+)<\/table>/i

Step 2. Match every <tr>...</tr> within group 1 of step 1:
/<tr[^>]*+>([^<]*+(?:(?!<\/?+tr)<[^<]*+)*+)<\/tr>/ix

Step 3. Extract the data from every <td>...</td> within group 1 of step 2:
/<td[^>]*+>([^<]*+(?:(?!<\/?+td)<[^<]*+)*+)<\/td>/ix

These admittedly ugly patterns are based on Mastering Regular Expressions, 3rd edition.
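As a minimal sketch, the three steps could be chained like this. The two-row sample table and the tagPattern() helper are hypothetical and only there to avoid repeating the pattern three times; the pattern itself is the one given above, with the tag name substituted in.

```php
<?php
// Hypothetical sample input; the real table markup is not shown in the thread.
$html = '<table id="people">'
      . '<tr><td>   JON SMITH     </td><td> 2000-09-29 </td></tr>'
      . '<tr><td>   JACK BOLD     </td><td> 2000-10-20 </td></tr>'
      . '</table>';

// Builds the unrolled-loop pattern for one tag name (illustrative helper).
function tagPattern(string $tag): string
{
    return '/<' . $tag . '[^>]*+>([^<]*+(?:(?!<\/?+' . $tag . ')<[^<]*+)*+)<\/' . $tag . '>/i';
}

$rows = [];

// Step 1: isolate the body of each table.
preg_match_all(tagPattern('table'), $html, $tables);
foreach ($tables[1] as $tableBody) {
    // Step 2: isolate the body of each row inside the table.
    preg_match_all(tagPattern('tr'), $tableBody, $trs);
    foreach ($trs[1] as $rowBody) {
        // Step 3: capture every cell in the row and trim the padding.
        preg_match_all(tagPattern('td'), $rowBody, $tds);
        $rows[] = array_map('trim', $tds[1]);
    }
}

print_r($rows);
```

Each row ends up as a plain array of trimmed cell values, so a name/date pair is simply $rows[$i][0] and $rows[$i][1].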

sample code:

    <?php
    $foo = '<tr><td>   JON SMITH     </td><td> 2000-09-29 </td></tr>';
    $pattern = '/<td[^>]*+>([^<]*+(?:(?!<\/?+td)<[^<]*+)*+)<\/td>/ix';
    if (preg_match_all($pattern, $foo, $matches) > 0) {
        // $matches[0] holds the full <td>...</td> matches.
        foreach ($matches[0] as $cell) {
            printf("%s\n", $cell);
        }
        // $matches[1] holds the cell contents; trim() strips the padding.
        foreach ($matches[1] as $text) {
            printf("%s\n", trim($text));
        }
    }
    ?>

output:

<td>   JON SMITH     </td>
<td> 2000-09-29 </td>
JON SMITH
2000-09-29
godspeedlee
  • I ran a test and while it certainly does work (and works well), it's a little less elegant and offers slightly poorer performance than the answer above. Thanks anyway :) – Jordan N Aug 02 '12 at 17:44
  • Really? I tested my sample code again with RegexBuddy: it finds the match in 18 steps, whereas the other solution needs 118 steps :P. My pattern combines the loop-unrolling technique with possessive quantifiers, so it should be the fastest solution. – godspeedlee Aug 03 '12 at 01:55

Use preg_match_all(), passing the array to fill as the third parameter and PREG_SET_ORDER as the fourth, so that each match is returned together with its own captured groups.

preg_match_all("/(?:<td>([a-zA-Z\s]*?)<\/td><td>([0-9-\s]*?)<\/td>)/", $html, $matches, PREG_SET_ORDER);

The resulting array should look like this:

$matches => array(
   [0] => array(
      [0] => '<td>   JON SMITH     </td><td> 2000-09-29 </td>',
      [1] => '   JON SMITH     ',
      [2] => ' 2000-09-29 '
   ),
   [1] => array(
      [0] => '<td>   JACK BOLD     </td><td> 2000-10-20 </td>',
      [1] => '   JACK BOLD     ',
      [2] => ' 2000-10-20 '
   ),
   ...
);

Please refer to preg_match_all() documentation.
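To get from that structure to clean name/date pairs, something along these lines should work. This is only a sketch: the two-row input is made up to match the format in the question, the 'name' and 'date' keys are illustrative, and the pattern is the question's expression with the redundant outer group dropped and the hyphen moved to the end of the character class so that it is unambiguously literal.

```php
<?php
// Hypothetical input: two rows in the format shown in the question.
$html = '<tr><td>   JON SMITH     </td><td> 2000-09-29 </td></tr>'
      . '<tr><td>   JACK BOLD     </td><td> 2000-10-20 </td></tr>';

$pattern = '/<td>([a-zA-Z\s]*?)<\/td><td>([0-9\s-]*?)<\/td>/';

$people = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[0] is the full match; $m[1] and $m[2] are the two captured cells.
        $people[] = ['name' => trim($m[1]), 'date' => trim($m[2])];
    }
}

print_r($people);
```

With PREG_SET_ORDER each element of $matches is one row, so there is no need to index two parallel arrays as with the default PREG_PATTERN_ORDER.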

Oussama Jilal