1

Hey guys, a cURL function returns a string $widget that contains regular HTML: two divs, where the first div holds a table with various values inside <td>s.

I wonder what's the easiest and best way to extract only the values inside the <td>s, so I end up with the plain values without the remaining HTML.

Any idea what the pattern for preg_match should look like?

Thank you.

reko_t
matt

4 Answers

2

Regex is not a suitable solution. You're better off loading it up in a DOMDocument and parsing it.

Brad Christie
1

You shouldn't use regexps to parse HTML. Use DOM and XPath instead. Here's an example:

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td');
$result = array();
foreach ($nodes as $node) {
    $result[] = $node->nodeValue;
}
// $result holds the values of the tds
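
One practical note, since the markup comes from cURL and may not be clean: loadHTML() tends to raise PHP warnings on malformed input. Here is a minimal sketch of the same idea with libxml's error collection switched on, using getElementsByTagName() instead of XPath. The $widget contents are a made-up stand-in for the real response, not the asker's actual data.

```php
<?php
// Hypothetical fragment standing in for the $widget string returned by cURL.
$widget = '<div><table><tr><td>foo</td><td>bar</td></tr></table></div><div>other</div>';

// Collect libxml parse problems internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($widget);

// Discard whatever was collected and restore the previous error mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$values = array();
foreach ($doc->getElementsByTagName('td') as $td) {
    // trim() drops the whitespace that indented markup leaves around the text.
    $values[] = trim($td->textContent);
}

print_r($values);
```

If the parse problems matter to you, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.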
reko_t
1

You're better off using a DOM parser for that task:

$html = <<<HTML
<div>
<table>
   <tr>
      <td>foo</td>
      <td>bar</td>
   </tr>
   <tr>
      <td>hello</td>
      <td>world</td>
   </tr>
</table>
</div>
<div>
   Something irrelevant
</div>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$tds = $xpath->query('//div/table/tr/td');
foreach ($tds as $cell) {
    echo "{$cell->textContent}\n";
}

Would output:

foo
bar
hello
world
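
Since the question says only the first div holds the table, the query could also be scoped to that div. This is just a sketch under that assumption: the sample markup (including the <tbody> and the second table) is invented to show why a looser path can help, because '//div/table/tr/td' would not match cells sitting under a <tbody>.

```php
<?php
// Hypothetical markup: the wanted table sits in the first div,
// and a second div contains a table that should be ignored.
$html = '<div><table><tbody><tr><td> foo </td><td>bar</td></tr></tbody></table></div>'
      . '<div><table><tr><td>ignored</td></tr></table></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// (//div)[1] picks the first div in document order; //td then matches
// cells at any depth below it, so an optional <tbody> does not matter.
$cells = array();
foreach ($xpath->query('(//div)[1]//td') as $cell) {
    $cells[] = trim($cell->textContent);
}

print_r($cells);
```

If the divs carry an id or class in the real markup, selecting on that attribute would likely be more robust than relying on position.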
netcoder
0

Only if you have very limited, well-defined HTML can you expect to parse it with regular expressions. The highest ranked SO answer of all time addresses this issue.

He comes ...
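
For the "very limited, well-defined HTML" case, a pattern along these lines might be enough. Treat it as a sketch only: it assumes flat, non-nested <td> cells, and the $widget sample is made up. It will break on nested tables, HTML comments, or a '>' inside an attribute value, which is exactly why the DOM answers are the safer route.

```php
<?php
// Hypothetical, deliberately simple markup.
$widget = '<div><table><tr><td class="a">foo</td><td>bar <b>baz</b></td></tr></table></div>';

// Non-greedy capture between an opening <td ...> and the next </td>.
// Flags: i = case-insensitive, s = let "." match newlines too.
preg_match_all('~<td\b[^>]*>(.*?)</td>~is', $widget, $matches);

// strip_tags() removes inline markup left inside a cell; trim() drops padding.
$values = array_map('trim', array_map('strip_tags', $matches[1]));

print_r($values);
```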

Community
Peter Rowell