How to properly get all HTML elements inside a table using regex in PHP?

Question

I am using regex101.com to test my regex, but I can't get the output I need. The sample I made can be viewed here: https://regex101.com/r/YQTW4c/2

So my regex is this:

<table class=\"datatable\s\">(.*?)<\/table>

and the sample string:

<table class="datatable"><thead><tr><tr></thead></table>

I want to get everything inside the table with class datatable, which in this example is <thead><tr><tr></thead>.

Am I missing something here? Any help would be much appreciated.

You **cannot** _properly_ get HTML content using _regex_, because that’s the wrong tool here. Any reason you can’t parse the HTML? — Sebastian Simon, Mar 22 '18 at 03:00
Do not do this. https://stackoverflow.com/a/1732454/486035 . Use a proper parsing tool. E.G. http://php.net/manual/en/book.dom.php — phemmer, Mar 22 '18 at 03:41

score 1 · Accepted Answer · answered Mar 22 '18 at 03:59

Your problem (as described by regex101) is that

"\s matches any whitespace character (equal to [\r\n\t\f\v ])"

So your regex requires a whitespace character between the e in datatable and the ", which doesn't exist. If you want to allow for zero or more spaces between that e and the ", you need to change your regex to

<table class=\"datatable\s*\">(.*?)<\/table>

Note that escaping " in a regex is not necessary (I presume the backslashes are there because your regex is inside a quoted string).
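For reference, here is a minimal sketch of the corrected pattern run through PHP's preg_match() against the sample string from the question. The ~ delimiter and the s modifier are my own choices, not part of the original regex: ~ avoids having to escape the / in </table>, and s lets . match newlines in case the table spans several lines.

```php
<?php
// Sample string from the question.
$html = '<table class="datatable"><thead><tr><tr></thead></table>';

// \s* allows zero or more whitespace characters before the closing quote.
// Assumption: "~" as delimiter and the "s" modifier are additions for this sketch.
$pattern = '~<table class="datatable\s*">(.*?)</table>~s';

// preg_match() returns 1 on a match; group 1 holds the table's inner markup.
$inner = preg_match($pattern, $html, $m) === 1 ? $m[1] : null;

echo $inner, PHP_EOL; // prints <thead><tr><tr></thead>
```

The captured text is the raw markup exactly as written, so the unclosed <tr> tags come back unclosed.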

What others have been saying about not using regex to parse HTML is very true. For example, this regex will fail if two tables with class "datatable" are nested, because the lazy (.*?) stops at the first </table> it finds. It will also fail if the table is declared with additional classes or attributes. It is far better to use the PHP tools built for the purpose.
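Both failure modes are easy to reproduce. The sketch below uses two made-up inputs (the "striped" class and the nested markup are hypothetical, chosen only to trigger each case) with the same pattern, written with a ~ delimiter and the s modifier for convenience.

```php
<?php
// Same pattern as above; "~" delimiter and "s" modifier are conveniences for this sketch.
$pattern = '~<table class="datatable\s*">(.*?)</table>~s';

// Case 1 (hypothetical input): an extra class breaks the literal class="datatable" match.
$extraClass = '<table class="datatable striped"><tr><td>x</td></tr></table>';
$matchedExtra = preg_match($pattern, $extraClass);
var_dump($matchedExtra); // int(0)

// Case 2 (hypothetical input): nested tables. The lazy (.*?) ends at the
// first </table>, which belongs to the inner table, so the capture is cut short.
$nested = '<table class="datatable"><tr><td>'
        . '<table class="datatable"><tr><td>inner</td></tr></table>'
        . '</td></tr></table>';
preg_match($pattern, $nested, $m);
$captured = $m[1];
echo $captured, PHP_EOL;
// prints <tr><td><table class="datatable"><tr><td>inner</td></tr>
```

In the second case the result is not even well-formed: it contains the opening tag of the inner table but not its closing tag, and the outer table's remaining cells are lost.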

score 1 · Answer 2 · answered Mar 22 '18 at 04:54

Very, very often volunteers urge developers to use DOMDocument, but very, very seldom does anyone actually code up a working solution, so I will offer one that uses DOMDocument and XPath.

The XPath query targets the table tag by its class and selects its child elements; item(0) is the first of those children. saveHTML() then serializes that node back into an HTML string.

Code:

$html = <<<HTML
<table class="datatable"><thead><tr><tr></thead></table>
HTML;

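$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select the child elements of any table whose class attribute contains "datatable";
// item(0) is the first of them (here, the <thead>).
$node = $xpath->evaluate("//table[contains(@class, 'datatable')]/*")->item(0);
// Serialize only that node back to HTML.
echo $dom->saveHTML($node);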

Output:

<thead>
<tr></tr>
<tr></tr>
</thead>

*Notice that the output DOM is "corrected": the parser adds the missing closing </tr> tags.
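If the table has more than one child and you want all of them, one option is to loop over the table's childNodes and concatenate the serialized pieces. This is only a sketch under a few assumptions of mine: the <tbody> in the sample is added to give the table a second child, the concat()/normalize-space() expression is a common XPath 1.0 idiom for matching a whole class name rather than a substring, and libxml_use_internal_errors() is there to keep parser warnings about sloppy markup out of the output.

```php
<?php
// Question's sample plus a hypothetical <tbody> so the table has two children.
$html = '<table class="datatable"><thead><tr><tr></thead>'
      . '<tbody><tr><td>1</td></tr></tbody></table>';

// Collect libxml warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match "datatable" as a whole class token, so "mydatatable" would not qualify.
$query = "//table[contains(concat(' ', normalize-space(@class), ' '), ' datatable ')]";
$table = $xpath->query($query)->item(0);

// Serialize every child of the table and join the results.
$inner = '';
if ($table !== null) {
    foreach ($table->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}

echo $inner, PHP_EOL;
```

As with the single-node version, the result is the parser's repaired markup rather than the original text, so the exact whitespace and closing tags may differ from the input.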
