I have a string that can contain arbitrary HTML elements. For example:

$htmlString = '<p>Test</p>
    <h2>Test2</h2>
    <table>
        <thead>
            <tr>
                <td>Header 1</td>
                <td>Header 2</td>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Col 1</td>
                <td>Col 2</td>
            </tr>
        </tbody>
    </table>
    <span>Test span </span>
';

As you can see, the string contains <p>, <h2>, <table>, and <span> tags, and it could also contain other HTML tags.

Is there a way to remove every element from the string except the <table>? You can assume that the table contains no tags other than thead, tbody, tr, and td.

jthinam
  • Hi jthinam, have a look at that question [How do you parse and process HTML/XML in PHP?](https://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) The accepted and probably the second most upvoted answer could help you to achieve your goal. – Uwe Aug 10 '22 at 11:45

2 Answers

2

This will probably be closed as a duplicate, but before that happens here’s some quick code to help you with your specific HTML. Instead of “removing” everything except your target text, we are “extracting” our target text. The code itself is pretty straightforward so I didn’t see a need to comment things as much as I usually do.

<?php
$htmlString = '<p>Test</p>
    <h2>Test2</h2>
    <table>
        <thead>
            <tr>
                <td>Header 1</td>
                <td>Header 2</td>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Col 1</td>
                <td>Col 2</td>
            </tr>
        </tbody>
    </table>
    <span>Test span </span>
';
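$dom = new DOMDocument();
// preserveWhiteSpace is a load-time setting, so it belongs before loadHTML() (true is already the default).
$dom->preserveWhiteSpace = true;
// These flags stop libxml from adding <html>/<body> wrappers and a doctype.
$dom->loadHTML($htmlString, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Extract every <table> and print its outer HTML.
$tables = $dom->getElementsByTagName('table');
foreach ($tables as $table) {
    var_dump($dom->saveHTML($table));
}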

Demo here: https://3v4l.org/YjkdT
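If you need this in more than one place, the same extraction could be wrapped in a small function. This is only a sketch: `extractTables` is a hypothetical name that is not part of the code above, and the call to `libxml_use_internal_errors()` is an added assumption that you want parser warnings on imperfect HTML suppressed rather than printed.

```php
<?php
// Hypothetical helper: returns the outer HTML of every <table> found in $html.
function extractTables(string $html): array
{
    // loadHTML() rejects an empty string on PHP 8, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument();
    // Assumption: collect libxml warnings silently instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tables = [];
    foreach ($dom->getElementsByTagName('table') as $table) {
        // Passing the node to saveHTML() serializes just that element.
        $tables[] = $dom->saveHTML($table);
    }
    return $tables;
}

$tables = extractTables('<p>Test</p><table><tr><td>Col 1</td></tr></table><span>Test span</span>');
echo implode("\n", $tables), "\n";
```

Returning an array keeps the behaviour of the loop above: a string with several tables gives several entries, and a string without any gives an empty array.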

Chris Haas
  • Thanks, I think this is the safest and cleanest way. – jthinam Aug 10 '22 at 11:57
  • Question: When you `echo $dom->saveHTML()` after `loadHTML();` the HTML looks malformed. Do you have any idea, why? It looks like `LIBXML_HTML_NOIMPLIED` is causing that. – Markus Zeller Aug 10 '22 at 12:51
  • I'm not sure if I just haven't had enough coffee yet, but I'm not seeing anything malformed. Can you elaborate? The only arguable difference that I see is that the `<table>` doesn't have leading spaces, which is to be expected, since the spaces are before the table and therefore not part of it.
    – Chris Haas Aug 10 '22 at 13:17
-3

Maybe this is the solution you are looking for: https://www.php.net/manual/en/function.strip-tags.php

<?php
$stripped = strip_tags($htmlString, '<table>');
?>
Schlotter
  • I tried this, `strip_tags($htmlString, '<table>')`, but it only removes the tags other than the table tags; the text inside the removed tags is still there.
    – jthinam Aug 10 '22 at 11:49
  • You are getting downvoted because it's a short answer that is mainly a link. If you described your solution, the downvoting might stop. – Rohit Gupta Aug 10 '22 at 13:48
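To illustrate the behaviour described in the comments, here is a minimal sketch using a shortened version of the question's string (the shortened input is an assumption made for brevity). `strip_tags()` drops the disallowed tags themselves but keeps the text they contained, so it does not isolate the table.

```php
<?php
$htmlString = '<p>Test</p><table><tr><td>Col 1</td></tr></table><span>Test span</span>';

// Only <table> is allowed, so the <p>, <span>, <tr> and <td> tags are dropped,
// but the text inside them is left in place.
$stripped = strip_tags($htmlString, '<table>');

echo $stripped, "\n"; // Test<table>Col 1</table>Test span
```

The words "Test" and "Test span" survive outside the table, and the row and cell tags inside it are gone as well, since only `<table>` was whitelisted.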