Is there a way, using Regex or other PHP functions, to extract all html text to a PHP array?

For example, I have this piece of code:
Example 1:

<div class="user" ><?= $username ?></div>
<table>
    <tr>
        <td>Cell 1</td>
        <td>Cell 2</td>
    </tr>
</table>
<span>Lorem ipsum <b>dolor</b> sit amet</span>
Lorem ipsum dolor sit amet <a href="www.example.com">Lorem</a>
Dolor site amet at date <?php echo date('Y-m-d'); ?> example

I need a way to process it so that the output is an array like this:

Array(
    [0] => "Cell 1"
    [1] => "Cell 2"
    [2] => "Lorem ipsum <b>dolor</b> sit amet"
    [3] => "Lorem ipsum dolor sit amet "
    [4] => "Lorem"
    [5] => "Dolor site amet at date "
    [6] => " example"
)

Text-decoration tags such as <u>, <b>, and <i> should be an exception: they should stay inside the extracted string.

I tried using strip_tags with those tags allowed, but the result is inconsistent, and it often returns only the first string and ignores the rest.


UPDATE
This regex (?<=>)\s*(?=<)|(?<=>)\n*([^<]+) is almost what I asked for; only a few cases still slip through.

When it finds script tags, it returns what is between them:

<script type="text/javascript">
    tipoProd = 'Squares';
</script>

Returns:

tipoProd = 'Squares';

And when it finds the line below:

<div class="content section" style="padding: 40px 0px; display: <?= $dev?'none':'block'?>; text-align:center" id="selectOptions">

It returns everything after the PHP closing tag:

; text-align:center" id="selectOptions">

How can I make the regex skip these cases?

  • Use `DomDocument` or `Simple PHP DOM Parser`. – Barmar Dec 11 '14 at 02:41
  • @Barmar: You mean Simple HTML DOM? I wouldn't recommend it unless you don't have the DOM extension available for some freakish reason. `DomDocument`, plus maybe `DomXPath`, is the way to go. – cHao Dec 11 '14 at 02:43
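
The DOM route suggested in these comments could look roughly like the sketch below. It is only an illustration: the function name `extractHtmlText`, the `<!--php-->` placeholder, and the wrapping `<div>` are assumptions that do not come from the thread. PHP blocks are swapped for an HTML comment before parsing so that the text on either side stays in separate nodes, and text inside `<script>` and `<style>` is skipped.

```php
<?php
// Hypothetical helper (not from the thread): collect text nodes with DOMDocument + DOMXPath.
function extractHtmlText(string $source): array
{
    // Assumption: swap each PHP block for an HTML comment so the parser only sees markup,
    // while the text before and after the block stays in separate nodes.
    $html = preg_replace('~<\?(?:php|=).*?\?>~s', '<!--php-->', $source);

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every text node except those inside <script> or <style>.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $texts = [];
    foreach ($nodes as $node) {
        if (trim($node->nodeValue) !== '') { // skip whitespace-only nodes
            $texts[] = $node->nodeValue;
        }
    }
    return $texts;
}
```

Note that this version returns the text inside `<b>`, `<i>`, or `<u>` as separate entries; keeping those tags inline, as the question asks, would need extra handling (for example, serializing the parent element instead of reading its text nodes).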

1 Answer

(?<=>)\s*(?=<)|(?<=>)\n*([^<]+)

Try this. Grab the match or the capture group. See the demo:

https://regex101.com/r/qB0jV1/6

// Single-quoted pattern with "/" delimiters; the text is captured in group 1.
$re = '/(?<=>)\s*(?=<)|(?<=>)\n*([^<]+)/i';
$str = "<div class=\"user\" ><?= \$username ?></div>\n<table>\n <tr>\n <td>Cell 1</td>\n <td>Cell 2</td>\n </tr>\n</table>\n<span>Lorem ipsum <b>dolor</b> sit amet</span>\nLorem ipsum dolor sit amet <a href=\"www.example.com\">Lorem</a>\nDolor site amet at date <?php echo date('Y-m-d'); ?> example";

preg_match_all($re, $str, $matches);
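
The follow-up cases from the question (script contents and attribute text after an inline PHP tag) are not handled by this pattern on its own. One possible workaround, sketched here under assumptions, is to strip PHP blocks and `<script>`/`<style>` elements before matching, then keep only the non-empty values of capture group 1. The function name `extractTextWithRegex` and the two cleanup patterns are hypothetical additions, not part of the original answer.

```php
<?php
// Hypothetical helper: pre-cleans the source, then applies the pattern from the answer.
function extractTextWithRegex(string $source): array
{
    // Assumption: remove PHP blocks first, so the ">" of a closing PHP tag
    // inside an attribute can no longer satisfy the lookbehind.
    $clean = preg_replace('~<\?(?:php|=).*?\?>~s', '', $source);
    // Assumption: drop <script> and <style> elements together with their contents.
    $clean = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', '', $clean);

    preg_match_all('/(?<=>)\s*(?=<)|(?<=>)\n*([^<]+)/', $clean, $matches);

    // Group 1 holds the text; matches of the first alternative leave it empty.
    $texts = [];
    foreach ($matches[1] as $text) {
        if (trim($text) !== '') {
            $texts[] = $text;
        }
    }
    return $texts;
}
```

Because the PHP blocks are removed rather than replaced, text on both sides of an inline PHP tag ends up in a single entry, which differs from the array shown in the question.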