What I'm trying to do is take a block of html, strip out all the html tags, and put each line of text into a PHP array.

I'm just trying it with one block to test (hence the WHERE ID = '2409' in my MySQL query).

The HTML portion for ID 2409 looks like this:

<table class="description-table">
<tbody>
<tr><td>Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9</td></tr>
<tr><td>Description</td></tr>
<tr><td></td>
<td><br>
<br><p></p><p></p>
<strong><br></strong> <strong><br></strong> <strong>Donec Rem </strong><br>
<br>
<strong>Animam Urgebat<br>
<br></strong> <strong><br>
<br>
Rerum Sed 8613 - 3669 8358 & 6699<br>
<br>
1.mE (magNA) QUO Ad Nominum Statum Massa<br>
ab SEM Autem Reddet Habitu Sit<br>
<br></strong> <strong> PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM</strong> <strong><br></strong> <strong><br></strong> <strong>Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!</strong><strong><br></strong><strong>                                                           ad Quisque Modeste</strong><strong>                                                           ac Rem Wisi</strong><strong>                                                           ex Hac Congue mus Leo</strong><strong>                                                           ab 7/92" Alias</strong><strong>                                                           ad 2/73" Adverso & Erat</strong><strong>                                                           me Personom Eget</strong><strong>                                                           ad Viribus Fuga Fuga</strong><strong>                                                           ab Louor-Sit Molles</strong><strong class="c2">                                                           3x Block-Off Plates</strong><strong class="c2">                                                           ad Facunda</strong><strong class="c2">                                                           ab Personas Diam<br>
NUNC<br>
ex Teniet te Palmam Eaque<br>
me Teniet in Versus Urna<br></strong> <strong><br></strong><br>
<strong class="c3">**CONDEMNENDUS REM CUM MAGNORUM**</strong><strong></strong><br>
</td>
</table>

And here's my PHP script designed to parse this

//connect to mysqli

$results = $mysqli->query("SELECT ID, post_content
FROM wp_posts
WHERE ID = '2409';");

while($row = $results->fetch_array()) {
    $htmlarray2 = preg_split('/<.+?>/', $row['post_content']);
    $htmlarray = array_values(array_filter(array_map('trim', $htmlarray2)));
    echo '<pre>';
        print_r($htmlarray);
    echo '</pre>';
    . . . 
}

This produces an output like this

Array
(
[0] => Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9
[1] => Donec Rem 
[2] => Animam Urgebat
[3] => Rerum Sed 8613 - 3669 8358 & 6699
[4] => 1.mE (magNA) QUO Ad Nominum Statum Massa
[5] => ab SEM Autem Reddet Habitu Sit
[6] =>  PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM
[7] => Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!
[8] =>                                                            ad Quisque Modeste
[9] =>                                                            ac Rem Wisi
[10] =>                                                            ex Hac Congue mus Leo
[11] =>                                                            ab 7/92" Alias
[12] =>                                                            ad 2/73" Adverso & Erat
[13] =>                                                            me Personom Eget
[14] =>                                                            ad Viribus Fuga Fuga
[15] =>                                                            ea Totam Poenam
[16] =>                                                            ab Louor-Sit Molles
[17] =>                                                            ad Facunda
[18] =>                                                            ab Personas Diam
[19] => NUNC
[20] => ex Teniet te Palmam Eaque
[21] => me Teniet in Versus Urna
[22] => **CONDEMNENDUS REM CUM MAGNORUM**
)

This is okay, but now I'm having an issue removing the white-space before and after the strings in the array.

Let's take an example for the node 8 in the array

. . .
$arrayvalue = $htmlarray2['8'];

which echoes like this

                                                       ad Quisque Modeste

Now, what I'm trying to do is obviously trim each element of the array, but for testing I'm just working with this one variable $arrayvalue.

My issue is that trim() isn't working with this MySQL-fetched variable. Adding trim($arrayvalue); has no effect, and the value echoes out the same way as above.

I know this has something to do with fetching the array via my query, because if I test the same kind of variable in its own PHP script

$string = '                                                            ad Quisque Modeste  ';
echo trim($string);

It works fine: echo simply outputs ad Quisque Modeste, with no white-space before or after the string.

Why isn't trim() working in my while loop? What's the trick to trimming the leading and trailing white-spaces from the elements?

Edit: Here's my full while loop as requested. It's a bit different than the above example (I've been doing a lot of modifications trying to solve this myself so it's constantly changing), but here is what I have right now in full:

while($row = $results->fetch_array()) {
    $id = $row['ID'];
    echo 'ID: ' . $id;
    echo '<br  />';

    //replace &nbsp; with white space
    $converted = strtr($row['post_content'],array_flip(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES))); 
    trim($converted, chr(0xC2).chr(0xA0));

    //remove html elements
    $htmlarray = preg_split('/<.+?>/', $converted);

    // remove empty array elements and re-index array
    $htmlarray2 = array_values(array_filter(array_map('trim', $htmlarray)));

    // test by getting single value from array
    $arrayvalue = $htmlarray2['9'];

    // my attempt to trim string in while loop
    trim($arrayvalue);

    // doesn't trim
    echo '<hr>' . $arrayvalue . '<hr>';

    // put this here so I can see the full array
    echo '<pre>';
        print_r($htmlarray2);
    echo '</pre>';
}

As requested, here are the results of var_export($row['post_content']);

'<table class="product-description-table">
<tbody>
<tr>
<td class="item" colspan="3">Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9</td>
</tr>
<tr>
<td class="title" colspan="3"></td>
</tr>
<tr>
<td class="content"><br>
<br>
<p class="c1"></p>
<p class="c1"></p>
<strong><br></strong> <strong><br></strong> <strong>Donec Rem&nbsp;</strong><br>
<br>
<strong>Animam Urgebat<br>
<br></strong> <strong><br>
<br>
Rerum Sed 8613 - 3669 8358 & 6699<br>
<br>
1.mE (magNA) QUO Ad Nominum Statum Massa<br>
ab SEM Autem Reddet Habitu Sit<br>
<br></strong> <strong>&nbsp;PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM</strong> <strong><br></strong> <strong><br></strong> <strong>Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!</strong><strong><br></strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad Quisque Modeste</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ac Rem Wisi</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ex Hac Congue mus Leo</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ab 7/92" Alias</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad 2/73" Adverso & Erat</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;me Personom Eget</strong><strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad Viribus Fuga Fuga</strong><strong>&nbsp; 
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ab Louor-Sit Molles</strong><strong class="c2">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;3x Block-Off Plates</strong><strong class="c2">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ad Facunda</strong><strong class="c2">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ab Personas Diam<br>
NUNC<br>
ex Teniet te Palmam Eaque<br>
me Teniet in Versus Urna<br></strong> <strong><br></strong><br>
<strong class="c3">**CONDEMNENDUS REM CUM MAGNORUM**</strong><strong>&nbsp;</strong><br></td>
<td class="product-content-border"></td>
</tr>
<tr>
<td class="gallery" colspan="3">
<table>
<tbody>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td class="spacer" colspan="3"></td>
</tr>
<tr>
<td class="product-content-border"></td>
</tr>
</tbody>
</table>
<br>
<br>
<br>
<p class="c4"></p>'

Final Edit :):

Posted a solution below. Not going to accept my own answer.

If anyone familiar with regex can help explain the tribulation behind all this, and why the pattern /[\s]+/mu (or rather $clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray);) fixed this issue, I'll gladly accept that as a proper answer and explanation.

mickmackusa
bbruman
  • What is that array_values and array_filter doing in there? Does it work if you only use the map? Obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 – Matt May 17 '17 at 14:31
  • https://3v4l.org/PMdrH ?? – hassan May 17 '17 at 14:33
  • I'm a bit confused about which bit of this isn't working - `$htmlarray2` will have the strings with the white-space preserved (as well as some empty ones), and `$htmlarray` will have the strings without white-space. You mention a loop that isn't working, but you haven't posted one. – iainn May 17 '17 at 14:34
  • @mkaatman the purpose of that is to remove empty array elements and re-index the array, a necessity in my case as I only want an array with actual text strings and of course a proper index for it. – bbruman May 17 '17 at 14:34
  • @hassan Yup it works just fine if I have it in a string like that. I posted an example of that in my question. It's just not trimming in my while loop for whatever reason. – bbruman May 17 '17 at 14:37
  • This code is working for me: https://pastebin.ca/3813710 – Matt May 17 '17 at 14:38
  • @mkaatman same as hassan... it works fine with just a normal string variable, but is not working in my while loop / mysql fetched string! That is the issue. – bbruman May 17 '17 at 14:39
  • Run `var_export($row['post_content']);` inside your loop and update your question with the result. – Matt May 17 '17 at 14:39
  • @mkaatman yeah sure no problem, just updated the question with the result. Although, I'm not sure what info to gain from this.. – bbruman May 17 '17 at 14:47
  • My apologies, the var_export I previously updated the question with was the one that was being converted (see pastebin in the comments for the Answer for this)... I updated the question with the actual `var_export($row['post_content']);` main difference is there are a lot of `&nbsp;`'s.. which is why I included a `//replace &nbsp; with white space` conversion code in my script... – bbruman May 17 '17 at 18:08

3 Answers

1

Here's your requested explanation on the regex pattern that solved your issue:

/[\s]+/ (more simply expressed as /\s+/) says "look for one or more white-space characters" (this includes ' ', '\r', '\n', '\t', '\f', and '\v'). The multi-line modifier/flag (m) is not necessary because you are not using anchors (^ $) in your pattern. The unicode modifier/flag (u) is absolutely critical in your case because your string of html text contains many little devils called...

"NO-BREAK SPACE" characters. Each one is the single unicode code point U+00A0 (written \x{00A0} in a pattern), which UTF-8 encodes as the two bytes 194 and 160 (0xC2 0xA0). See them highlighted here.

Without the u flag, the NO-BREAK SPACE characters remain and additional filtering will be required to remove them.
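To see the difference for yourself, here is a minimal sketch. It assumes UTF-8 strings and a PHP build where the u flag also makes \s Unicode-aware (which, as far as I know, is the case for current PHP versions); the sample string and variable names are made up for illustration.

```php
<?php
// Made-up sample: "&nbsp; &nbsp; " padding decoded to UTF-8, similar to the question's data.
$raw = html_entity_decode('&nbsp; &nbsp; ad Quisque Modeste&nbsp;', ENT_QUOTES, 'UTF-8');

// NO-BREAK SPACE (U+00A0) is the two bytes C2 A0 in UTF-8, followed here by a plain space (20).
echo bin2hex(substr($raw, 0, 3)), "\n"; // c2a020

// trim() strips only " \t\n\r\0\x0B" by default, so the leading C2 byte stops it immediately.
$trimmed = trim($raw);
var_dump($trimmed === $raw); // bool(true): nothing was removed

// Without the u flag the subject is handled byte by byte; the NBSP bytes are typically left alone.
$withoutU = trim(preg_replace('/\s+/', ' ', $raw));

// With the u flag, \s also matches U+00A0, so each run collapses to one plain space,
// which the default trim() can then remove.
$withU = trim(preg_replace('/\s+/u', ' ', $raw));
echo $withU, "\n"; // ad Quisque Modeste
```

This is also why the plain-string test in the question "worked": a string typed with ordinary spaces contains no C2 A0 bytes, so trim() had nothing unusual to deal with.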


While you eventually got your code to produce the right output, I'm happy to offer a leaner single-step pattern that will get you there faster purely using preg_split().

while ($row = $results->fetch_array()) {
    $texts = preg_split('/\s*<[^>]+>\s*/u', $row['post_content'], 0, PREG_SPLIT_NO_EMPTY);
    var_export($texts);
}

Here is a working regex101 demo.

This new splitting pattern still looks for your tags, but it is more efficient because between the < and >, I merely ask to match all characters that are "not >" by using [^>]+. This is simpler for the engine than the lazy .+?, which has to stop after every character it consumes and check whether > comes next.

Furthermore, I included matching for your unicode-extended white-space characters. \s* will match zero or more white-space characters before AND after each tag.

Finally, I should explain the additional parameters on preg_split(). The 0 says "find unlimited matches" -- this is the default behavior, but I must use 0 or -1 as its value to hold its place to ensure that the final parameter is used. PREG_SPLIT_NO_EMPTY spares you having to take the extra step of using array_filter() later. It omits any empty elements generated from the split, so you only get the good stuff.
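One caveat: the var_export() output in the question shows literal &nbsp; entities rather than raw NO-BREAK SPACE characters, so the entities presumably need to be decoded before \s can see them. Below is a minimal self-contained sketch of that combination. The helper name html_to_lines() and the shortened HTML fragment are hypothetical stand-ins for your loop body and $row['post_content'], and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper (not part of the original snippet): decode entities, then split on tags.
function html_to_lines(string $html): array
{
    // Turn &nbsp;, &amp;, etc. into real UTF-8 characters so that \s (with the u flag)
    // can match the resulting NO-BREAK SPACE characters.
    $decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // Split on every tag plus any white-space touching it; drop empty pieces.
    $parts = preg_split('/\s*<[^>]+>\s*/u', $decoded, 0, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example, invalid UTF-8 with the u flag).
    return $parts === false ? [] : $parts;
}

// Made-up fragment modelled on the question's markup.
$sample = '<td><strong>Donec Rem&nbsp;</strong><br>'
        . "\n<strong>&nbsp; &nbsp; &nbsp;ad Quisque Modeste</strong>"
        . '<strong>&nbsp; &nbsp;ac Rem Wisi<br>' . "\nNUNC<br></strong></td>";

print_r(html_to_lines($sample));
// Elements: "Donec Rem", "ad Quisque Modeste", "ac Rem Wisi", "NUNC"
```

Note that white-space at the very start or end of the input is not adjacent to a tag, so it would survive this split; for your table-wrapped content that should not matter.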

mickmackusa
  • No it was perfect. Just woke up and read this :) Very concise and helpful answer, I got a lot from it! Thinking I should definitely look more into regex as I can see it's a very powerful tool for manipulating large amounts of data – bbruman May 18 '17 at 13:40
0

Trim doesn't work in place. You want this:

$arrayvalue = trim($arrayvalue);

That's really it. Trim returns the trimmed string: it doesn't modify the variable in place.
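A tiny sketch of the difference, assuming a string padded with ordinary ASCII spaces (the value is made up):

```php
<?php
$arrayvalue = "   ad Quisque Modeste  ";

// The return value is discarded here, so $arrayvalue keeps its padding.
trim($arrayvalue);
$unchanged = $arrayvalue;

// Assigning the result back is what actually changes the variable.
$arrayvalue = trim($arrayvalue);
echo '[' . $arrayvalue . ']', "\n"; // [ad Quisque Modeste]
```

If the padding is something other than the characters trim() strips by default, the assignment alone will not be enough.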

Conor Mancone
  • So I did `$trim = trim($arrayvalue);` followed by `echo '<hr>' . $trim . '<hr>';` and then `var_dump($trim);` ... still returns `$trim` with all the excess whitespace that I don't want. Like I've stated this works with normal strings.. but is not working in my while loop.... – bbruman May 17 '17 at 15:18
  • There is definitely some strangeness going on. For starters, the line with `strtr` can be replaced with (I think): `$converted = html_entity_decode( $row['post_content'], ENT_QUOTES)` Also, this line isn't doing anything: `trim($converted, chr(0xC2).chr(0xA0));` You probably mean to say: `$converted = trim($converted, chr(0xC2).chr(0xA0));` If $converted has some wonky non-space characters at the beginning/end, then trim (line 36) won't work. The fact that you are trying (but failing) to remove some wonky non-space characters makes me think that is your next problem. – Conor Mancone May 17 '17 at 17:14
  • Thanks for the tips. Yeah, something wonky must be going on. I tried it out with your pastebin script... Here's the output: https://s4.postimg.org/8mwymfax9/localhost_cs_mysql2.php.png Same issue. Also empty node 23 is for some reason kept in the array. Additionally I added a var_dump for the trimmed (or actually not trimmed) `$arrayvalue` – bbruman May 17 '17 at 17:46
  • As for your second inquiry here's a full var_dump of `$converted` https://pastebin.com/JSMJ0n5a – bbruman May 17 '17 at 17:59
  • Posted a solution. Any idea why it works? This was overall a confusing mess :) – bbruman May 17 '17 at 22:16
0

I found a solution.

Not exactly sure how it works.. I'm quite unfamiliar with regex.

But the solution that I found (and maybe someone can explain it?) was

$clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray);

The entire script (excluding the MySQL stuff) that worked was

$converted = html_entity_decode( $row['post_content'], ENT_QUOTES);
$converted = trim($converted, chr(0xC2).chr(0xA0));

$htmlarray = preg_split('/<.+?>/', $converted);

$clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray);

$htmlarray2 = array_filter(array_map('trim', $clean_htmlarray));

$clean_htmlarray2 = array_values($htmlarray2);

echo '<pre>';
print_r($clean_htmlarray2);
echo '</pre>';

Output being

Array
(
    [0] => Saepe Encomia 2.aD NEC Mirum Populo Soluni Iis 8679-1370 Status Error Sed 9.9
    [1] => Description
    [2] => Donec Rem
    [3] => Animam Urgebat
    [4] => Rerum Sed 8613 - 3669 8358 & 6699
    [5] => 1.mE (magNA) QUO Ad Nominum Statum Massa
    [6] => ab SEM Autem Reddet Habitu Sit
    [7] => PRAEDAM ACCUMSAN PERSONARUM DENEGARE AC DUORUM
    [8] => Lius typi sit nec quo adversis cras ministri oppressa, versus class hic rem quos colubros ullo commune!economy!
    [9] => ad Quisque Modeste
    [10] => ac Rem Wisi
    [11] => ex Hac Congue mus Leo
    [12] => ab 7/92" Alias
    [13] => ad 2/73" Adverso & Erat
    [14] => me Personom Eget
    [15] => ad Viribus Fuga Fuga
    [16] => ab Louor-Sit Molles
    [17] => 3x Block-Off Plates
    [18] => ad Facunda
    [19] => ab Personas Diam
    [20] => NUNC
    [21] => ex Teniet te Palmam Eaque
    [22] => me Teniet in Versus Urna
    [23] => **CONDEMNENDUS REM CUM MAGNORUM**
)

A completely trimmed array.

This also works in my while loop for all rows, ie:

$results = $mysqli->query("SELECT ID, post_content
FROM wp_posts
LIMIT 50;");

In this case I get all 50 rows with completely trimmed strings.

So finally... this was a challenge to figure out!

I just wish I understood it more. I don't really feel like I deserve to be confirmed as the answer to this question, as all I really did was try a BUNCH of different things and finally this worked.

If someone wants to chime in and explain why $clean_htmlarray = preg_replace('/[\s]+/mu', ' ', $htmlarray); or rather /[\s]+/mu was what I needed in this instance, I'll gladly award the answer to them :)

As for now just glad it's working properly. Thanks everyone for all the help and input with this!

bbruman
  • Your regular expression is simply replacing all consecutive white space characters with a space. So for instance if it finds 5 consecutive space characters it will replace them with a single space. A regular expression's definition of "space character" is potentially broad. It includes things like tabs, newlines, etc. So a space followed by a tab followed by a new line and then a space will also get replaced by a single space. Normally newlines would be effectively ignored, but the 'm' flag to preg_replace changes that behavior. The PHP docs on preg_replace have more details about that. – Conor Mancone May 18 '17 at 12:20
  • My best bet on why it might matter is the 'u' flag to your preg_replace. The 'u' enables a unicode mode that may have a more liberal definition of what a whitespace character is. Without any params the trim function will remove only a small handful of characters. If you have some non-standard space characters in unicode, trim would ignore them and leave your string unmodified. However, your preg_replace with the 'u' flag may be converting these to regular spaces, which trim can then remove. If you take the 'u' flag out, and it stops working, that is probably what is happening. – Conor Mancone May 18 '17 at 12:22