
I'm trying to use strip_tags and trim to detect whether a string contains only empty HTML, but the result is not an empty string:

$description = '<p>&nbsp;</p>';

$output = trim(strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));

var_dump($output);

string 'Â ' (length=2)

My attempt at debugging this:

$description = '<p>&nbsp;</p>';

$test = mb_detect_encoding($description);
$test .= "\n";
$test .= trim(strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
$test .= "\n";
$test .= html_entity_decode($description, ENT_QUOTES, 'UTF-8');

file_put_contents('debug.txt', $test);

Output: debug.txt

ASCII
 
<p> </p>
John Magnolia

1 Answer


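If you use var_dump(urlencode($output)) you'll see that it outputs string(6) "%C2%A0", so the string consists of the two bytes 0xC2 and 0xA0. Together these two bytes are the UTF-8 encoding of U+00A0, the non-breaking space. trim() leaves it in place because by default it only strips a fixed set of single-byte characters such as space, tab, newline and carriage return. The 'Â ' in your var_dump is those same two bytes displayed as Latin-1, so make sure your file is saved as UTF-8 and your HTTP headers declare UTF-8.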
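To make the byte-level picture concrete, here is a minimal sketch that decodes the entity on its own and inspects the result. It assumes the mbstring extension is loaded (the question already uses mb_detect_encoding); the variable name $nbsp is only illustrative.

```php
<?php
// Decode the entity by itself to see exactly what html_entity_decode() produces.
$nbsp = html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8');

var_dump(bin2hex($nbsp));            // string(4) "c2a0"
var_dump(strlen($nbsp));             // int(2), two bytes
var_dump(mb_strlen($nbsp, 'UTF-8')); // int(1), one character
var_dump(trim($nbsp) === $nbsp);     // bool(true), trim() did not remove it
```

This also suggests why the debug file reports ASCII: mb_detect_encoding was run on the original string, which contains only the ASCII entity text, not on the decoded output.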

That said, to trim this character you can use a regex with the Unicode modifier (instead of trim):

DEMO:

<?php

$description = '<p>&nbsp;</p>';

$output = trim(strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));

var_dump(urlencode($output)); // string(6) "%C2%A0"

// -------

$output = preg_replace('~^\s+|\s+$~', '', strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));

var_dump(urlencode($output)); // string(6) "%C2%A0"

// -------

$output = preg_replace('~^\s+|\s+$~u', '', strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
// Unicode! -----------------------^

var_dump(urlencode($output)); // string(0) ""

Regex autopsy:

  • ~ - the regex delimiter; it opens the pattern, and a second ~ closes it before the modifiers
  • ^\s+ - one or more whitespace characters at the start of the string (^ means start of the string, \s means a whitespace character, + means "one or more times")
  • | - OR
  • \s+$ - one or more whitespace characters at the end of the string ($ means end of the string)
  • ~ - the closing regex delimiter
  • u - the pattern modifier; here the Unicode modifier (PCRE_UTF8), which makes PCRE treat the pattern and subject as UTF-8 so that \s also matches Unicode whitespace such as the non-breaking space.
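
Putting the pieces together for the original goal (detecting "empty" HTML), the pattern above could be wrapped in a small helper. This is only a sketch: the function name isEmptyHtml is hypothetical and not part of PHP or of the demo, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: true when the markup contains nothing but tags
// and (Unicode) whitespace once entities are decoded.
function isEmptyHtml(string $html): bool
{
    $text = strip_tags(html_entity_decode($html, ENT_QUOTES, 'UTF-8'));

    // With the u modifier, preg_replace() returns null for invalid UTF-8 input,
    // in which case the strict comparison below simply yields false.
    $trimmed = preg_replace('~^\s+|\s+$~u', '', $text);

    return $trimmed === '';
}

var_dump(isEmptyHtml('<p>&nbsp;</p>'));       // bool(true)
var_dump(isEmptyHtml("<div>\n\t</div>"));     // bool(true)
var_dump(isEmptyHtml('<p> Hello&nbsp;</p>')); // bool(false)
```

Comparing against '' with === rather than using empty() avoids treating the string "0" as empty.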
h2ooooooo