
So I have this code:

var_dump(trim(filter_var("\nLook ma, there are special characters:\n<>\"'&©", FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_LOW | FILTER_FLAG_ENCODE_HIGH)));

Which will output this:

string(66) "&#10;Look ma, there are special characters:&#10;&#34;&#39;&&#194;&#169;"

The problem is that the encoded character &#194; is the character Â, which was not in the original text.

My question: why does this happen, and how can I remove the extra Â character?

Nicholas Summers

1 Answer


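It's not an extra character; it's the first byte of a multi-byte Unicode character. In UTF-8, © (U+00A9) is stored as the two bytes 0xC2 0xA9 (194 and 169 in decimal), and the filter encodes each of those bytes separately.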
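To see where the 194 comes from, here is a minimal sketch that inspects the raw bytes of "©". The character is written as explicit UTF-8 byte escapes so the result does not depend on how the source file is saved; the variable names are only illustrative.

```php
<?php
// "©" is U+00A9; in UTF-8 it is the two bytes 0xC2 0xA9.
// Written as byte escapes so the file's own encoding doesn't matter.
$copyright = "\xC2\xA9";

var_dump(strlen($copyright));   // int(2): strlen() counts bytes, not characters
var_dump(bin2hex($copyright));  // string(4) "c2a9"

// Encode every byte on its own, which is what the byte-wise
// high-value encoding in the question's output amounts to.
$bytes = array_map('ord', str_split($copyright));
foreach ($bytes as $byte) {
    echo "&#{$byte};";          // prints &#194;&#169;
}
echo "\n";
```

A browser then renders &#194; as Â and &#169; as ©, which is why an apparently new character shows up.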

You actually asked the function to do this, by giving it the FILTER_FLAG_ENCODE_LOW | FILTER_FLAG_ENCODE_HIGH flag expression.

If you don't encode "high" values, the result changes but is still not very useful:

var_dump(trim(filter_var("\nLook ma, there are special characters:\n<>\"'&©", FILTER_SANITIZE_STRING, FILTER_FLAG_ENCODE_LOW)));
//  string(61) "&#10;Look ma, there are special characters:&#10;&#34;&#39;&┬®"

What to do next really depends on your requirements. I suspect filter_var is not what you're looking for, if you want to handle Unicode characters too.

If ANSI is enough for you, I found that a quick fix was to change my PHP source file's encoding to ANSI mode (not UTF-8!), fix the now-broken "©" glyph by removing the orphaned "Â", and run the script again:

// string(65) "&#10;Look ma, there are special characters:&#10;&#34;&#39;&&#169;"

But this is kind of limiting.
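If the source has to stay in UTF-8 and the output should still be ASCII-only, one possible route outside `filter_var` is to escape the HTML-special characters first and then turn every remaining non-ASCII code point into a single numeric entity. This is only a sketch under assumptions: it needs the mbstring extension, `encode_ascii_safe` is a made-up helper name, and unlike `FILTER_SANITIZE_STRING` it neither strips tags nor encodes low control characters such as `\n`.

```php
<?php
// Hypothetical helper (not part of PHP); assumes ext-mbstring is loaded
// and that $input is valid UTF-8.
function encode_ascii_safe(string $input): string
{
    // Escape &, <, >, " and ' first, so the entities produced in the
    // next step are not escaped a second time.
    $escaped = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Map format: [first code point, last code point, offset, mask].
    // This covers everything above ASCII, one entity per code point.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

// "©" written as explicit UTF-8 bytes.
echo encode_ascii_safe("<>\"'&\xC2\xA9"), "\n";
// prints &lt;&gt;&quot;&#039;&amp;&#169;
```

Because the conversion works on code points rather than bytes, © comes out as &#169; alone, with no stray &#194;.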

Have a read through the following manual pages for more information:

Lightness Races in Orbit
  • Well, that definitely explains the why (+1), but I still need to know how to remove the character. I would guess something with `mb_convert_encoding()` would fix it; however, I don't know enough about character encodings to know which one to convert to. (I edit files in UTF-8 mode, if that helps.) – Nicholas Summers Mar 15 '17 at 16:48
  • We would need to know specifically what your goal is and specifically what your requirements are. What do you want to do with which characters? And for what reason? – Lightness Races in Orbit Mar 15 '17 at 16:49
  • The ascii range is enough for me, I don't need to support user input for characters outside of that range. (however, I do need to support UTF-8 in the source code) – Nicholas Summers Mar 15 '17 at 16:51
  • Okay but you're still not telling us what you're attempting to accomplish. – Lightness Races in Orbit Mar 15 '17 at 16:52
  • So _stripping_ high values may work for you, I'm not sure. FWIW, © is not actually ASCII. – Lightness Races in Orbit Mar 15 '17 at 16:52
  • Essentially, the only specific goal with this is to allow all ASCII characters to be properly encoded (as needed). Beyond the ASCII range they are stripped, and stripping the high characters is not an option. This is a part of a much larger sanitizing class used throughout an application. – Nicholas Summers Mar 15 '17 at 16:55
  • Well then you have a problem because © is not ASCII. You're over-simplifying how character encoding works. – Lightness Races in Orbit Mar 16 '17 at 00:25