13

While running a string through PHP's htmlentities function, I sometimes get an 'Invalid Multibyte Sequence' error. Is there a way to clean the string before calling the function to prevent this error from occurring?

GSto

7 Answers

10

As of PHP 5.4 you should use something along the lines of the following to properly escape output:

$escapedString = htmlspecialchars($string, ENT_QUOTES | ENT_SUBSTITUTE | ENT_DISALLOWED | ENT_HTML5, $stringEncoding);

ENT_SUBSTITUTE replaces invalid code unit sequences with � (instead of returning an empty string).

ENT_DISALLOWED replaces code points that are invalid in the specified doctype with �.

ENT_HTML5 specifies the used doctype. Depending on what you are using you may choose ENT_HTML401, ENT_XHTML or ENT_XML1.

Using those options you make sure that the result is always valid in the given doctype, no matter how malformed the input is.

Also, don't forget to specify the $stringEncoding. Relying on the default is a bad idea as it depends on ini settings and may (and did) change between versions.
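A minimal sketch of the call above wrapped in a helper, to show what ENT_SUBSTITUTE changes for a string with a bad byte in it. The function name `escape_html` and the sample input are illustrative assumptions, not part of the original answer; the type declarations and the `\u{...}` escape assume PHP 7 or later.

```php
<?php
// Hypothetical helper (the name is illustrative): escape a string for HTML5 output.
function escape_html(string $string, string $encoding = 'UTF-8'): string
{
    return htmlspecialchars(
        $string,
        ENT_QUOTES | ENT_SUBSTITUTE | ENT_DISALLOWED | ENT_HTML5,
        $encoding
    );
}

// "\xE9" is "é" in ISO-8859-1, but on its own it is not a valid UTF-8 sequence.
$broken = "caf\xE9 <b>";

// Without ENT_SUBSTITUTE (or ENT_IGNORE) the whole result is an empty string.
var_dump(htmlspecialchars($broken, ENT_QUOTES, 'UTF-8') === '');

// With ENT_SUBSTITUTE only the bad byte becomes U+FFFD; the rest is escaped as usual.
var_dump(escape_html($broken) === "caf\u{FFFD} &lt;b&gt;");
```

Passing the encoding as a parameter with an explicit default keeps the call independent of ini settings, which is the point made above.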

NikiC
  • The PHP documentation is unclear about it, but `ENT_HTML5` is redundant for htmlspecialchars. See http://stackoverflow.com/a/14532168/427545 – Lekensteyn Jan 26 '13 at 00:29
  • 3
    @Lekensteyn `ENT_HTML5` is not redundant, especially when `ENT_DISALLOWED` is used. It will replace code points that are invalid in the HTML5 doctype with the Unicode Replacement Character. E.g. see this example: http://codepad.viper-7.com/q5bPMQ The `ENT_HTML5 | ENT_DISALLOWED` makes sure that the output does not contain any invalid codepoints. – NikiC Jan 26 '13 at 13:37
  • Thanks for the correction, I have expanded my answer to take invalid characters into account. At first I did not know the difference between DISALLOWED and SUBSTITUTE, but now it has become clear to me. – Lekensteyn Jan 26 '13 at 15:11
9

I've encountered scenarios where specifying UTF-8 is not enough, and found the ENT_IGNORE option useful. I don't think it's documented for htmlentities, only for htmlspecialchars, but it does suppress the error.
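As a rough illustration of what ENT_IGNORE does (the sample string is an assumption made up for this sketch): the invalid byte is dropped silently, so the output is non-empty but no longer shows where the input was damaged.

```php
<?php
// "\xE9" is not a valid UTF-8 sequence on its own.
$broken = "caf\xE9 au lait";

// ENT_IGNORE discards the invalid byte instead of returning an empty string.
$ignored = htmlentities($broken, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

echo $ignored, "\n";
```

On PHP 5.4 and later, ENT_SUBSTITUTE is an alternative that marks the damaged spot with � rather than deleting it.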

Ted
  • 3
    I know this is an old topic but I came across this problem too and thought it might be worth noting that the use of ENT_IGNORE is not recommended as it may have security implications: http://unicode.org/reports/tr36/#Deletion_of_Noncharacters – Dean Mar 09 '12 at 07:26
  • Yeah, ENT_IGNORE was the only fix (/hack) that I found for this issue, at the moment. – Kzqai Jul 24 '12 at 18:35
  • Better try to fix your UTF8. In my case, I used `substr()` on the string, which produced invalid UTF8. Using `mb_substr()` instead fixed my issue. `ENT_IGNORE` would have worked as well, but it is not a clean solution. – Christopher K. Aug 15 '18 at 09:03
8

Before PHP 5.4.0, the default charset for htmlentities() was ISO-8859-1. (Manual)

You are probably applying it to a UTF-8 string. Specify the character set using

htmlentities($string, $flags, "UTF-8"); // $flags: whichever flags you need, e.g. ENT_QUOTES

Since PHP 5.4.0, the default charset is UTF-8.
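If the input may not be UTF-8 in the first place, one option is to normalise it before escaping. The sketch below is only an illustration: the helper name `to_utf8` is hypothetical, it needs the mbstring extension, and it assumes that anything which is not valid UTF-8 is ISO-8859-1, which may not hold for your data.

```php
<?php
// Hypothetical helper: return $string as valid UTF-8.
// Assumption: input that is not valid UTF-8 is encoded as ISO-8859-1.
function to_utf8(string $string): string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string; // already valid, leave untouched
    }
    return mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";        // "café" encoded as ISO-8859-1
$utf8   = to_utf8($latin1); // "é" is now the two-byte UTF-8 sequence

// The charset passed to htmlentities now matches the actual encoding.
echo htmlentities($utf8, ENT_QUOTES, 'UTF-8'), "\n";
```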

Yi Jiang
Pekka
6

In general, the PHP ini setting display_errors controls whether errors are output to the browser, and the ini setting log_errors independently controls whether errors are written to a log file. If a custom error handler has been set with set_error_handler(), it is always called for all errors and can then read the values of display_errors and log_errors, along with the value of error_reporting(), and take the appropriate course of action, right?

Wrong! In this case, htmlspecialchars() and htmlentities() only trigger the error if the value of display_errors is false. If the value of display_errors is true then no error is triggered at all! This seemingly nonsensical behaviour makes it impossible to detect these errors during debugging with display_errors on.

I got the information from here

Paul DelRe
Patrice Peyre
  • Thanks for pointing this out - it explains why I only saw this error on production! I couldn't figure out why, on my development box, where all error reporting is turned _ON_, I couldn't reproduce the error. – thaddeusmt Jan 03 '14 at 21:41
2

Do you use substr somewhere on the string you want to check? If so, I suggest using mb_substr instead. The problem is that substr is not Unicode-aware: it cuts at byte offsets, so it can chop a multibyte character in half.
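A small sketch of the difference, assuming the mbstring extension is available and PHP 7 or later for the `\u{...}` escape; the sample word is made up for illustration.

```php
<?php
$text = "na\u{EF}ve"; // "naïve": the "ï" occupies two bytes in UTF-8

// substr counts bytes, so a length of 3 ends in the middle of "ï".
$byteCut = substr($text, 0, 3);

// mb_substr counts characters, so the result is the complete "naï".
$charCut = mb_substr($text, 0, 3, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```

The byte-cut string is exactly the kind of input that makes htmlentities report an invalid multibyte sequence.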

-1

htmlentities($variable, ENT_QUOTES); always works just fine for me.

  • The default encoding in older versions of PHP is ISO-8859-1, and only from PHP 5.4 is it UTF-8. Note that regardless, it is not consistent across versions, so it's probably best to specify the encoding to match whichever encoding is actually in use. – Kzqai Jul 30 '12 at 12:55
-2

Note that using UTF-8 requires enabling multibyte string functions. This could mean replacing functions like substr with mb_substr, except that PHP provides an ini setting to turn on overloading of those functions with their mb equivalents.

See here for more detail: http://www.php.net/manual/en/mbstring.overload.php

Kzqai