We are dealing with a strange bug on a Joyent Solaris server that we have never seen before (it doesn't happen on localhost or on two other Solaris servers with an identical PHP configuration). I'm not sure whether we should be looking at PHP or at Solaris, or whether it is a software or a hardware problem...

I just want to post this in case somebody can point us in the right direction.

So, the problem seems to be in var_export() when dealing with non-ASCII characters. Executing this in the CLI, we get the expected result on our localhost machines and on two of the servers, but not on the third one. All of them are configured to work with UTF-8.

$ php -r "echo var_export('ñu', true);"

Gives this in older servers and localhost (expected):

'ñu'

But in the server we are having problems with (PHP Version => 5.3.6), it adds \0 null characters whenever it encounters an "uncommon" character: è, á, ç, ... you name it.

'' . "\0" . '' . "\0" . 'u'

Any idea where we should be looking? Thanks in advance.


More info:

  • PHP version 5.3.6.
  • setlocale() is not solving anything.
  • default_charset is UTF-8 in php.ini.
  • mbstring.internal_encoding is set to UTF-8 in php.ini.
  • mbstring.func_overload = 0.
  • this happens in both CLI (example) and web application (php-fpm + nginx).
  • iconv encoding is also UTF-8
  • all files utf-8 encoded.

system('locale') returns:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
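The shell's locale and the locale the PHP process itself runs under may differ, so it can help to ask PHP directly. A minimal sketch, assuming only the standard `setlocale()` behaviour of returning the current setting when the locale argument is "0"; `current_locales()` is a hypothetical helper name, not a built-in:

```php
<?php
// Hypothetical helper: returns the locale PHP currently uses for each category.
// Passing "0" as the locale makes setlocale() report the current setting
// without changing it.
function current_locales()
{
    $categories = array(
        'LC_CTYPE'    => LC_CTYPE,
        'LC_COLLATE'  => LC_COLLATE,
        'LC_NUMERIC'  => LC_NUMERIC,
        'LC_TIME'     => LC_TIME,
        'LC_MONETARY' => LC_MONETARY,
    );
    $result = array();
    foreach ($categories as $name => $constant) {
        $result[$name] = setlocale($constant, '0');
    }
    return $result;
}

foreach (current_locales() as $name => $value) {
    echo $name, '=', var_export($value, true), "\n";
}
```

LC_CTYPE is the category that matters for `strtoupper()` and `ucfirst()`, so that is the line to compare between the working and the broken server.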

Some of the tests done so far (CLI):

Normal behaviour:

$ php -r "echo bin2hex('ñu');" => 'c3b175'
$ php -r "echo mb_strtoupper('ñu');" => 'ÑU'
$ php -r "echo serialize(\"\\xC3\\xB1\");" => 's:2:"ñ";'
$ php -r "echo bin2hex(addcslashes(b\"\\xC3\\xB1\", \"'\\\\\"));" => 'c3b1'
$ php -r "echo ucfirst('iñu');" => 'Iñu'

Not normal:

$ php -r "echo strtoupper('ñu');" => 'U' 
$ php -r "echo ucfirst('ñu');" => '?u' 
$ php -r "echo ucfirst(b\"\\xC3\\xB1u\");" => '?u' 
$ php -r "echo bin2hex(ucfirst('ñu'));" => '00b175'
$ php -r "echo bin2hex(var_export('ñ', 1));" => '2727202e20225c3022202e202727202e20225c3022202e202727'
$ php -r "echo bin2hex(var_export(b\"\\xC3\\xB1\", 1));" => '2727202e20225c3022202e202727202e20225c3022202e202727'

So the problem seems to be in var_export() and in "string functions that use the current locale but operate byte-by-byte" (Docs; see @hakre's answer).

hakre
eillarra
  • I'd start by checking the version of software running on each server. Specifically php. A function in one version assumes UTF-8 while the same function in a different version assumes ISO-8859-1. – Tim Gautier Mar 16 '12 at 17:25
  • Also try comparing the output of `locale(1)` and/or checking the environment variables that start with `LC`. – Martin Carpenter Mar 20 '12 at 07:31
  • Does this only happen on the CLI? That may be some special case of how Solaris' terminal handles Unicode. Or does this happen as well when running from source code files which guaranteed do not contain `NUL` bytes? – deceze Apr 11 '12 at 13:38
  • Check two things, one the php.ini that gets executed at CLI (might differ from the one over webserver), setting there the default_charset to "utf-8". Secondly check /etc/locale.gen if you even have an en_US.UTF-8 on that one server. – Christian Riesen Apr 12 '12 at 12:05
  • @deceze this happens in both CLI (example) and web application (php-fpm + nginx). – eillarra Apr 12 '12 at 12:41
  • @Christian `php.ini`'s default_charset is `UTF-8`. – eillarra Apr 12 '12 at 12:55
  • Interspersing characters with NULs means that you're probably looking at UTF-16. – Ignacio Vazquez-Abrams Apr 13 '12 at 20:00
  • @DavidBélanger default iconv encoding in the server now is `ISO-8859-1`. I changed it to `UTF-8` some days ago in `php.ini` but the problem remained, so I reverted it to the original configuration... – eillarra Apr 13 '12 at 20:06
  • @DavidBélanger double checked :) iconv encoded as `UTF-8` doesn't change a thing, both `var_export()` and `ucfirst()` don't work. – eillarra Apr 13 '12 at 20:19
  • If it's available on your server, try detecting the actual encoding with `mb_detect_encoding()` – Niko Apr 14 '12 at 09:09
  • @Niko mbstring extension is available in `php.ini`, and `mbstring.internal_encoding` is set to `UTF-8`. `mb_detect_encoding('ñu')` returns `UTF-8`. – eillarra Apr 14 '12 at 10:52
  • @eillarra And how about `mb_detect_encoding(ucfirst('ñu'))`? – Niko Apr 14 '12 at 10:54
  • @Niko `mb_detect_encoding(ucfirst('ñu'))` returns `false`. – eillarra Apr 14 '12 at 11:00
  • `mb_detect_encoding` is broken by design. Don't rely on that function, its outcome does not say much. Handle with care. – hakre Apr 14 '12 at 12:27
  • I think your best bet is to check whether the string has valid characters... yesterday I was trying to run htmlentities on a string and the result was NULL because my variable had some characters from Word (that's what I found)... so try to encode something else manually. – David Bélanger Apr 16 '12 at 15:10
  • @DavidBélanger thanks for the comment, but string is 'clean', and as you can see the problem happens both in CLI and server (all tests included are in CLI). – eillarra Apr 16 '12 at 18:47
  • @eillarra Okay, it's very weird... I try to think why but it's hard when I am not in the hood... – David Bélanger Apr 16 '12 at 18:48
  • I'm sure this is related to Solaris and the system C libraries that are used by PHP. I'd say that the compiled packages have been messed up by the hoster, otherwise `strtoupper` would be working. Get proper binaries. – hakre Apr 16 '12 at 21:18

5 Answers

I suggest you verify the PHP binary you've got problems with. Check the compiler flags and the libraries it makes use of.

Normally PHP internally uses binary strings, which means that functions like ucfirst work byte by byte and only support what your locale supports (if and as configured). See Details of the String Type (Docs).

$ php -r "echo ucfirst('ñu');" 

returns

?u

This makes sense, ñ is

LATIN SMALL LETTER N WITH TILDE (U+00F1)    UTF8: \xC3\xB1

You have some locale configured that makes PHP change \xC3 into something else, breaking the UTF-8 byte sequence and making your shell display the � replacement character (Wikipedia).

If you really want to analyze the issue, I suggest you start with hexdumps next to how things get displayed in the shell and elsewhere. Know that you can explicitly define binary strings with b"string" (that's for forward compatibility; maybe you've got some compile flag enabled and you're on experimental Unicode?), and you can also write strings literally, here the hex way for UTF-8:

 $ php -r "echo ucfirst(b\"\\xC3\\xB1u\");"
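To see exactly which bytes get rewritten, one option is to push every high byte through the function and list the ones that come back changed. This is only a sketch: `report_changed_bytes()` is a hypothetical helper, not something PHP ships, and on recent PHP versions where `strtoupper()` no longer consults the locale it should simply report nothing.

```php
<?php
// Sketch only: report_changed_bytes() is a hypothetical helper, not a PHP built-in.
// It feeds every single byte 0x80-0xFF through a string function and
// lists the bytes that come back different.
function report_changed_bytes($fn)
{
    $changed = array();
    for ($i = 0x80; $i <= 0xFF; $i++) {
        $in  = chr($i);
        $out = call_user_func($fn, $in);
        if ($out !== $in) {
            $changed[] = sprintf('0x%02x -> 0x%s', $i, bin2hex($out));
        }
    }
    return $changed;
}

foreach (array('strtoupper', 'ucfirst') as $fn) {
    $changed = report_changed_bytes($fn);
    echo $fn, ': ', count($changed), " high byte(s) changed\n";
    foreach ($changed as $line) {
        echo '  ', $line, "\n";
    }
}
```

Under a UTF-8 locale I would expect no high byte to change at all. A single-byte locale can legitimately map some of them to other letters, but a mapping to 0x00 would point at the C library rather than at your PHP code.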

And there are a lot more settings that can play a role, I started to list some points in an answer to Preparing PHP application to use with UTF-8.


Example of a multibyte ucfirst variant:

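/**
 * multibyte ucfirst
 *
 * @param string $str
 * @param string|null $encoding (optional) defaults to mb_internal_encoding()
 * @return string
 */
if (!function_exists('mb_ucfirst')) { // newer PHP versions ship their own mb_ucfirst()
    function mb_ucfirst($str, $encoding = NULL)
    {
        // Older PHP versions reject a NULL encoding, so resolve the default explicitly.
        if ($encoding === NULL) {
            $encoding = mb_internal_encoding();
        }
        $first = mb_substr($str, 0, 1, $encoding);
        // Use the character count rather than strlen()'s byte count for the remainder.
        $rest = mb_substr($str, 1, mb_strlen($str, $encoding), $encoding);
        return mb_strtoupper($first, $encoding) . $rest;
    }
}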

See mb_strtoupper (Docs) and also mb_convert_case (Docs).
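A quick way to check a multibyte variant is to compare hex output instead of trusting what the terminal shows. The sketch below assumes the mbstring extension is loaded and uses `mb_convert_case()` for the first character; `utf8_ucfirst()` is a hypothetical name, chosen so it cannot clash with anything built in.

```php
<?php
// Hypothetical helper name; UTF-8 is hard-coded to keep the sketch short.
function utf8_ucfirst($str)
{
    $first = mb_substr($str, 0, 1, 'UTF-8');
    $rest  = mb_substr($str, 1, mb_strlen($str, 'UTF-8'), 'UTF-8');
    return mb_convert_case($first, MB_CASE_UPPER, 'UTF-8') . $rest;
}

// "\xC3\xB1u" is 'ñu'; the upper-case Ñ (U+00D1) is \xC3\x91 in UTF-8.
echo bin2hex(utf8_ucfirst("\xC3\xB1u")), "\n"; // prints c39175
```

If this prints `c39175` on every server, the multibyte functions are unaffected and only the byte-wise, locale-dependent ones are involved.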

hakre
  • I've made the 'hexadecimal test': all the servers, including the 'bad guy', return `c3b175` when executing `$ php -r "echo bin2hex('ñu');"`. Not sure how I should interpret this... – eillarra Apr 14 '12 at 10:39
  • And `$ php -r "echo ucfirst(b\"\\xC3\\xB1u\");"` returns `?u`. – eillarra Apr 14 '12 at 10:48
  • and what does `bin2hex(ucfirst('ñu'));` give? (your report show that for both cases, PHP uses the UTF-8 sequences inside the strings, so that is the same across those systems). – hakre Apr 14 '12 at 12:04
  • `bin2hex(ucfirst('ñu'));` returns `00b175`. – eillarra Apr 14 '12 at 12:09
  • On the system that's broken, I guess. Your locale is changing `0xC3` to `0x00`. What does it give on a system that works? And keep in mind you *need* to use `mb_ucfirst('ñu', 'UTF-8')` (multibyte replacement for `ucfirst`) anyway, because it's a multibyte charset you use (UTF-8). – hakre Apr 14 '12 at 12:11
  • Yes, `00b175` is in the 'broken' server. The others just return `c3b175`. Can't use `mb_ucfirst()`, are you sure that function exists? Is not here... http://www.php.net/manual/en/ref.mbstring.php – eillarra Apr 14 '12 at 12:14
  • That `mb_ucfirst` was virtual, I'll add one to the answer for an example. – hakre Apr 14 '12 at 12:22
  • Yes, I see what you mean: I added a `mb_strtoupper()` test to the question. Results differ in the 'broken' server. – eillarra Apr 14 '12 at 12:24
  • @eillarra: Can you please give the output of `uname -a`? – hakre Apr 16 '12 at 15:32
  • `SunOS 7aaefcc.local 5.11 joyent_20120126T071347Z i86pc i386 i86pc Solaris`. As I say in the question, it is a Joyent machine, SmartOS they call it (based on Solaris), with PHP 5.3.6. – eillarra Apr 16 '12 at 15:44
  • Well, as this is related even to `strtoupper`, I suspect it's related to the underlying C libs PHP was compiled against. You should check with Joyent support and ask them to provide a properly configured/compiled binary. Also I suggest you get a more current PHP 5.3 version, like PHP 5.3.10. – hakre Apr 16 '12 at 16:04

Try forcing UTF-8 in PHP:

<?php ini_set('default_charset', 'UTF-8'); ?>

Put it at the very top (first line of code) of every page/template. It mostly helps me with my special characters. I'm not sure it will help you too, but try it.

Dudeist

All your servers are probably in a good state. In one of the comments you said that you only have issues with ucfirst() and var_export(). Given that, you might want to look at this Stack Overflow question. Most PHP string functions do not work properly with multibyte strings, which is why PHP has a separate set of functions (mbstring) to deal with them.

This might be helpful

sakhunzai

I normally use utf8_encode('ñu') for all the French characters.

vinay rajan
  • Thanks Vinay, but it seems to be an underlying C problem, maybe a compilation problem. Still trying to find it out, but PHP doesn't seem to be the source of the problem. – eillarra Apr 18 '12 at 21:30

PHPUnit tests for this are being added to https://gist.github.com/68f5781a83a8986b9d30. Can we build up a better unit test suite so that we can figure out what the expected output should be?