Non-deterministic gettext behavior in PHP

Question

I am using gettext in PHP and I have a very strange problem. When I keep reloading the page, the strings are not translated to English most of the time, but sometimes they are!

My code looks like this (it is part of a larger codebase, though):

bindtextdomain('messages', dirname(__FILE__) . '/lang/nocache'); # https://stackoverflow.com/a/13629035 (I don't know whether this is the problem in my case, but just to be safe)
bindtextdomain('messages', dirname(__FILE__) . '/lang');
textdomain('messages');
bind_textdomain_codeset('messages', 'UTF-8');

$rv = setlocale(LC_MESSAGES, 'en_US.UTF-8');
var_dump($rv); # always returns "en_US.UTF-8"

echo "<br>TEST: " . _("Úvod");
echo "<br>TEST: " . _("Výsledky");
echo "<br>TEST: " . gettext("Úvod");

When I cut this snippet out into a separate PHP file, the text always gets translated. Still, I don't suspect the rest of my code of causing the non-determinism, since this piece is compact and not affected by other parts. I suspect that the non-determinism only appears under certain conditions, and unfortunately those conditions disappear when I put the code in a small test file (you must know this Murphy's law nightmare very well :-)).

I don't know if it is a gettext cache problem, but I haven't changed the translations. I tried the recommended nocache dir trick and I also tried to restart Apache (Apache/2.4.10 (Debian)), but it didn't help. My PHP version is 5.6.12-0+deb8u1.

How can this be non-deterministic? Where is the problem?

EDIT: encoding in /etc/php5/apache2/php.ini:

; PHP's default character set is set to UTF-8.
; http://php.net/default-charset
default_charset = "UTF-8"

; PHP internal character encoding is set to empty.
; If empty, default_charset is used.
; http://php.net/internal-encoding
;internal_encoding =

; PHP input character encoding is set to empty.
; If empty, default_charset is used.
; http://php.net/input-encoding
;input_encoding =

; PHP output character encoding is set to empty.
; If empty, default_charset is used.
; mbstring or iconv output handler is used.
; See also output_buffer.
; http://php.net/output-encoding
;output_encoding =

Unfortunately if you can't reproduce the problem with this code we can't be expected to either. See http://sscce.org/ for some pointers. If the problem doesn't manifest itself in this code then *occam's razor* applies and it's most likely that the problem isn't in this code at all, but somewhere else. Possibly not even related to `gettext`. Try seeking elsewhere. — Sherif, Oct 25 '15 at 09:15
@Sherif I know SSCCE principle very well and I am using it where possible. But this problem is non-deterministic (**why the translation sometimes works and sometimes not, based just on reload when nothing else changes??**) so it failed. But hopefully someone who had this experience with gettext will be able to answer. — Tomas, Oct 25 '15 at 09:19
Right, I'm suggesting that if isolating the code in question does not reproduce the bug then perhaps the bug has nothing to do with this code. Inconsistent bugs are a sign of relying on state. I posit that somewhere in the process of trying to isolate the code in question you may have inadvertently also removed the part that causes the bug in the first place. No one is saying you are incompetent. I'm just suggesting that it's obvious to me you are looking in the wrong place. — Sherif, Oct 25 '15 at 09:24
FWIW, I can suggest that you replace `dirname(__FILE__)` with `__DIR__`, but I see nothing obviously wrong with your code to cause this problem. Perhaps some step-through in the debugger may help you more narrowly isolate this problem. May I suggest trying [phpdbg](http://phpdbg.com/docs)? — Sherif, Oct 25 '15 at 09:27
I'd suspect an encoding problem, like a call to *mb_...* (eg *mb_internal_encoding*), or the php.ini config - or lack of ... If the small php file is run on the command line, the CLI php.ini is used, its config may not be like Apache one. Having a discrepancy between how internally strings are coded, and the way they're interpreted (utf8) could explain non-determinism: UTF8 requires up to 4 bytes per character, if the literal encoding is not utf8, the UTF8 parsing may "jump" over the literal and find a '\0' (end of string internally: match), sometimes no (no match in the mapping table). — Déjà vu, Oct 25 '15 at 09:58
@ringø, the small PHP file was tested in the browser, i.e. not on the command line. Please see my update, I have attached my php.ini charset configuration. I am not calling *mb_...* functions... But I don't understand your explanation of the non-determinism. Where does it come from? What is different between separate runs of exactly the same code? — Tomas, Oct 25 '15 at 10:12
Well it comes from my own experience from C and PHP behavior is sometimes very C-like (it's written in C and doesn't always perform all the necessary checks). Basically if a string encoding allows a 1-byte char int value to be in the [128,224[ range, and then it's UTF8 parsed, UTF8 says the byte in this range makes a char of two bytes, and the parser could "eat" the 2nd byte (eg `(int)0`) - not sure it applies, but that could lead to a UB (3rd byte being 0 or no...). That's only suppositions. — Déjà vu, Oct 25 '15 at 12:30

score 0 · Accepted Answer · answered Oct 30 '15 at 19:16

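The problem was indeed the directory. The code that uses __FILE__ lives in a different directory and is pulled in with include, and I didn't realize that __FILE__ then resolves to the included file's own path rather than the path of the script that includes it, so bindtextdomain was pointed at the wrong directory. So I hardcoded the path instead: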

bindtextdomain('messages', './lang/nocache'); # http://stackoverflow.com/a/13629035 
bindtextdomain('messages', './lang');
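
Note that a relative path such as './lang' is resolved against the current working directory, which may not be the same for every entry point. A minimal sketch of one way to make the lookup explicit, assuming the translations live in a lang subdirectory of some application root: the helper names resolve_lang_dir() and bind_messages_domain() are hypothetical, not part of gettext, and the base directory is passed in by the caller instead of being derived from __FILE__ inside an included file.

```php
<?php
// Hypothetical helper (not part of the gettext extension): resolve
// "<baseDir>/lang" to an absolute path, or throw if it does not exist.
function resolve_lang_dir($baseDir)
{
    $path = realpath($baseDir . '/lang'); // false when the path is missing
    if ($path === false || !is_dir($path)) {
        throw new RuntimeException("Translation directory not found under: $baseDir");
    }
    return $path;
}

// Hypothetical wrapper: bind the 'messages' domain only when the directory
// really exists, and report whether the binding succeeded.
function bind_messages_domain($baseDir)
{
    if (!function_exists('bindtextdomain')) {
        return false; // gettext extension not loaded
    }
    $langDir = resolve_lang_dir($baseDir);
    if (bindtextdomain('messages', $langDir) === false) {
        return false;
    }
    bind_textdomain_codeset('messages', 'UTF-8');
    textdomain('messages');
    return true;
}

// Usage from the entry script (so __DIR__ is the entry script's directory):
// bind_messages_domain(__DIR__);
```

With this shape, a wrong base directory fails loudly with an exception instead of silently leaving the strings untranslated.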

But what is really strange is that even with the wrong directory, it sometimes worked!!! This is really treacherous.
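
I never confirmed why it sometimes worked. One hypothesis (an assumption, not something I verified) is that gettext bindings are process-level state, so an Apache worker process that had earlier served a script with the correct directory could keep that binding, while other workers had none. A minimal diagnostic sketch for checking this kind of thing: gettext_diagnostics() is a hypothetical helper that records values which can differ between requests.

```php
<?php
// Hypothetical diagnostic helper: collect facts that may differ from one
// request to the next (worker process, working directory, binding result).
function gettext_diagnostics($domain, $dir)
{
    $info = array(
        'pid'           => getmypid(),    // which server process handled the request
        'cwd'           => getcwd(),      // base for relative paths like './lang'
        'requested_dir' => $dir,
        'resolved_dir'  => realpath($dir), // false when the directory does not exist
        'bound_dir'     => null,
    );
    if (function_exists('bindtextdomain')) {
        // A return value of false signals that this call failed.
        $info['bound_dir'] = bindtextdomain($domain, $dir);
    }
    return $info;
}

// Usage: log one line per request and compare translated vs. untranslated loads.
// error_log(json_encode(gettext_diagnostics('messages', './lang')));
```

If the translated and untranslated page loads turn out to come from different process IDs, that would support the per-process explanation; if not, the cause lies elsewhere.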
