What I am trying to do is rather simple: I want to print a date (timestamp) in Chinese (or Russian).

For all languages I am using

setlocale(LC_TIME, 'hu_HU.utf8', 'hu_HU.UTF-8', 'hu_HU', 'hr');
$date = strftime('%a %e %b %Y, %H:%M');

$date = utf8_encode($date);

This returns a UTF-8 string even without the utf8_encode(). Everything is fine. Now when I do exactly the same with the 'zh_CN.utf8' locale (or 'zh_CN.UTF-8', 'zh_CN' or 'zh'), it does not return the correct date. With or without the utf8_encode() it returns

'2018å¹?mæ?#dæ?'

I don't speak Chinese but this is obviously wrong. I found out that it should contain something like '年'. This character has the UTF-8 encoding E5 B9 B4, but the returned string contains different bytes: after 2018 there is C3 A5 C2 B9 3F 6D C3 A6 ....
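For illustration, here is a minimal sketch of one way bytes like C3 A5 C2 B9 can arise: the UTF-8 bytes of 年 are treated as ISO-8859-1 characters and encoded to UTF-8 a second time, which is what utf8_encode() does to a string that already is UTF-8. The sketch uses mb_convert_encoding() as a stand-in for utf8_encode(), and it is only an assumption that the string above was produced this way.

```php
<?php
// UTF-8 bytes of the character 年 (E5 B9 B4)
$correct = "\xE5\xB9\xB4";

// Treat every byte as an ISO-8859-1 character and encode it as UTF-8 again.
// This mirrors what utf8_encode() does to input that already is UTF-8.
$double_encoded = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

print(bin2hex($correct) . "\n");        // e5b9b4
print(bin2hex($double_encoded) . "\n"); // c3a5c2b9c2b4
```

The first two characters of the double-encoded string are C3 A5 and C2 B9, the same bytes that follow 2018 in the result above.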

When I check the encoding of the returned string with mb_detect_encoding(), it always returns UTF-8. I was expecting that, because I am using the 'zh_CN.utf8' locale, which sets the encoding to UTF-8.

After looking around for quite some time I came across this answer by Peter. He suggests using the format '%Y年%m月%e日' in the strftime() function. When I use this I get the same result as before.

This leads me to the thought that the encoding is wrong. But is this true? Is the encoding wrong? How do I convert the result to the correct encoding?

I have more or less the same problem with Russian.

miile7
  • 1
    Stop using `utf8_encode()` it is not magic, in fact it will corrupt your input more often than not. The same goes for `utf8_decode()`. Also `mb_detect_encoding()` should be called `mb_guess_encoding()` because that's what it's doing. If using what 'Peter' suggested doesn't work then I suspect that you've not properly specified the display encoding in the page, browser, or whatever you're using to look at the output. https://stackoverflow.com/questions/279170/utf-8-all-the-way-through – Sammitch Nov 29 '18 at 18:13
  • @Sammitch I am sorry but this does not really help me. I am writing the returned content to a plain text file. There is no browser page encoding given. This is why I was able to check the hex encoding. I'm not doing that in my browser output. I also tried to add some `BOM`s so maybe I could find out by luck which encoding `strftime()` is delivering. Also I know that `mb_detect_encoding()` is just guessing. But what else can I do to get the encoding? I am guessing too. – miile7 Dec 02 '18 at 17:55
  • Google "how to view UTF8 in $editor" because that's likely still your problem. – Sammitch Dec 02 '18 at 18:17
  • @Sammitch Thank you for your help. I will try this when I'm back at this project next week. But I'm not too confident. I am processing the text file with another program which is set to UTF-8 encoding as input. This program is throwing errors when I add the result of `strftime()`. This is how I encountered the problem. When I add the normal `年` letter it is working. So I don't think this has anything to do with the "presentation encoding". But I will give it a try. You will hear about the result in a few days. – miile7 Dec 02 '18 at 19:00

1 Answer

The solution

I spent several hours on this and found the correct encodings. strftime() does not return a UTF-8 string here. For details have a look at the bottom of this answer. I ended up with a formatTime() function which returns the time in the correct encoding (UTF-8 for me).

function formatTime($format, $language = null, $timestamp = null){
    // setlocale() returns the newly selected locale, so the previous one has
    // to be queried first ("0" only returns the current setting) in order to
    // restore it at the end
    $previous_locale = setlocale(LC_TIME, '0');

    switch($language){
        case 'chinese':
            setlocale(LC_TIME, 'zh_CN.utf8', 'zh_CN.UTF-8', 'zh_CN', 'zh');
            break;
        case 'hungarian':
            // 'hu' is the Hungarian language code ('hr' would select Croatian)
            setlocale(LC_TIME, 'hu_HU.utf8', 'hu_HU.UTF-8', 'hu_HU', 'hu');
            break;
        case 'russian':
            setlocale(LC_TIME, 'ru_RU.utf8', 'ru_RU.UTF-8', 'ru_RU', 'ru');
            break;
        case 'german':
            setlocale(LC_TIME, 'de_DE.utf8', 'de_DE.UTF-8', 'de_DE', 'de');
            break;
        case 'french':
            setlocale(LC_TIME, 'fr_FR.utf8', 'fr_FR.UTF-8', 'fr_FR', 'fr');
            break;
        case 'polish':
            setlocale(LC_TIME, 'pl_PL.utf8', 'pl_PL.UTF-8', 'pl_PL', 'pl');
            break;
        case 'turkish':
            setlocale(LC_TIME, 'tr_TR.utf8', 'tr_TR.UTF-8', 'tr_TR', 'tr');
            break;
        case 'english':
            setlocale(LC_TIME, 'en_GB.utf8', 'en_GB.UTF-8', 'en_GB', 'en');
            break;
        // ...
        default: break;
    }

    if(!is_numeric($timestamp)){
        $datetime = strftime($format);
    }
    else{
        $datetime = strftime($format, (int) $timestamp);
    }

    $current_locale = strtolower(setlocale(LC_TIME, '0'));

    // strpos() expects the haystack first and the needle second
    if(($pos = strpos($current_locale, "utf")) === false || strpos($current_locale, "8", $pos) === false){
        // UTF-8 locale is not used, the encodings are found out with the code shown below
        $locale_default_encodings = array(
            "german" => "ISO-8859-1",
            "french" => "ISO-8859-1",
            "polish" => "ISO-8859-2",
            "turkish" => "ISO-8859-9",
            // Testing hungarian results in "Windows-1252", but php.net recommends to
            // use ISO-8859-2 (*). These are different encodings (Windows-1252 extends
            // ISO-8859-1, not ISO-8859-2), so verify this entry on your system
            "hungarian" => "ISO-8859-2",
            "chinese" => "CP936",
            // the bytes c4 e5 ea e0 e1 f0 fc returned for December read "Декабрь"
            // in Windows-1251; read as KOI8-R they give "дЕЙЮАПЭ"
            "russian" => "Windows-1251"
        );
        $target_encoding = mb_internal_encoding(); // or "UTF-8" or whatever

        if(isset($locale_default_encodings[$language])){
            $datetime = mb_convert_encoding(
                $datetime,
                $target_encoding,
                $locale_default_encodings[$language]
            );
        }
        else{
            // try to avoid this case, the source encoding is unknown here
            $datetime = mb_convert_encoding($datetime, $target_encoding);
        }
    }

    // restore the locale that was active before this function was called
    setlocale(LC_TIME, $previous_locale);

    return $datetime;
}

(*): http://php.net/manual/de/function.strftime.php#94399
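The check whether the active locale is a UTF-8 locale can also be written as a small function of its own. This is only a sketch: localeIsUtf8() is a hypothetical helper name, and it assumes locale names of the form language_COUNTRY.codeset, as returned by setlocale(LC_TIME, '0').

```php
<?php
// Hypothetical helper (not part of the code above): reports whether a locale
// name declares a UTF-8 codeset.
function localeIsUtf8(string $locale_name): bool
{
    // the codeset follows the last dot, e.g. "UTF-8" in "zh_CN.UTF-8"
    $dot = strrpos($locale_name, '.');
    if ($dot === false) {
        return false; // no codeset given, e.g. "C" or "zh_CN"
    }
    $codeset = substr($locale_name, $dot + 1);

    // drop an optional modifier such as "@euro"
    $at = strpos($codeset, '@');
    if ($at !== false) {
        $codeset = substr($codeset, 0, $at);
    }

    // accept the spellings "utf8", "UTF-8", "utf-8", ...
    return strtolower(str_replace('-', '', $codeset)) === 'utf8';
}

var_dump(localeIsUtf8('zh_CN.UTF-8'));       // bool(true)
var_dump(localeIsUtf8('hu_HU.utf8'));        // bool(true)
var_dump(localeIsUtf8('Chinese_China.936')); // bool(false)
var_dump(localeIsUtf8('zh_CN'));             // bool(false)
```

Looking only at the codeset part avoids false positives from a language or country name that happens to contain "utf" or "8".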

The long long way

I checked the strftime("%B") result, which is the full month name, for each language. I looked up the translation of the month name for my languages and then the UTF-8 hex values of the letters of each translation.

Then I iterate through all the encodings that are supported by PHP. I convert the result of strftime() from the currently iterated encoding to UTF-8 and compare its hex values to the hex values of the manual translation, which are UTF-8 too. If they match, the result of strftime() is encoded in the currently iterated encoding.

I chose the hex values because they are definitely the same everywhere and do not depend on the internal encoding, since they are plain ASCII strings.

This gives me the following output, the code is posted below:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    </head>
    <body>
        <h1>Detecting the font encoding of <code>strftime()</code>
        </h1>
        <h2>hungarian</h2>
        <p>
            <code>strftime()</code> for March for language hungarian. Expected hex:  <code>6fc5be756a616b</code>, converted expected hex to string: <code>ožujak</code>
        </p>
        <table>
            <tr>
                <td>initial return value</td>
                <td>oߵjak</td>
                <td>6f9e756a616b</td>
            </tr>

            <tr>
                <td colspan='3'>Encodings that deliver the correct result:</td>
            </tr>
            <tr style='background: green;'>
                <td>Windows-1252</td>
                <td>ožujak</td>
                <td>6fc5be756a616b</td>
            </tr>
        </table>
        <h2>chinese</h2>
        <p>
            <code>strftime()</code> for December for language chinese. Expected hex:  <code>e58d81e4ba8ce69c88</code>, converted expected hex to string: <code>十二月</code>
        </p>
        <table>
            <tr>
                <td>initial return value</td>
                <td>ʮ׾Ղ</td>
                <td>caaeb6fed4c2</td>
            </tr>

            <tr>
                <td colspan='3'>Encodings that deliver the correct result:</td>
            </tr>
            <tr style='background: green;'>
                <td>EUC-CN</td>
                <td>十二月</td>
                <td>e58d81e4ba8ce69c88</td>
            </tr>
            <tr style='background: green;'>
                <td>CP936</td>
                <td>十二月</td>
                <td>e58d81e4ba8ce69c88</td>
            </tr>
            <tr style='background: green;'>
                <td>GB18030</td>
                <td>十二月</td>
                <td>e58d81e4ba8ce69c88</td>
            </tr>
        </table>
        <h2>russian</h2>
        <p>
            <code>strftime()</code> for December for language russian. Expected hex:  <code>d0b4d095d099d0aed090d09fd0ad</code>, converted expected hex to string: <code>дЕЙЮАПЭ</code>
        </p>
        <table>
            <tr>
                <td>initial return value</td>
                <td>ť롡</td>
                <td>c4e5eae0e1f0fc</td>
            </tr>

            <tr>
                <td colspan='3'>Encodings that deliver the correct result:</td>
            </tr>
            <tr style='background: green;'>
                <td>KOI8-R</td>
                <td>дЕЙЮАПЭ</td>
                <td>d0b4d095d099d0aed090d09fd0ad</td>
            </tr>
            <tr style='background: green;'>
                <td>KOI8-U</td>
                <td>дЕЙЮАПЭ</td>
                <td>d0b4d095d099d0aed090d09fd0ad</td>
            </tr>
        </table>
    </body>
</html>

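Note that this HTML is encoded in UTF-8, yet the bytes returned by strftime() are still not valid UTF-8. So this has nothing to do with the browser or editor encoding that was suggested in the comments.

Two remarks on the output above. ožujak is the Croatian word for March, not a Hungarian one; it shows up because 'hr', the last fallback in the hungarian locale list, is the Croatian locale code. And the expected Russian value дЕЙЮАПЭ is not a Russian word: the returned bytes c4 e5 ea e0 e1 f0 fc read Декабрь (December) in Windows-1251 and only read дЕЙЮАПЭ when they are interpreted as KOI8-R. The KOI8-R and KOI8-U matches therefore come from the expected value, and Windows-1251 is the more plausible encoding of the returned string.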

$encodings = array(
    "UCS-4",
    "UCS-4BE",
    "UCS-4LE",
    "UCS-2",
    "UCS-2BE",
    "UCS-2LE",
    "UTF-32",
    "UTF-32BE",
    "UTF-32LE",
    "UTF-16",
    "UTF-16BE",
    "UTF-16LE",
    "UTF-7",
    "UTF7-IMAP",
    "UTF-8",
    "ASCII",
    "EUC-JP",
    "SJIS",
    "eucJP-win",
    "SJIS-win",
    "ISO-2022-JP",
    "ISO-2022-JP-MS",
    "CP932",
    "CP51932",
    "SJIS-mac",
    "SJIS-Mobile#DOCOMO",
    "SJIS-Mobile#KDDI",
    "SJIS-Mobile#SOFTBANK",
    "UTF-8-Mobile#DOCOMO",
    "UTF-8-Mobile#KDDI-A",
    "UTF-8-Mobile#KDDI-B",
    "UTF-8-Mobile#SOFTBANK",
    "ISO-2022-JP-MOBILE#KDDI",
    "JIS",
    "JIS-ms",
    "CP50220",
    "CP50220raw",
    "CP50221",
    "CP50222",
    "ISO-8859-1",
    "ISO-8859-2",
    "ISO-8859-3",
    "ISO-8859-4",
    "ISO-8859-5",
    "ISO-8859-6",
    "ISO-8859-7",
    "ISO-8859-8",
    "ISO-8859-9",
    "ISO-8859-10",
    "ISO-8859-13",
    "ISO-8859-14",
    "ISO-8859-15",
    "ISO-8859-16",
    "byte2be",
    "byte2le",
    "byte4be",
    "byte4le",
    "BASE64",
    "HTML-ENTITIES",
    "7bit",
    "8bit",
    "EUC-CN",
    "CP936",
    "GB18030",
    "HZ",
    "EUC-TW",
    "CP950",
    "BIG-5",
    "EUC-KR",
    "UHC",
    "ISO-2022-KR",
    "Windows-1251",
    "Windows-1252",
    "CP866",
    "KOI8-R",
    "KOI8-U",
    "ArmSCII-8"
);

$show_wrong_encodings = false;
$internal_encoding = "UTF-8";
mb_internal_encoding($internal_encoding);

$languages = array(
    // name of the language => hex in UTF-8 and timestamp to check
    "german" => array("4dc3a4727a", 1520343439), // march
    "french" => array("64c3a963656d627265", 1544103703), // december
    "polish" => array("677275647a6965c584", 1544103703), // december
    "turkish" => array("4172616cc4b16b", 1544103703), // december
    "hungarian" => array("6fc5be756a616b", 1520343439), // march
    "chinese" => array("e58d81e4ba8ce69c88", 1544103703), // december
    "russian" => array("d0b4d095d099d0aed090d09fd0ad", 1544103703) // december
);

$format = "%B"; // print full month name
print("<h1>Detecting the font encoding of <code>strftime()</code></h1>\n");

foreach($languages as $language => $data){
    // the hex value in UTF-8, this is the target value
    $hex = $data[0];
    // the timestamp to check
    $timestamp = $data[1];

    print(
        "<h2>".$language."</h2>\n".
        "<p>".
            "<code>strftime()</code> for ".formatTime("%B", "english", $timestamp)." ".
            "for language ".$language.". Expected hex:  <code>".$hex."</code>, converted expected ".
            "hex to string: <code>".tostring($hex)."</code>".
        "</p>\n"
    );

    // this is a different formatTime() function than mentioned above, it is defined after this 
    // foreach
    $string = formatTime("%B", $language, $timestamp);

    print("<table>\n");
    print("<tr>\n".
            "\t<td>initial return value</td>\n".
            "\t<td>".$string."</td>\n".
            "\t<td>".tohex($string)."</td>\n".
        "</tr>\n\n".
        "<tr><td colspan='3'>Encodings that deliver the correct result:</td></tr>"
    );

    foreach($encodings as $source_encoding){
        $converted = mb_convert_encoding($string, $internal_encoding, $source_encoding);
        $converted_hex = tohex($converted);

        $style = "";
        if($converted_hex == $hex){
            $style = "background: green";
        }
        elseif(!$show_wrong_encodings){
            $style = "display: none";
        }

        print("<tr style='".$style.";'>\n".
                "\t<td>".$source_encoding."</td>\n".
                "\t<td>".$converted."</td>\n".
                "\t<td>".$converted_hex."</td>\n".
            "</tr>\n"
        );
    }
    print("</table>");
}

function tohex($string){
    return implode(unpack("H*", $string));
}

function tostring($hex){
    return pack("H*", $hex);
}

function formatTime($format, $language, $timestamp){
    // setlocale() returns the newly selected locale, so the previous one has
    // to be queried first in order to restore it afterwards
    $previous_locale = setlocale(LC_TIME, '0');

    switch($language){
        case 'chinese':
            setlocale(LC_TIME, 'zh_CN.utf8', 'zh_CN.UTF-8', 'zh_CN', 'zh');
            break;
        case 'hungarian':
            // 'hr' is the Croatian locale code; it is kept here because it
            // produced the output shown above
            setlocale(LC_TIME, 'hu_HU.utf8', 'hu_HU.UTF-8', 'hu_HU', 'hr');
            break;
        case 'russian':
            setlocale(LC_TIME, 'ru_RU.utf8', 'ru_RU.UTF-8', 'ru_RU', 'ru');
            break;
        case 'german':
            setlocale(LC_TIME, 'de_DE.utf8', 'de_DE.UTF-8', 'de_DE', 'de');
            break;
        case 'french':
            setlocale(LC_TIME, 'fr_FR.utf8', 'fr_FR.UTF-8', 'fr_FR', 'fr');
            break;
        case 'polish':
            setlocale(LC_TIME, 'pl_PL.utf8', 'pl_PL.UTF-8', 'pl_PL', 'pl');
            break;
        case 'turkish':
            setlocale(LC_TIME, 'tr_TR.utf8', 'tr_TR.UTF-8', 'tr_TR', 'tr');
            break;
        // ...
        default:
            setlocale(LC_TIME, 'en_GB.utf8', 'en_GB.UTF-8', 'en_GB', 'en');
            break;
    }

    // return the raw bytes of strftime() without any conversion
    $datetime = strftime($format, $timestamp);
    setlocale(LC_TIME, $previous_locale);

    return $datetime;
}
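
Condensed to its core, the comparison performed by the script above can be sketched as follows. matchingEncodings() is a hypothetical helper name, the candidate list is shortened for illustration, and the input bytes are the ones reported for the Chinese month name above.

```php
<?php
// Hypothetical helper: returns every candidate encoding under which $raw,
// converted to UTF-8, has exactly the expected UTF-8 hex representation.
function matchingEncodings(string $raw, string $expected_hex, array $candidates): array
{
    $matches = array();
    foreach ($candidates as $encoding) {
        // interpret the raw bytes in the candidate encoding
        $converted = mb_convert_encoding($raw, 'UTF-8', $encoding);
        if (bin2hex($converted) === $expected_hex) {
            $matches[] = $encoding;
        }
    }
    return $matches;
}

// bytes returned by strftime("%B") for the Chinese locale (see above)
$raw = hex2bin('caaeb6fed4c2');
// UTF-8 hex of the expected translation 十二月
$expected_hex = 'e58d81e4ba8ce69c88';

$candidates = array('UTF-8', 'ISO-8859-1', 'EUC-CN', 'CP936', 'KOI8-R');
print(implode(', ', matchingEncodings($raw, $expected_hex, $candidates)) . "\n");
```

With this shortened list only EUC-CN and CP936 are reported, in line with the table above. The result is only as good as the expected value: a wrong translation leads to a wrong encoding being reported.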
miile7
  • 3
    The short answer is really: *make sure your system has the UTF-8 variant of the desired locale installed…* – deceze Dec 06 '18 at 14:50
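
Whether a UTF-8 variant of a locale is installed can be checked from PHP, because setlocale() returns false when none of the given names is available. A minimal sketch, assuming a hypothetical helper name firstAvailableLocale(); the locale names in the usage line are the ones from the question:

```php
<?php
// Hypothetical helper: tries the given locale names for LC_TIME and returns
// the name that was accepted, or false if none of them is installed. The
// previously active locale is restored before returning.
function firstAvailableLocale(array $candidates)
{
    $previous = setlocale(LC_TIME, '0'); // "0" only queries the current locale
    $selected = setlocale(LC_TIME, $candidates);
    setlocale(LC_TIME, $previous);

    return $selected;
}

// prints the accepted locale name or bool(false), depending on the system
var_dump(firstAvailableLocale(array('zh_CN.utf8', 'zh_CN.UTF-8')));
```

If this prints false, the fallbacks without a codeset ('zh_CN', 'zh') are what setlocale() ends up using in the code above, and their encoding depends on the system.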