
I'm currently working on a project where I need to read a file containing language information sequentially, in 256-byte records. The value for language code 1 starts at offset 0, the value for language code 2 starts at offset 256, and so on.

I don't fully understand the file's encoding, though. The author says the file is encoded in Unicode, and Notepad++ seems to confirm this: it identifies the file as UCS-2 LE w/o BOM.

I'm trying to convert the text before splitting it into 256-byte chunks, like so:

$content = mb_convert_encoding($content, 'UTF-8', 'UCS-2LE');

This produces values like "Пользователь заблокирован". I know the file is in Russian, so this looks promising. However, some values still come out garbled:

"┐. ð¢ð░Ðüð¥Ðü ÐëðÁð╗ð¥Ðçð©       ð£ð░"

Converting it with iconv instead produces the same result:

$content = iconv('UTF-16', 'UTF-8', $content);

Here are the encodings reported by the different sources:

Author:
    "Unicode"

file -i <FILENAME>
    "<FILENAME>: application/octet-stream; charset=binary"

mb_detect_encoding($content);
    "UTF-8"

Notepad++:
    "UCS-2 LE w/o BOM"

And here is a part of the file (extracted via vi, newlines added for clarity):

^_^D>^D;^D=^DK^D9^D ^@0^D4^D@^D5^DA^D ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@L^@a^@n^@g^@u
^@a^@g^@e^@ ^@S^@p^@r^@a^@c^@h^@e^@ ^@L^@a^@n^@g^@u^@e^@ ^@L^@i^@n^@g^@u^@a^@ 
^@I^@d^@i^@o^@m^@a^@ ^@/^D7^DK^D:^D ^@B^@a^@h^@a^@s^@a^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ 
^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@ ^@^P^D4^D@^D5^DA^D ^@=^D0^D7^D=
^D0^DG^D5^D=^D8^DO^D ^@

How am I supposed to read this file and convert it to the correct encoding with PHP? And which encoding is it actually in? Thanks in advance!

padarom

2 Answers


Your test with $content = iconv('UTF-16', 'UTF-8', $content); is close, but the source encoding is not just UTF-16, it is specifically UTF-16LE:

<?php
    $content = file_get_contents('ru.txt');
    $content = iconv('UTF-16LE', 'UTF-8', $content);
?>
<html>
<head>
    <title>encodage</title>
    <meta charset="UTF-8">
</head>
<body>
    <?php
        echo $content;
    ?>
</body>
</html>

I can't tell whether the result is correct (I don't read Russian), but this is my output:

Полный адрес Language Sprache Langue Lingua Idioma Язык Bahasa Адрес назначения ...
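Since the question describes the file as fixed 256-byte records, one option is to cut the raw bytes into records before converting, so that no character can be split by the conversion changing the byte lengths. The following is only a minimal sketch under that assumption: read_records() is a hypothetical helper name, the demo data is built in memory rather than read from ru.txt, and the space padding mirrors what the vi dump appears to show. It needs the mbstring extension.

```php
<?php
/**
 * Hypothetical helper: cut the raw UTF-16LE bytes into fixed-size records
 * first, then convert each record to UTF-8 on its own.
 */
function read_records(string $raw, int $recordBytes = 256): array
{
    $records = [];
    foreach (str_split($raw, $recordBytes) as $record) {
        // Each record holds whole 2-byte code units, so it converts cleanly.
        $utf8 = mb_convert_encoding($record, 'UTF-8', 'UTF-16LE');
        // Drop the trailing space (or NUL) padding of the record.
        $records[] = rtrim($utf8, " \0");
    }
    return $records;
}

// Build two padded demo records instead of reading the real file.
$raw = '';
foreach (["\u{042F}\u{0437}\u{044B}\u{043A}", 'Language'] as $value) {
    $utf16 = mb_convert_encoding($value, 'UTF-16LE', 'UTF-8');
    // Pad to 256 bytes with UTF-16LE spaces (bytes 0x20 0x00).
    $raw .= str_pad($utf16, 256, " \0");
}

$records = read_records($raw);
print_r($records);
```

With the real file, $raw would come from file_get_contents('ru.txt') instead of the demo loop. Splitting on bytes is safe at this stage only because 256 is even and every UTF-16 code unit is 2 bytes; text using surrogate pairs could in principle still straddle a record boundary.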

EDIT: To find out the encoding, I simply used Tortoise: I selected two files (ru.txt and another one) and ran a file comparison, and Tortoise showed the encoding. See the screenshot:

[screenshot]

Xenofexs

It turns out the encoding was not the issue; the split afterwards was. I used str_split to convert the resulting string into an array of equal-length entries. What I had not realized is that the documentation notes the following:

str_split() will split into bytes, rather than characters when dealing with a multi-byte encoded string.

Using wc -c and wc -m I figured out that the character count of the resulting elements was the same, but the byte count wasn't. So at some points str_split cut a character in the middle of its byte sequence.
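The effect can be reproduced on a small scale. This is a hedged illustration with a made-up sample string and chunk size, not the data from the file: each Cyrillic letter takes 2 bytes in UTF-8, so cutting every 3 bytes lands inside a letter. It assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical sample: the four Cyrillic letters of "Язык", written as
// escapes so the example does not depend on the source file's encoding.
$text = "\u{042F}\u{0437}\u{044B}\u{043A}";

$bytes = strlen($text);             // counts bytes
$chars = mb_strlen($text, 'UTF-8'); // counts characters
echo "$chars characters, $bytes bytes\n";

// str_split() cuts every 3 bytes, regardless of character boundaries.
$chunks = str_split($text, 3);

foreach ($chunks as $i => $chunk) {
    // A chunk that starts or ends inside a character is not valid UTF-8.
    $state = mb_check_encoding($chunk, 'UTF-8') ? 'valid' : 'broken';
    echo "chunk $i: " . strlen($chunk) . " bytes, $state UTF-8\n";
}
```

Chunks that fail mb_check_encoding() are the ones that show up as garbage when printed.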

I have not found any built-in function that splits a multibyte string by characters, so I used a function similar to the one posted here.
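A character-aware split along those lines might look roughly like the sketch below. It is not the linked function; split_by_chars() is a hypothetical name, and the sample string and length are made up. On PHP 7.4 and later the built-in mb_str_split() does this directly, so the loop only serves as a fallback for older versions.

```php
<?php
/**
 * Hypothetical helper (not from the original post): split a UTF-8 string
 * into pieces of $length characters rather than $length bytes.
 */
function split_by_chars(string $text, int $length): array
{
    if (function_exists('mb_str_split')) {
        // Built in since PHP 7.4.
        return mb_str_split($text, $length, 'UTF-8');
    }

    // Fallback: walk the string by character offsets with mb_substr().
    $pieces = [];
    $total = mb_strlen($text, 'UTF-8');
    for ($offset = 0; $offset < $total; $offset += $length) {
        $pieces[] = mb_substr($text, $offset, $length, 'UTF-8');
    }
    return $pieces;
}

// "Язык Bahasa" is 11 characters but 15 bytes in UTF-8.
$pieces = split_by_chars("\u{042F}\u{0437}\u{044B}\u{043A} Bahasa", 4);
print_r($pieces);
```

Every piece returned this way is valid UTF-8 on its own, at the cost of the pieces no longer having equal byte lengths.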

padarom
  • I don't understand why my answer isn't good for you. Is the result not correct? And for splitting the string, can't you write your own function to parse the string and cut it into an array? – Xenofexs Apr 21 '16 at 13:09
  • I never said your answer isn't good. What I'm saying is that the problem had nothing to do with the encoding, but rather with the split that followed. You couldn't have known that, though, as I didn't describe how I split the text because I thought it wasn't important. – padarom Apr 21 '16 at 13:12