I cannot read Cyrillic characters in PHP from a .txt file with an unknown encoding. I have tried almost everything I could find on the web. Which PHP function do I need to use to get the contents of this file?

https://www.dropbox.com/s/w7cex4wiogyytvm/100004-6.txt

EDIT

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    debug($string);

Output: `debug()` breaks, and saving the value to the database fails (the BOM causes trouble and the value cannot be saved).

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    $string = mb_convert_encoding ($string , 'utf-8');
    debug($string);

Output:

    '????? ???:300/500V
    ???? ???:2000V
    ????? ???? ??????: ? +70??
    ?? ??? ?? (????? 5 ??.): ? +160??
    ????? ?????? ?? ?????: ? +5??   '

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    $string = iconv("UTF-16", "UTF-8//TRANSLIT//IGNORE", $string);
    debug($string);

Output:

췮㌰〯㔰ざഊ죱㈰〰嘍્⃰⃲㨠‫㜰냑ഊ쿰⃱밠⣭㔠⤺⃤⬱㘰냑ഊ췠볭

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    $string = iconv("ISO-8859-5", "UTF-8//TRANSLIT//IGNORE", $string);
    debug($string);

Output:

    Эюьшэрыхэ эряюэ:300/500V
    Шёяшђхэ эряюэ:2000V
    ЭрМтшёюър №рсюђэр ђхьях№рђѓ№р: фю +70Аб
    Я№ш ъ№рђюъ ёяюМ (эрМьэюуѓ 5 ёхъ.): фю +160Аб
    ЭрМэшёър ђхьях№рђѓ№р я№ш шэёђрырішМр: фю +5Аб

Now that I have tested multiple files, I no longer think the input file is Unicode encoded. I managed to read my test file, but the file that matters (whose encoding I don't know) still does not work. I have therefore changed the question; the encoding is still unknown.

To clarify a little more: I can open this file in Notepad and it displays correctly. The Cyrillic characters it contains are what cause the problem.

Скач от
  • How are you checking what you get? With `echo`? Is your `php` file Unicode, then, and is it outputting Unicode? Maybe the error is not in `file_get_contents` but in the way you are checking your data. – dkasipovic Apr 09 '14 at 13:04
  • I'm using CakePHP, and I call debug($result), which outputs something similar to var_dump. I tested the code and I can read any other content, but when I try a txt file saved as Unicode it breaks. – Скач от Apr 09 '14 at 13:21
  • Breaks or gets wrong characters? – dkasipovic Apr 09 '14 at 13:22
  • If I don't do anything to the string, it breaks. If I use mb_convert_encoding, it returns question marks instead of Cyrillic characters, and if I use iconv, it returns the wrong characters. – Скач от Apr 09 '14 at 13:23
  • Can you post examples of what you have tried: which one breaks, which one gives wrong characters, etc.? – dkasipovic Apr 09 '14 at 13:24
  • Question edited. I also replaced the link with the original file that I need to decode; as it turns out, it is different from my test file. – Скач от Apr 09 '14 at 13:44

1 Answer

The file is encoded in CP1251 a.k.a. MS-CYRL a.k.a. "Cyrillic (Windows)".

    $string = file_get_contents($path);
    $string = iconv('CP1251', 'UTF-8', $string);

How did I figure this out? I opened the file in a text editor and tried a few relevant encodings until it looked right. There's hardly anything else you can do if the file encoding is unknown.
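
The same trial-and-error approach can be roughly sketched in PHP itself: decode the raw bytes under a handful of candidate encodings and pick the result that reads correctly. This is only an illustration. The helper name `previewEncodings`, the candidate list, and the sample bytes are assumptions and do not come from the question's file, and it assumes the local iconv build knows these encoding names.

```php
<?php
// Hypothetical helper (not a built-in): decode the same bytes under
// several candidate encodings so a human can judge which one looks right.
function previewEncodings(string $bytes, array $candidates): array
{
    $previews = [];
    foreach ($candidates as $encoding) {
        // iconv() returns false and raises a notice on invalid input
        // or an unknown encoding name, hence the @ and the explicit check.
        $decoded = @iconv($encoding, 'UTF-8', $bytes);
        $previews[$encoding] = ($decoded === false) ? null : $decoded;
    }
    return $previews;
}

// Assumed sample data: the word "Привет" as CP1251 stores it.
// In practice this would be the result of file_get_contents($path).
$bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Assumed shortlist of common single-byte Cyrillic encodings.
$candidates = ['CP1251', 'KOI8-R', 'ISO-8859-5', 'CP866'];

foreach (previewEncodings($bytes, $candidates) as $name => $text) {
    echo $name, ': ', $text ?? '(conversion failed)', PHP_EOL;
}
```

Only one line of the output should read as sensible text; the others are valid conversions of the wrong guess. That is why a successful `iconv` call alone does not prove the encoding is correct.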

deceze