11

I am creating a file using PHP's fwrite() and I know all my data is in UTF-8 (I have done extensive testing on this: when saving the data to the DB and outputting it on a normal web page, everything works fine and reports as UTF-8), but I am being told the file I am outputting contains non-UTF-8 data :( Is there a command in bash (CentOS) to check the encoding of a file?

When I open the file in vim, the content shows as:

Donâ~@~Yt do anything .... Itâ~@~Ys a great site with everything....Weâ~@~Yve only just launched/

Any help would be appreciated, either confirming the file is UTF-8 or explaining how to write UTF-8 content to a file.

UPDATE

To clarify how I know my data is in UTF-8, I have done the following:

  1. The DB is set to utf8.
  2. When saving data to the database I run this first:

    $enc = mb_detect_encoding($data);

    $data = mb_convert_encoding($data, "UTF-8", $enc);

  3. Just before I run fwrite I have checked the data with the following (a stricter alternative check is sketched right after this list). Note that each piece of data returns 'IS utf-8':

    if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'NOT UTF-8'; else print 'IS utf-8';
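A note on step 3: the length comparison only reveals whether the string contains multi-byte sequences; it does not prove the bytes form valid UTF-8. A minimal sketch of a stricter check, assuming $data holds the string about to be written (the helper name is made up):

<?php
// Hypothetical helper: returns true only if $data is a valid UTF-8 byte sequence.
// mb_check_encoding() validates the bytes themselves, unlike the strlen()/mb_strlen()
// comparison, which only detects the presence of multi-byte characters.
function looksLikeUtf8($data) {
    return mb_check_encoding($data, 'UTF-8');
}

if (looksLikeUtf8($data)) print 'IS utf-8'; else print 'NOT UTF-8';
?>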

Thanks!

Lizard
  • See [this question](http://stackoverflow.com/q/6285756/367456) for a PHP function that checks a string for valid UTF-8 bit by bit. It's called `can_be_valid_utf8_statemachine()`. It's at least more accurate in its result than your strlen comparison approach. – hakre Jun 13 '11 at 22:29
  • **Before** you encode something to UTF-8 you should check whether it is already UTF-8, because after encoding to UTF-8 it will always be UTF-8, so you just cannot check that later. – hakre Jun 13 '11 at 22:40

9 Answers

26

If you know the data is in UTF-8, then you want to set up the header.

I wrote a solution in answer to another thread.

The solution is the following: as the UTF-8 byte-order mark is \xEF\xBB\xBF, we should add it at the start of the document.

<?php
function writeStringToFile($file, $string){
    $f = fopen($file, "wb");
    $string = "\xEF\xBB\xBF" . $string; // this is what makes the magic: prepend the BOM to the content
    fputs($f, $string);
    fclose($f);
}
?>

You can adapt it to your code; basically you just want to make sure that you write a UTF-8 file (as you said you know your content is UTF-8 encoded).
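As a rough illustration of adapting it (the path below is only an example, not from the question), you could write $data and then read the first three bytes back to confirm the BOM ended up in the file:

<?php
// Example only: write the (assumed UTF-8) $data to an illustrative path,
// then check that the file now starts with the three BOM bytes.
writeStringToFile('/tmp/export.txt', $data);

$firstBytes = substr(file_get_contents('/tmp/export.txt'), 0, 3);
echo $firstBytes === "\xEF\xBB\xBF" ? "BOM present" : "BOM missing";
?>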

Florin Sima
  • The UTF-8 BOM means 0xEF,0xBB,0xBF, which is precisely what I suggested. Now you obviously can create a file as UTF-8 by changing settings in your IDE, but the same thing can be achieved using just PHP. – Florin Sima Sep 10 '12 at 15:30
  • Also for the record, and for the benefit of those who may stumble on this solution, this is not the line that makes the magic `$string="\xEF\xBB\xBF".$string;` as @FlorinSima had stated. This line only adds a BOM to the file (UTF-8 with BOM). Rather, the line that makes the file UTF-8 is `$f=fopen($file, "wb");` – Felix Imafidon Apr 21 '17 at 11:02
  • This solution did not help me, however this one (accepted answer) worked: https://stackoverflow.com/questions/21988581/write-utf-8-characters-to-file-with-fputcsv-in-php – charelf Jun 21 '19 at 09:39
  • any universal way to set the header based on knowing the charset's name? – Fanky Jan 14 '20 at 22:56
6

fwrite() on its own is not guaranteed to be binary safe. That means that your data - be it correctly encoded or not - might get mangled by this command or its underlying routines.

To be on the safe side, you should use fopen() with the binary mode flag, which is b. Afterwards, fwrite() will save your string data "as-is", which in PHP means as binary data, because strings in PHP are binary strings.
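A minimal sketch of that pattern (the path and the $data variable are placeholders, not taken from the question):

<?php
// Open the file in binary write mode ("wb") so the bytes are written untouched.
$f = fopen('/tmp/output.txt', 'wb');
if ($f === false) {
    die('could not open file for writing');
}
fwrite($f, $data); // $data is assumed to already be UTF-8 encoded
fclose($f);
?>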

Background: some systems distinguish between text and binary data. On such systems the binary flag explicitly tells PHP to use binary output. When you deal with UTF-8 you should take care that the data does not get mangled; that is prevented by handling the string data as binary data.

However, if the UTF-8 encoding of the data is not preserved as you said in your question, then your encoding got broken earlier, and even binary-safe handling will keep it broken. With the binary flag you still ensure that it is not the fwrite() part of your application that is breaking things.

It has rightfully been written in another answer here that you do not know the encoding if you only have the data. However, you can check whether data validates as UTF-8 or not, which gives you at least some chance to test the encoding. I have posted a PHP function which does this in a UTF-8 related question, so it might be of use to you if you need to debug things: Answer to: SimpleXML and Chinese - look for can_be_valid_utf8_statemachine, that's the name of the function.
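That function is not reproduced here, but as a rough stand-in (not the linked code), PCRE's u modifier can serve the same purpose, because preg_match() refuses subjects that are not valid UTF-8 when that modifier is used:

<?php
// An empty pattern with the u modifier works as a cheap UTF-8 validity probe:
// preg_match() returns 1 for valid UTF-8 input and false for invalid byte sequences.
$isValidUtf8 = (preg_match('//u', $data) === 1);
var_dump($isValidUtf8);
?>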

hakre
  • This doesn't really address the question. He's not using any encoding that could be adversely affected by changing the new line characters. In fact, most of the commonly used encodings are either ASCII compatible or at least preserve the ASCII characters until a code point well above the C0 block. – Artefacto Jun 13 '11 at 22:30
  • The binary safeness of a function _is_ important when you deal with encodings. `fwrite()` most certainly is not the source of the problem, but IMHO it is worth noting in the context of the question, as the OP is unsure whether fwrite is a source of error. However, I'm with you that I do not believe it is actually the source of the error. Therefore I left some hints on how to do a better check of whether the string data is actually UTF-8 encoded, or at least could be. – hakre Jun 13 '11 at 22:35
3
// add BOM to fix UTF-8 in Excel
fputs($fp, chr(0xEF) . chr(0xBB) . chr(0xBF));

I find this piece works for me :)
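A fuller sketch of that approach for a CSV meant for Excel (the path, header row and data row are made up for illustration):

<?php
// Illustrative example: write a small CSV with a UTF-8 BOM so Excel detects the encoding.
$rows = array(
    array('name', 'comment'),
    array('Lizard', "Don’t do anything"),
);

$fp = fopen('/tmp/export.csv', 'wb');
fputs($fp, chr(0xEF) . chr(0xBB) . chr(0xBF)); // BOM first, before any data
foreach ($rows as $row) {
    fputcsv($fp, $row);
}
fclose($fp);
?>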

Du Peng
2
$handle = fopen($file, "w");
fwrite($handle, pack("CCC", 0xef, 0xbb, 0xbf)); // write the UTF-8 BOM first
fwrite($handle, $data); // then write the UTF-8 encoded content
fclose($handle);
steffanjj
2

The problem is that your data is double-encoded. I assume your original text is something like:

Don’t do anything

with ’, i.e., not the straight apostrophe, but the right single quotation mark (U+2019).

If you write a PHP script with this content, and the script file itself is encoded in UTF-8:

<?php
//File in UTF-8
echo utf8_encode("Don’t"); //this will double encode

You will get something similar to your output.
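As a rough illustration of the mechanics (this is not code from the answer): ’ is U+2019 and occupies the three bytes E2 80 99 in UTF-8. If those bytes are treated as single-byte ISO-8859-1 characters and encoded to UTF-8 a second time, you get the kind of garbage vim is showing, and one unwanted pass can usually be reversed by converting back:

<?php
$original = "Don’t";                 // correctly UTF-8 encoded source text
echo bin2hex("’");                   // e28099 - the UTF-8 bytes of U+2019

$double = utf8_encode($original);    // re-encodes the UTF-8 bytes as if they were ISO-8859-1
echo $double;                        // "Donâ..." style mojibake (the extra bytes become â plus control characters)

// Reverse exactly one unwanted encoding pass (only safe if it really happened exactly once).
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
echo $repaired;                      // back to "Don’t"
?>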

Artefacto
  • I haven't got utf8_encode anywhere, and when I do add it, things get even worse. – Lizard Jun 13 '11 at 22:31
  • There is no such thing as double encoding. There is always one encoding; you cannot double the encoding of a string ;) – hakre Jun 13 '11 at 22:36
  • @hakre Sure, if we want to be exact, I meant an ASCII/ISO-8859-1/whatever to UTF-8 conversion was applied to data that was already encoded in UTF-8. – Artefacto Jun 13 '11 at 23:58
  • @Lizard I never said you had `utf8_encode`. I was just showing what kind of corruption you were getting. Namely, something is converting your data to UTF-8 when it's already in UTF-8. – Artefacto Jun 13 '11 at 23:58
0

"I know all my data is in UTF8" - wrong.
Encoding is not the format of a file. So, check the charset in the headers of the page where you are taking the data from:

header("Content-type: text/html; charset=utf-8;");

And check whether the data is really in a multi-byte encoding:

if (strlen($data)==mb_strlen($data, 'UTF-8')) print 'not UTF-8';
else print 'utf-8';

OZ_
  • I know all my data is in UTF8 - have done extensive testing on this - when saving data to db and outputting on normal webpage all work fine and report as utf8. – Lizard Jun 13 '11 at 21:40
  • It doesn't mean your data is in UTF-8. Output is output, input is input. When saving data to the DB you can convert it. Also, files can't have an `encoding` property; only data can have an encoding. So, if your file contains data in a non-UTF encoding, it absolutely means that the data was in the wrong encoding. To make sure, use the code from my answer. – OZ_ Jun 13 '11 at 21:43
  • I am confident it is UTF-8. I have also tested using your code, and every line I am writing to the file says 'utf-8'; there is NO mention of 'not UTF-8'. – Lizard Jun 13 '11 at 21:46
  • Then the problem is in vim; maybe some locales were not installed. On Debian-based systems, run `dpkg-reconfigure locales` in the console and ensure these locales are selected: en_GB.UTF-8, en_US.UTF-8. – OZ_ Jun 13 '11 at 21:53
  • Using `mb_detect_encoding` is the wrong way. You should be sure the data is in UTF-8 simply because all headers were sent correctly. `mb_detect_encoding` is a useless function; don't use it. Also, if you check `if (strlen...` after that conversion, it will not work either. – OZ_ Jun 13 '11 at 21:56
  • OZ_: Your code will return `not UTF-8` for the string `A`. I'm pretty sure you have an error in your routine. – hakre Jun 13 '11 at 23:14
  • @hakre, for the `A` symbol this code should return `not UTF-8`. I'm pretty sure you don't understand how this code works. – OZ_ Jun 14 '11 at 08:52
  • Well, `A` is UTF-8. Perhaps I do not understand what your code is for, but perhaps you can explain to me why it returns `not UTF-8` for strings that are `UTF-8`. – hakre Jun 14 '11 at 09:44
0

There is one possible reason: the information you get from the database is not UTF-8 in the first place. If you are sure that is not the issue, use this; I always use it and it works:

$file = fopen('../logs/logs.txt', 'a'); // append mode
fwrite($file, PHP_EOL . "_____________________output_____________________" . PHP_EOL);
fwrite($file, print_r($value, true)); // dump $value as readable text
fclose($file);
-1

The only thing I had to do was add a UTF-8 BOM to the CSV. The data was correct, but the file reader (an external application) couldn't read the file properly without the BOM.

Lizard
-3

Try this simple method that is more useful; add it to the top of the page, before the <body> tag:

<head>
  <meta charset="utf-8">
</head>
  • Your response is not valid in this case, because the <head> and <meta> tags apply only on the client side (HTML) and the problem is on the server side (PHP), not with the received information. – Sakura Kinomoto Dec 05 '18 at 23:08