How can I write a file in UTF-8 format?

Question

I have a bunch of files that are not UTF-8 encoded, and I'm converting a site to UTF-8.

I'm using a simple script for the files that I want to save as UTF-8, but the files are still saved in the old encoding:

header('Content-type: text/html; charset=utf-8');
mb_internal_encoding('UTF-8');
$fpath = "folder";
$d = dir($fpath);
while (False !== ($a = $d->read()))
{
    if ($a != '.' and $a != '..')
    {
        $npath = $fpath . '/' . $a;

        $data = file_get_contents($npath);

        file_put_contents('tempfolder/' . $a, $data);
    }
}

How can I save files in UTF-8 encoding?

score 99 · Answer 1 · edited Jul 15 '12 at 15:16


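Add the UTF-8 byte order mark (BOM) at the start of the file. Note that this only marks the file as UTF-8; it does not convert the content:

file_put_contents($myFile, "\xEF\xBB\xBF" . $content);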
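If the content may already start with a BOM, writing a second one would leave stray bytes in the file. A minimal sketch that guards against this, where withUtf8Bom() is a hypothetical helper name and the content is assumed to be UTF-8 already:

```php
<?php
// Hypothetical helper: prepend the UTF-8 BOM only when it is not already present.
function withUtf8Bom(string $content): string
{
    $bom = "\xEF\xBB\xBF";
    // strncmp() compares the first 3 bytes; 0 means the BOM is already there.
    if (strncmp($content, $bom, 3) === 0) {
        return $content;
    }
    return $bom . $content;
}
```

It could be used as file_put_contents($myFile, withUtf8Bom($content));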

edited Jul 15 '12 at 15:16

Musa


answered Jan 28 '12 at 19:24

user956584



This should be the accepted answer... short and sweet, and works! – David R. Dec 06 '17 at 09:44

There is a distinction between creating a file recognized as an UTF-8 and converting the content which goes to that file. A plain text file without special characters has the same content as UTF-8 without BOM, also parsers which might be processing your text have an encoding option. PHP uses UTF-8 itself, so if you see text OK but file does not seem to be UTF-8, chances are the text is UTF-8 and adding BOM is all you need. But, it's not converting. This problem is seen often, because PHP is lazy adding BOM, but it itself is expecting it on input. – papo Jan 06 '19 at 12:32
I had a slightly different issue than the OP, but this solved it. I didn't use file_put_contents, but instead used header to download the file immediately. The data was already in UTF-8 in the database, but it wasn't working in the CSV download. This worked great. Thank you. – Andy Borgmann Apr 09 '21 at 16:25
The problem with this solution is the BOM sits there in front of the first character in the file and can interfere with text operations such as sorting lines, cursor movement around the BOM, etc. – Kevin Berry Jan 06 '23 at 03:52

score 54 · Accepted Answer · edited Feb 21 '22 at 21:33


file_get_contents() and file_put_contents() will not magically convert encoding.

You have to convert the string explicitly; for example with iconv() or mb_convert_encoding().

Try this:

$data = file_get_contents($npath);
$data = mb_convert_encoding($data, 'UTF-8', 'OLD-ENCODING'); // 'OLD-ENCODING' is a placeholder for the actual source encoding
file_put_contents('tempfolder/' . $a, $data);

Or alternatively, with PHP's stream filters:

$fd = fopen($file, 'r');
// The filter name is convert.iconv.<from>/<to>: read OLD-ENCODING, produce UTF-8
stream_filter_append($fd, 'convert.iconv.OLD-ENCODING/UTF-8');
stream_copy_to_stream($fd, fopen($output, 'w'));
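Applied to a whole directory, as in the question, the explicit conversion might look like the sketch below. The function name convertDirToUtf8() is hypothetical, the source encoding is assumed to be known, and $dstDir is assumed to exist. Files that are already valid UTF-8 are copied unchanged, which avoids double encoding but could in rare cases misjudge a legacy file whose bytes happen to form valid UTF-8.

```php
<?php
// Sketch: convert every regular file in $srcDir to UTF-8 and write it to $dstDir.
// $from is the source encoding; it must be known in advance (an assumption here).
function convertDirToUtf8(string $srcDir, string $dstDir, string $from): int
{
    $names = scandir($srcDir);
    if ($names === false) {
        throw new RuntimeException("Cannot read directory $srcDir");
    }
    $count = 0;
    foreach ($names as $name) {
        $path = $srcDir . '/' . $name;
        if (!is_file($path)) {
            continue; // skips '.', '..' and subdirectories
        }
        $data = file_get_contents($path);
        // Leave data that is already valid UTF-8 untouched to avoid double encoding.
        if (!mb_check_encoding($data, 'UTF-8')) {
            $data = mb_convert_encoding($data, 'UTF-8', $from);
        }
        file_put_contents($dstDir . '/' . $name, $data);
        $count++;
    }
    return $count; // number of files written
}
```

For the question's layout, a call such as convertDirToUtf8('folder', 'tempfolder', 'ISO-8859-1') would be one possibility, with 'ISO-8859-1' standing in for whatever the old encoding actually is.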

edited Feb 21 '22 at 21:33

Peter Mortensen


answered Jan 29 '11 at 21:09

Arnaud Le Blanc


What is $a variable on line 3 of first example? – Jaakko Uusitalo Apr 13 '18 at 06:38
In case of using stream_filter_append: OLD-ENCODING/UTF-8 – zooks Aug 31 '19 at 14:56

score 29 · Answer 3 · edited Feb 21 '22 at 21:38


<?php
    function writeUTF8File($filename, $content) {
        $f = fopen($filename, "w");
        # Now UTF-8 - Add byte order mark
        fwrite($f, pack("CCC", 0xef, 0xbb, 0xbf));
        fwrite($f, $content);
        fclose($f);
    }
?>

edited Feb 21 '22 at 21:38

Peter Mortensen


answered Feb 28 '13 at 21:47

Alaa


I was trying to create a php download script in order to use UTF-8 for danish characters, this is what it was missing, ty – cuzzea May 29 '13 at 06:42

It also works for UTF-16, but with these bytes: fwrite($f, pack("CC", 0xff, 0xfe)); – TSr Feb 12 '16 at 18:08
@tSr you're a life saver – Aseel Ashraf Nov 22 '21 at 16:02

score 5 · Answer 4 · answered Jan 29 '11 at 21:09


iconv to the rescue: PHP's iconv() function converts a string from one character encoding to another.
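As a rough illustration of what that could look like, here is a small sketch. The wrapper name toUtf8WithIconv() is hypothetical, and the source encoding is assumed to be known by the caller:

```php
<?php
// Sketch: convert a string to UTF-8 with iconv(); $from is an assumed, known source encoding.
function toUtf8WithIconv(string $data, string $from): string
{
    $converted = iconv($from, 'UTF-8', $data);
    if ($converted === false) {
        // iconv() returns false when the conversion fails
        throw new RuntimeException("Could not convert from $from to UTF-8");
    }
    return $converted;
}
```

In the question's loop this would replace the plain copy, for example file_put_contents('tempfolder/' . $a, toUtf8WithIconv($data, 'ISO-8859-1')); where 'ISO-8859-1' is only a stand-in for the real old encoding.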

answered Jan 29 '11 at 21:09

Dennis Kreminsky


Can you elaborate? – Peter Mortensen Feb 19 '22 at 22:51

score 3 · Answer 5 · edited Feb 21 '22 at 21:35


On Unix/Linux, you can alternatively use a simple shell command to convert all files in a given directory. Note that recode rewrites the files in place:

recode L1..UTF8 dir/*

It can also be run from PHP via exec().
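When calling recode from PHP, the file names should be escaped before they reach the shell. The sketch below only builds the command line; buildRecodeCommand() is a hypothetical helper, and it assumes a Unix-like system with the recode binary on the PATH:

```php
<?php
// Sketch: build the recode command line for a list of files, escaping each argument.
// Assumes the recode binary is installed and on the PATH.
function buildRecodeCommand(array $files, string $request = 'L1..UTF8'): string
{
    $args = array_map('escapeshellarg', $files);
    return 'recode ' . escapeshellarg($request) . ' ' . implode(' ', $args);
}
```

It could then be run with exec(buildRecodeCommand(glob('dir/*')), $output, $status); where a non-zero $status indicates that the command failed. Since the files are rewritten in place, working on a copy of the directory is the safer choice.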

edited Feb 21 '22 at 21:35

Peter Mortensen


answered Jan 30 '11 at 06:01

mario


Didn't know about this command. Thanks! I use Linux even as a workstation, and all of my local servers are on Linux. And what does L1.. in the command mean? – Starmaster Jan 30 '11 at 23:15
@Starmaster: L1 is shorthand for Latin-1, the source charset. – mario Jan 31 '11 at 04:33

score 2 · Answer 6 · answered Jan 26 '16 at 16:18


// Add a BOM so that Excel recognizes the file as UTF-8
$bom = chr(0xEF) . chr(0xBB) . chr(0xBF);
fputs($fp, $bom);

I got this line from Cool
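In context, the BOM has to be written once, before any CSV data. A minimal sketch of a complete export, assuming the rows are already UTF-8 strings; writeCsvWithBom() is a hypothetical name:

```php
<?php
// Sketch: write rows to a CSV file that starts with the UTF-8 BOM.
function writeCsvWithBom(string $path, array $rows): void
{
    $fp = fopen($path, 'w');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    fwrite($fp, "\xEF\xBB\xBF"); // BOM first, before any CSV data
    foreach ($rows as $row) {
        // Separator, enclosure and escape character are passed explicitly.
        fputcsv($fp, $row, ',', '"', '\\');
    }
    fclose($fp);
}
```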

answered Jan 26 '16 at 16:18

Du Peng


This is similar to [Alaa's answer](https://stackoverflow.com/questions/4839402/how-can-i-write-a-file-in-utf-8-format/15146274#15146274). – Peter Mortensen Feb 21 '22 at 21:43

score 0 · Answer 7 · answered Feb 12 '13 at 16:17


If you want to use recode recursively, and filter for type, try this:

find . -name "*.html" -exec recode L1..UTF8 {} \;

answered Feb 12 '13 at 16:17

Aitor


score 0 · Answer 8 · edited Feb 21 '22 at 21:56


I put it all together and got an easy way to convert ANSI text files to UTF-8 without a BOM ("UTF-8 No Mark"):

function filesToUTF8($searchdir, $convdir, $filetypes) {
  // $filetypes is a comma-separated list of extensions, expanded by GLOB_BRACE
  $get_files = glob($searchdir . '*{' . $filetypes . '}', GLOB_BRACE);
  foreach($get_files as $file) {
    $filename = basename($file);
    $get_file_content = file_get_contents($file);
    $encoding = mb_detect_encoding($get_file_content, mb_detect_order(), true);
    if ($encoding === false) {
      continue; // encoding could not be detected; skip the file rather than guess
    }
    $new_file_content = iconv($encoding, "UTF-8", $get_file_content);
    file_put_contents($convdir . $filename, $new_file_content);
  }
}

Usage:

filesToUTF8('C:/Temp/', 'C:/Temp/conv_files/', 'php,txt');
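Detection depends heavily on the list of encodings it is allowed to consider, so it may be worth making that list explicit. The sketch below is one possible approach, not part of the function above; guessSourceEncoding() and its $candidates parameter are hypothetical:

```php
<?php
// Hypothetical helper: decide which encoding to convert from.
// $candidates is an assumed, caller-supplied list of plausible legacy encodings.
function guessSourceEncoding(string $data, array $candidates): ?string
{
    // Valid UTF-8 is taken at face value, so it is never converted twice.
    if (mb_check_encoding($data, 'UTF-8')) {
        return 'UTF-8';
    }
    // Strict mode rejects candidates in which $data contains invalid bytes.
    $found = mb_detect_encoding($data, $candidates, true);
    return $found === false ? null : $found;
}
```

A null result means none of the candidates fit, in which case the file is better left alone than converted on a guess.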

edited Feb 21 '22 at 21:56

Peter Mortensen


answered Oct 05 '17 at 12:26

Le Inc


What is "No Mark"? Without a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark)? – Peter Mortensen Feb 21 '22 at 21:44
Why "`php,txt`"? Shouldn't it be "`php.txt`"? – Peter Mortensen Feb 21 '22 at 21:57

score 0 · Answer 9 · edited Feb 21 '22 at 22:16

This is quite a useful question. I think my solution, on Windows 10 with PHP 7, may help people who still have trouble with UTF-8 conversion.

Here are my steps. The PHP script that contains the following function (here, utfsave.php) must itself be saved as UTF-8; an editor such as UltraEdit can do this conversion easily.

In utfsave.php, we define a function that calls fopen($filename, "wb"): the file is opened in write mode (w) and, importantly, in binary mode (b), so the bytes are written unchanged.

<?php
//
//  UTF-8 encoding:
//
// fnc001: save string as a file in UTF-8:
// The resulting file is UTF-8 only if $strContent is,
// with French accents, Chinese ideograms, etc.
//
function entSaveAsUtf8($strContent, $filename) {
  $fp = fopen($filename, "wb");
  fwrite($fp, $strContent);
  fclose($fp);
  return True;
}

//
// 1. Write a UTF-8 string on the fly into a UTF-8 file:
//
$strContent = "My string contains UTF-8 chars ie 鱼肉酒菜 for un été en France";

$filename = "utf8text.txt";

entSaveAsUtf8($strContent, $filename);


//
// 2. Convert a CP936 (ANSI/OEM, Simplified Chinese GBK) file into a UTF-8 file
//
//   CP936: <https://en.wikipedia.org/wiki/Code_page_936_(Microsoft_Windows)>
//   GBK:   <https://en.wikipedia.org/wiki/GBK_(character_encoding)> 
//
$strContent = file_get_contents("cp936gbktext.txt");
$strContent = mb_convert_encoding($strContent, "UTF-8", "CP936");


$filename = "utf8text2.txt";

entSaveAsUtf8($strContent, $filename);

?>

The content of the source file cp936gbktext.txt:

>>Get-Content cp936gbktext.txt
My string contains UTF-8 chars ie 鱼肉酒菜 for un été en France 936 (ANSI/OEM - chinois simplifié GBK)

After running utfsave.php with PHP on Windows 10, the files it creates, utf8text.txt and utf8text2.txt, are saved in UTF-8.

With this method, no BOM is required. The BOM approach is problematic because it causes trouble when, for example, sourcing an SQL file into MySQL.

It's worth noting that I could not get file_put_contents($filename, utf8_encode($mystring)); to work for this purpose. utf8_encode() only converts from ISO-8859-1, so it cannot handle CP936 input or text that is already UTF-8.
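One likely reason is that utf8_encode() treats every input byte as an ISO-8859-1 character, so text that is already UTF-8 (or in any other encoding) comes out garbled. The sketch below reproduces that effect with mb_convert_encoding(), which performs the same ISO-8859-1 to UTF-8 conversion; the variable names are illustrative only:

```php
<?php
// mb_convert_encoding(..., 'UTF-8', 'ISO-8859-1') does what utf8_encode() did:
// it treats every input byte as one ISO-8859-1 character.
$utf8 = "\xC3\xA9"; // the letter e with acute accent, already encoded as UTF-8 (2 bytes)
$twice = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
// Each of the 2 input bytes became its own 2-byte character,
// so $twice is 4 bytes long and no longer represents the original letter.
```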

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

If you don't know the encoding of the source file, you can list encodings with PHP:

print_r(mb_list_encodings());

This gives a list like this:

Array
(
  [0] => pass
  [1] => wchar
  [2] => byte2be
  [3] => byte2le
  [4] => byte4be
  [5] => byte4le
  [6] => BASE64
  [7] => UUENCODE
  [8] => HTML-ENTITIES
  [9] => Quoted-Printable
  [10] => 7bit
  [11] => 8bit
  [12] => UCS-4
  [13] => UCS-4BE
  [14] => UCS-4LE
  [15] => UCS-2
  [16] => UCS-2BE
  [17] => UCS-2LE
  [18] => UTF-32
  [19] => UTF-32BE
  [20] => UTF-32LE
  [21] => UTF-16
  [22] => UTF-16BE
  [23] => UTF-16LE
  [24] => UTF-8
  [25] => UTF-7
  [26] => UTF7-IMAP
  [27] => ASCII
  [28] => EUC-JP
  [29] => SJIS
  [30] => eucJP-win
  [31] => EUC-JP-2004
  [32] => SJIS-win
  [33] => SJIS-Mobile#DOCOMO
  [34] => SJIS-Mobile#KDDI
  [35] => SJIS-Mobile#SOFTBANK
  [36] => SJIS-mac
  [37] => SJIS-2004
  [38] => UTF-8-Mobile#DOCOMO
  [39] => UTF-8-Mobile#KDDI-A
  [40] => UTF-8-Mobile#KDDI-B
  [41] => UTF-8-Mobile#SOFTBANK
  [42] => CP932
  [43] => CP51932
  [44] => JIS
  [45] => ISO-2022-JP
  [46] => ISO-2022-JP-MS
  [47] => GB18030
  [48] => Windows-1252
  [49] => Windows-1254
  [50] => ISO-8859-1
  [51] => ISO-8859-2
  [52] => ISO-8859-3
  [53] => ISO-8859-4
  [54] => ISO-8859-5
  [55] => ISO-8859-6
  [56] => ISO-8859-7
  [57] => ISO-8859-8
  [58] => ISO-8859-9
  [59] => ISO-8859-10
  [60] => ISO-8859-13
  [61] => ISO-8859-14
  [62] => ISO-8859-15
  [63] => ISO-8859-16
  [64] => EUC-CN
  [65] => CP936
  [66] => HZ
  [67] => EUC-TW
  [68] => BIG-5
  [69] => CP950
  [70] => EUC-KR
  [71] => UHC
  [72] => ISO-2022-KR
  [73] => Windows-1251
  [74] => CP866
  [75] => KOI8-R
  [76] => KOI8-U
  [77] => ArmSCII-8
  [78] => CP850
  [79] => JIS-ms
  [80] => ISO-2022-JP-2004
  [81] => ISO-2022-JP-MOBILE#KDDI
  [82] => CP50220
  [83] => CP50220raw
  [84] => CP50221
  [85] => CP50222
)

If you cannot guess the encoding, try the candidates one by one, since mb_detect_encoding() cannot reliably do the job.
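Trying candidates one by one can be partly automated by decoding the same bytes with each candidate and comparing the results by eye. A minimal sketch, where previewDecodings() is a hypothetical helper and the candidate list is an assumption chosen by the caller:

```php
<?php
// Sketch: decode the same bytes with each candidate encoding so the results
// can be compared by eye. $candidates is an assumed list of likely encodings.
function previewDecodings(string $data, array $candidates): array
{
    $previews = [];
    foreach ($candidates as $encoding) {
        // Key: candidate name; value: $data interpreted in that encoding, as UTF-8.
        $previews[$encoding] = mb_convert_encoding($data, 'UTF-8', $encoding);
    }
    return $previews;
}
```

Printing the result of previewDecodings(file_get_contents($file), ['ISO-8859-1', 'Windows-1251', 'CP936']) on a UTF-8 terminal, for instance, usually makes the right candidate obvious, because only one of the previews reads as sensible text.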

score -7 · Answer 10 · answered Feb 28 '16 at 11:33


Open your files in Windows Notepad
Change the encoding to UTF-8
Save your file
Try again! :O)

answered Feb 28 '16 at 11:33

Kjell E. Svendsen

