php merging txt files, issue with encoding

Question

I found this code on stackoverflow, from user @Attgun:

link: merge all files in directory to one text file

<?php

//Name of the directory containing all files to merge
$Dir = "directory";

//Name of the output file
$OutputFile = "filename.txt";

//Scan the files in the directory into an array
$Files = scandir ($Dir);

//Create a stream to the output file
$Open = fopen ($OutputFile, "w"); //Use "w" to start a new output file from zero. If you want to increment an existing file, use "a".

//Loop through the files, read their content into a string variable and write it to the file stream. Then, clean the variable.

foreach ($Files as $k => $v) {
    if ($v != "." AND $v != "..") {
        $Data = file_get_contents ($Dir."/".$v);
        fwrite ($Open, $Data);
    }
    unset ($Data);
}

//Close the file stream
fclose ($Open);
?>

The code works, but when merging, PHP seems to insert a character at the beginning of every file copied. The file encoding I am using is UCS-2 LE. I can see that character when I change the encoding to ANSI.

My problem is that I can't use any encoding other than UCS-2 LE.

Can someone help me with this problem?

Edit: I don't want to change the file encoding. I want to keep the same encoding without PHP adding another character.

Sam Onela, no mate, it is not a duplicate, because here I want to keep the current encoding (UCS-2 LE). — MimisK, Jun 28 '17 at 17:25
Those characters are probably the Unicode BOM (byte order marker). Just strip them from all files but the first one. — Álvaro González, Jun 28 '17 at 17:33
Yeah I also considered it was the BOM - You may need a solution like [this](https://stackoverflow.com/a/15423899/1575353) — Sᴀᴍ Onᴇᴌᴀ, Jun 28 '17 at 17:35
But this character doesn't exist in the files. It is created in the final file after the merging. :/ — MimisK, Jun 28 '17 at 17:37
They do exist: any Unicode-aware text editor will just process them properly. — Álvaro González, Jun 28 '17 at 17:38
This "final" file is read by a compiler (nasc), which reports an error. If I manually copy/paste into a new file, I get a successful result. — MimisK, Jun 28 '17 at 17:47
@AlexHowansky I tried your solution, but I am doing something wrong with the path. Does it need an absolute path? — MimisK, Jun 28 '17 at 17:53
@AlexHowansky I used your solution, and it almost works. On Windows the equivalent is the "type" command. It merges all the files, but it changes the encoding to UTF-8. Thank you anyway for your help. I will surely use that solution in another case! — MimisK, Jun 28 '17 at 18:10

Álvaro González · Answer 1 · 2017-06-29T08:52:31.857

Most PHP string functions are encoding-agnostic: they merely see strings as a sequence of bytes. You may append b to the mode argument of fopen() (for example "wb") to be sure that line endings are not mangled, but nothing in your code should change the actual encoding.
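
As a quick illustration of what "encoding-agnostic" means here, the following sketch shows the common string functions counting and cutting bytes rather than characters. The sample bytes are an assumption chosen for the example (a UTF-16 LE BOM followed by the letter a), not data from the question.

```php
<?php
// Illustrative bytes: FF FE (UTF-16 LE BOM) + 61 00 (the letter "a").
$data = hex2bin('fffe6100');

// strlen() counts bytes, not characters.
$length = strlen($data);                  // 4

// substr() cuts at byte offsets; here it keeps only the BOM.
$prefix = bin2hex(substr($data, 0, 2));   // "fffe"

// Concatenation simply glues the two byte sequences together,
// so the second BOM ends up in the middle of the result.
$joined = bin2hex($data . $data);         // "fffe6100fffe6100"

var_dump($length, $prefix, $joined);
```

The last value is the same effect seen in the merged file: nothing is added by PHP, the second file's own BOM is just carried along.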

UCS-2 (as well as its successor UTF-16 and some other members of the UTF family) is a special case because the Unicode standard allows two possible orders for the individual bytes that make up a multi-byte code unit (this has the fancy name of endianness). Which order a file uses is signalled by the byte order mark (BOM), a special character placed at the very beginning of the file whose byte representation depends on the encoding and its endianness.
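
To make the byte order concrete, here is a minimal sketch using pack(), which can write a 16-bit value in either order. U+FEFF is the code point of the byte order mark; the letter a (U+0061) is only an example character picked for illustration.

```php
<?php
// 'v' = unsigned 16-bit little-endian, 'n' = unsigned 16-bit big-endian.
$bomLe = bin2hex(pack('v', 0xFEFF)); // "fffe": BOM as stored in a little-endian file
$bomBe = bin2hex(pack('n', 0xFEFF)); // "feff": BOM as stored in a big-endian file

// The same two orders applied to an ordinary character.
$aLe = bin2hex(pack('v', 0x0061));   // "6100"
$aBe = bin2hex(pack('n', 0x0061));   // "0061"

var_dump($bomLe, $bomBe, $aLe, $aBe);
```

A reader that sees FF FE at the start of a file can therefore conclude that every following pair of bytes is stored low byte first.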

This prefix is what prevents raw file concatenation from working. However, it's still a pretty simple format. All that's needed is to remove the BOM from all files but the first one.

To be honest, I couldn't find what the BOM is for UCS-2 (it's an obsolete encoding and it's no longer present in most Unicode documentation), but since you have several samples you should be able to see it yourself. Assuming that it's the same as in UTF-16 LE (FF FE), you'd just need to omit two bytes, e.g.:

$Data = file_get_contents ($Dir."/".$v);
fwrite ($Open, substr($Data, 2));
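
Note that a bare substr($Data, 2) drops the first two bytes even when a file has no BOM at all. A slightly more defensive variant might look like the sketch below. It still assumes the BOM is FF FE, and the helper name stripUtf16LeBom is hypothetical, introduced only for this example.

```php
<?php
// Hypothetical helper: drops a leading FF FE only when it is actually there.
function stripUtf16LeBom(string $data): string
{
    // strncmp() is binary-safe and compares at most the first two bytes.
    if (strncmp($data, "\xFF\xFE", 2) === 0) {
        return (string) substr($data, 2);
    }
    return $data;
}

$withBom    = bin2hex(stripUtf16LeBom(hex2bin('fffe6200'))); // "6200"
$withoutBom = bin2hex(stripUtf16LeBom(hex2bin('6200')));     // "6200"

var_dump($withBom, $withoutBom);
```

In the merge loop this would be called on every file except the first, so that the output keeps exactly one BOM at the top.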

I've composed a little self-contained example. I don't have any editor that's able to handle UCS-2, so I've used UTF-16 LE. The BOM is the two bytes FF FE (you can inspect your BOM with a hexadecimal editor like hexed.it):

file_put_contents('a.txt', hex2bin('FFFE6100'));
file_put_contents('b.txt', hex2bin('FFFE6200'));

$output = fopen('all.txt', 'wb');

$first = true;
foreach (scandir(__DIR__) as $position => $file) {
    if (pathinfo($file, PATHINFO_EXTENSION)==='txt' && $file!=='all.txt') {
        $data = file_get_contents($file);
        fwrite($output, $first ? $data : substr($data, 2));
        $first = false;
    }
}
fclose($output);

var_dump(
    bin2hex(file_get_contents('a.txt')),
    bin2hex(file_get_contents('b.txt')),
    bin2hex(file_get_contents('all.txt'))
);

string(8) "fffe6100"
string(8) "fffe6200"
string(12) "fffe61006200"

As you can see, we end up with a single BOM on top and no other byte has been changed. Of course, this assumes that all your text files have the same encoding and that the encoding is exactly the one you think.
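
Since everything depends on that assumption, it may be worth checking it before merging. The sketch below looks at the first bytes of a string and reports which of three common BOMs it starts with. The helper name detectBom is hypothetical, and UTF-32 is deliberately left out (its little-endian BOM, FF FE 00 00, also starts with FF FE and would be reported as UTF-16LE here).

```php
<?php
// Hypothetical helper: names the encoding suggested by a leading BOM, or null.
function detectBom(string $data): ?string
{
    $boms = [
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16LE' => "\xFF\xFE",
        'UTF-16BE' => "\xFE\xFF",
    ];
    foreach ($boms as $name => $bom) {
        // Binary-safe comparison of the first strlen($bom) bytes.
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }
    return null; // no recognised BOM
}

var_dump(detectBom(hex2bin('fffe6100'))); // string(8) "UTF-16LE"
var_dump(detectBom(hex2bin('6100')));     // NULL
```

Calling it on file_get_contents() of each input file and stopping when the result differs from the first file's would catch a stray file in another encoding before it ends up in the merged output.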

Unfortunately this version ruins the whole file encoding. Nevertheless thank you for your effort to help me! — MimisK, Jun 28 '17 at 18:47
Then you are either applying the fix incorrectly or your initial assumption that all files share the same encoding is wrong (in fact UCS-2 is pretty outdated so it's strange someone's still using it in 2017). Trust me: PHP is not JavaScript, PHP strings are binary-safe streams. — Álvaro González, Jun 29 '17 at 08:11

score 0 · Answer 2 · answered Jun 28 '17 at 18:54

@AlexHowansky motivated me to search for another way.

The solution that seems to work without messing with the file encoding is this:

Batch file:

@echo on
copy *.txt all.txt
@pause

Now the final file keeps the encoding of the files it reads. My compiler no longer shows any error message!

MimisK

This works because the [copy command](https://ss64.com/nt/copy.html) handles files as plain text by default (vs binary) and it's smart enough to autodetect encoding by BOM. – Álvaro González Jun 29 '17 at 08:23
