
I've got CSV files in UTF-16LE encoding with a BOM. They might be pretty big, so I'd rather not read whole files into memory. How do I go about reading them?

Peter Mortensen
x-yuri

2 Answers


Read it line by line and use mb_convert_encoding():

$decoded_line = mb_convert_encoding($line, "UTF-8", "UTF-16LE");

You can choose any destination encoding, but I'm assuming you want to work with UTF-8 strings, which are the most common nowadays.

This function needs the mbstring extension to be enabled.
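A minimal sketch of guarding against a missing mbstring extension, falling back to iconv(), which ships with most PHP builds (the `$line` value here is a stand-in for one raw UTF-16LE line from a file):

```php
<?php
// $line stands in for one raw UTF-16LE line read from the file
$line = "\x61\x00"; // "a" in UTF-16LE

if (extension_loaded('mbstring')) {
    $decoded_line = mb_convert_encoding($line, 'UTF-8', 'UTF-16LE');
} else {
    // iconv() performs the same conversion and is usually available
    $decoded_line = iconv('UTF-16LE', 'UTF-8', $line);
}
// $decoded_line === "a"
```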

You can then pass the decoded line to the str_getcsv() function, which returns an array representing the current line.
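Putting the pieces together, here is a sketch of the whole loop. As discussed in the comments below, fgets() can't split UTF-16 lines, so this uses stream_get_line() with the UTF-16LE row terminator. It assumes CRLF line endings (`"\x0d\x00\x0a\x00"` in UTF-16LE; use `"\x0a\x00"` for LF-only files); `1.csv` is a placeholder name, and the sample data is only written out to make the example self-contained:

```php
<?php
// Create a small self-contained sample: BOM + two CRLF-terminated rows.
file_put_contents('1.csv',
    "\xff\xfe" . mb_convert_encoding("a,b\r\nc,d\r\n", 'UTF-16LE', 'UTF-8'));

$fh = fopen('1.csv', 'r');
fread($fh, 2); // skip the BOM (0xFF 0xFE)
// Read up to the UTF-16LE CRLF sequence; 65536 caps memory use per row.
while (($line = stream_get_line($fh, 65536, "\x0d\x00\x0a\x00")) !== false) {
    $decoded = mb_convert_encoding($line, 'UTF-8', 'UTF-16LE');
    $row = str_getcsv($decoded, ',', '"', '\\');
    printf("%s\n", implode('|', $row));
}
fclose($fh);
// Output:
// a|b
// c|d
```

The 65536 length argument is arbitrary; it only needs to exceed your longest expected row.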

Peter Mortensen
SirDarius
  • Another panda!!! Well, the problem is reading line by line; fgets() won't work with UTF-16. – Adriano Repetti Dec 18 '14 at 16:33
  • I suppose using `stream_get_line` with the literal UTF-16LE line feed character '0x00 0x0A' might work. – SirDarius Dec 18 '14 at 16:39
  • I believe your solution would be perfect if not for `stream_get_line`'s `$length` parameter. Test code for [my `1.txt` file](http://stackoverflow.com/a/27551630/52499): `$fh = fopen(PRJ_ROOT . '1.txt', 'r'); fread($fh, 2); $s = stream_get_line($fh, 1000, "\x0d\x00\x0a\x00"); printf("%s\n", to_hex($s)); $s = stream_get_line($fh, 1000, "\x0d\x00\x0a\x00"); printf("%s\n", to_hex($s)); $s = stream_get_line($fh, 1000, "\x0d\x00\x0a\x00"); var_dump($s);` – x-yuri Dec 18 '14 at 17:18
  • Well the $length parameter can be seen as a safeguard so you don't end up using more memory than you want to. The documentation does not say if -1 or 0 hold significant values, which is odd considering there are other PHP functions that allow "unlimited" lengths by the use of such values. – SirDarius Dec 18 '14 at 17:25

Here's what I came up with:

class readutf16le_filter extends php_user_filter {
    function filter($in, $out, &$consumed, $closing) {
        while ($bucket = stream_bucket_make_writeable($in)) {
            # printf("filter: %s\n", to_hex($bucket->data));  # to_hex() is a debug helper
            # strip the BOM if this bucket starts with one, then decode to UTF-8
            $bucket->data = iconv('UTF-16LE', 'UTF-8',
                strlen($bucket->data) && substr($bucket->data, 0, 2) == "\xff\xfe"
                    ? substr($bucket->data, 2)
                    : $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

stream_filter_register('readutf16le', 'readutf16le_filter');

$fh = fopen('1.txt', 'r');
stream_filter_append($fh, 'readutf16le');

$s = fgets($fh);
printf("%s\n", to_hex($s));

$s = fgets($fh);
printf("%s\n", to_hex($s));

$s = fgets($fh);
var_dump($s);

File 1.txt:

a
b

Output:

filter: ff fe 61 00 0d 00 0a 00 62 00 0d 00 0a 00
61 0d 0a
62 0d 0a
bool(false)

What I still don't like is that I see no way to detect the beginning of the file from inside the filter. However, it's unlikely to cause problems. Wikipedia says:

BOM use is optional, and, if used, should appear at the start of the text stream.

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favour of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be only used as a BOM.

For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space".

This could probably be done with a stream wrapper instead. Alternatively, one can call fread($fh, 2); to consume the BOM before appending the filter to the stream.
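A sketch of the fread() approach: consuming the BOM before the filter ever sees the stream means the filter body becomes a plain iconv() call with no BOM check. The filter below is a stripped-down variant of the one above; `1.txt` and the sample data are only written out to make the example self-contained, and it assumes the file always starts with the UTF-16LE BOM:

```php
<?php
// Variant of the answer's filter with the BOM handling removed.
class utf16le_nobom_filter extends php_user_filter {
    function filter($in, $out, &$consumed, $closing) {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $bucket->data = iconv('UTF-16LE', 'UTF-8', $bucket->data);
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}
stream_filter_register('utf16le_nobom', 'utf16le_nobom_filter');

// Self-contained sample: BOM + "a\r\nb\r\n" in UTF-16LE.
file_put_contents('1.txt',
    "\xff\xfe" . iconv('UTF-8', 'UTF-16LE', "a\r\nb\r\n"));

$fh = fopen('1.txt', 'r');
fread($fh, 2); // discard the BOM before the filter is attached
stream_filter_append($fh, 'utf16le_nobom');

$all = '';
while (($s = fgets($fh)) !== false) {
    $all .= $s; // each $s is now a UTF-8 line
}
fclose($fh);
```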

The other possible issue is that strlen($bucket->data) might theoretically be odd, splitting a UTF-16 code unit across buckets. From what I can tell, PHP buffers stream data, and a buffer size is unlikely to be odd (buffer sizes are usually powers of 2). But to accommodate such cases:

...
while ($bucket = stream_bucket_make_writeable($in)) {
    $data = strlen($bucket->data) ?
        substr($bucket->data, 0, floor(strlen($bucket->data) / 2) * 2) : '';
    $bucket->data = iconv('UTF-16LE', 'UTF-8',
        strlen($data) && substr($data, 0, 2) == "\xff\xfe"
            ? substr($data, 2)
            : $data);
    $consumed += strlen($data);
    stream_bucket_append($out, $bucket);
    ...

I don't know how to reproduce this though.

Peter Mortensen
x-yuri