
I am being sent a CSV file that is tab-delimited. Here is a sample of what I see:

Invoice: Invoice Date   Account: Name   Bill To: First Name Bill To: Last Name  Bill To: Work Email Rate Plan Charge: Name  Subscription: Device Serial Number
2021-03-10  Test Company    Wally   Kolcz   test@test.com   Sample plan A0H1234567890A

I wrote a script to open, read, and loop over the values, but the data I get back looks like garbage:

if (($handle = fopen($user_file, "r")) !== FALSE) {
    $line = 0;
    while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {
        if ($line > 1 && isset($data[1])) {

            $user = [
                'EmailAddress' => $data[4],
                'Name' => $data[2].' '.$data[3],
            ];
        }

        $line++;
    }
    fclose($handle);
}

Here is what I get when I dump the first line.

array:7 [
  0 => b"ÿþI\x00n\x00v\x00o\x00i\x00c\x00e\x00:\x00 \x00I\x00n\x00v\x00o\x00i\x00c\x00e\x00 \x00D\x00a\x00t\x00e\x00"
  1 => "\x00A\x00c\x00c\x00o\x00u\x00n\x00t\x00:\x00 \x00N\x00a\x00m\x00e\x00"
  2 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00F\x00i\x00r\x00s\x00t\x00 \x00N\x00a\x00m\x00e\x00"
  3 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00L\x00a\x00s\x00t\x00 \x00N\x00a\x00m\x00e\x00"
  4 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00W\x00o\x00r\x00k\x00 \x00E\x00m\x00a\x00i\x00l\x00"
  5 => "\x00R\x00a\x00t\x00e\x00 \x00P\x00l\x00a\x00n\x00 \x00C\x00h\x00a\x00r\x00g\x00e\x00:\x00 \x00N\x00a\x00m\x00e\x00"
  6 => "\x00S\x00u\x00b\x00s\x00c\x00r\x00i\x00p\x00t\x00i\x00o\x00n\x00:\x00 \x00D\x00e\x00v\x00i\x00c\x00e\x00 \x00S\x00e\x00r\x00i\x00a\x00l\x00 \x00N\x00u\x00m\x00b\x00e\x00r\x00"
]

I tried adding:

header('Content-Type: text/html; charset=UTF-8');
$data = array_map("utf8_encode", $data);
setlocale(LC_ALL, 'en_US.UTF-8');

And when I dump mb_detect_encoding($data[2]), I get 'ASCII'...

Any way to fix this so I don't have to manually update the file each time I receive it? Thanks!

Wally Kolcz

3 Answers


Looks like the file is in UTF-16 (every other byte is null).

You probably need to convert the whole file with something like mb_convert_encoding($data, "UTF-8", "UTF-16");

But you can't really use fgetcsv() in that case…
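A minimal sketch of that whole-file approach, assuming the path is in $user_file (from the question) and the data really is little-endian UTF-16 as the dump suggests:

$raw = file_get_contents($user_file);
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

// the UTF-16 BOM becomes the UTF-8 BOM (EF BB BF); strip it so the first
// header field does not begin with an invisible character
$utf8 = preg_replace('/^\xEF\xBB\xBF/', '', $utf8);

foreach (preg_split('/\R/', $utf8) as $i => $line) {
    if ($i === 0 || trim($line) === '') {
        continue; // skip the header row and any trailing blank line
    }
    $fields = str_getcsv($line, "\t"); // the file is tab-delimited
    // ... build $user from $fields ...
}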

Andrea

As @Andrea already mentioned, your data is encoded as UTF-16LE and you need to convert it to an encoding compatible with what you want to do. That said, it is possible to do this in-flight with PHP stream filters.

abstract class TranslateCharset extends php_user_filter {

    protected $in_charset, $out_charset;
    private $buffer = '';
    private $total_consumed = 0;

    public function filter($in, $out, &$consumed, $closing) {
        $output = '';

        while ($bucket = stream_bucket_make_writeable($in)) {
            $input = $this->buffer . $bucket->data;
            for( $i=0, $p=0; ($c=mb_substr($input, $i, 1, $this->in_charset)) !== ""; ++$i, $p+=strlen($c) ) {
                $output .= mb_convert_encoding($c, $this->out_charset, $this->in_charset);
            }
            $this->buffer = substr($input, $p);
            $consumed += $p;
        }

        // this means that there's unconverted data at the end of the bucket brigade.
        if( $closing && strlen($this->buffer) > 0 ) {
            $this->raise_error( sprintf(
                "Likely encoding error at offset %d in input stream, subsequent data may be malformed or missing.",
                $this->total_consumed + $consumed)
            );
            $consumed += strlen($this->buffer);
            // give it the ol' college try
            $output .= mb_convert_encoding($this->buffer, $this->out_charset, $this->in_charset);
        }

        $this->total_consumed += $consumed;

        if ( ! isset($bucket) ) {
            $bucket = stream_bucket_new($this->stream, $output);
        } else {
            $bucket->data = $output;
        }
        stream_bucket_append($out, $bucket);
        return PSFS_PASS_ON;
    }

    protected function raise_error($message) {
        user_error( sprintf(
            "%s[%s]: %s",
            __CLASS__, get_class($this), $message
        ), E_USER_WARNING);
    }

}

class UTF16LEtoUTF8 extends TranslateCharset {
    protected $in_charset = 'UTF-16LE';
    protected $out_charset = 'UTF-8';
}

stream_filter_register('UTF16LEtoUTF8', 'UTF16LEtoUTF8');

// properly-encoded UTF-16LE example input "Invoice:,a", including the 0xFF 0xFE BOM
$in = "\xFF\xFEI\x00n\x00v\x00o\x00i\x00c\x00e\x00:\x00,\x00a\x00";

// prep example pipe, in practice this would simply be your fopen() call.
$fh = fopen('php://memory', 'w+b');
fwrite($fh, $in);
rewind($fh);

// skip BOM
fseek($fh, 2);
stream_filter_append($fh, 'UTF16LEtoUTF8', STREAM_FILTER_READ);

var_dump(fgetcsv($fh, 4096));

Output:

array(2) {
  [0]=>
  string(8) "Invoice:"
  [1]=>
  string(1) "a"
}

In practice there is no "magic bullet" for detecting the encoding of an input file or string. In this case there is a Byte Order Mark [BOM] of 0xFF 0xFE that marks the data as UTF-16LE, but a BOM is frequently omitted, is not required or even defined for most encodings, and the same bytes can occur naturally at the start of an arbitrary string, so at best it is a hint.
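When a BOM is present, sniffing it is straightforward. A minimal sketch (the function name and $path parameter are mine, not part of any standard API); it only distinguishes the two UTF-16 byte orders:

function sniff_utf16_bom(string $path): ?string {
    // read only the first two bytes of the file
    $bom = file_get_contents($path, false, null, 0, 2);
    if ($bom === "\xFF\xFE") {
        return 'UTF-16LE';
    }
    if ($bom === "\xFE\xFF") {
        return 'UTF-16BE';
    }
    return null; // no recognizable BOM; the encoding must be known some other way
}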

That last bit is exactly why everyone should avoid utf8_encode() and utf8_decode() like the plague: they simply assume you only ever want to convert between UTF-8 and ISO-8859-1 [Western European], and they make no effort to avoid corrupting your data when used incorrectly, because they cannot possibly know any better.
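To make that concrete, here's a small hypothetical demo: utf8_encode() leaves the asker's UTF-16LE bytes essentially untouched (every byte below 0x80 passes through, NULs included), while naming the real source encoding to mb_convert_encoding() produces readable text. (As of PHP 8.2, utf8_encode() and utf8_decode() are deprecated anyway.)

// the raw bytes of "Invoice" in UTF-16LE, as in the question's dump
$raw = "I\x00n\x00v\x00o\x00i\x00c\x00e\x00";

// utf8_encode() assumes ISO-8859-1 input, so the NUL bytes survive
var_dump(utf8_encode($raw));                              // still "I\0n\0v\0o\0i\0c\0e\0"

// naming the real source encoding gives the expected result
var_dump(mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE')); // string(7) "Invoice"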

TLDR: You must explicitly know the encoding of your input data, or you're going to have a bad time.

Edit: Since I've gone and put a proper spit-shine on this, I've put it up as a Composer package in case anyone else needs something like this.

https://packagist.org/packages/wrossmann/costrenc

Sammitch
  • How was this question closed without me closing it? – Wally Kolcz Mar 16 '21 at 01:30
  • Someone with mod powers marked it as a duplicate, but I un-marked it – Sammitch Mar 16 '21 at 02:05
  • It looks like UTF-16LE to me. It may seem to be UTF-16BE if you split it by ASCII newlines first, but that is corrupting it. – Andrea Mar 16 '21 at 17:59
  • @Andrea Ahh, you're right. I'll edit the answer. – Sammitch Mar 16 '21 at 18:06
  • The bucket data from the stream may not map well onto `mb_convert_encoding`, which operates on binary strings. With a multi-byte character encoding scheme (here UTF-16, UTF-16BE or UTF-16LE), a bucket of binary data may end in the middle of a multi-byte sequence. I wonder if `mb_*` offers a more "rolling" approach to overcome such issues. For a dedicated UTF-16 stream filter, BOM detection may be reasonable, too. See as well: [convert.iconv.*](https://www.php.net/manual/en/filters.convert.php#filters.convert.iconv), sketched below. – hakre Jul 09 '21 at 11:28
  • @hakre thanks for pointing that out. I've edited in a solution that should address partial sequences at the end of a bucket, saving them to a buffer and prepending them to the next bucket's data. Just don't have an encoding error in the middle of the stream, or everything after it will be buffered until the end of the stream. – Sammitch Jul 09 '21 at 20:03
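A minimal sketch of the built-in convert.iconv.* filter mentioned in that last comment, assuming the same $user_file from the question and a UTF-16LE file that starts with a BOM:

$fh = fopen($user_file, 'rb');
fseek($fh, 2); // skip the 0xFF 0xFE BOM so it doesn't end up glued to the first header field
stream_filter_append($fh, 'convert.iconv.UTF-16LE/UTF-8', STREAM_FILTER_READ);

while (($data = fgetcsv($fh, 1000, "\t")) !== FALSE) {
    // $data is UTF-8 here
}
fclose($fh);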

I ended up with this as working code:

$f = file_get_contents($user_file);
// the file is UTF-16LE, so convert the whole thing to UTF-8 up front
$f = mb_convert_encoding($f, 'UTF8', 'UTF-16LE');
// split into lines on any newline sequence, then parse each line
$f = preg_split("/\R/", $f);
$f = array_map('str_getcsv', $f);
$line = 0;

foreach ($f as $record) {

    if ($line !== 0 && isset($record[0])) {
        // str_getcsv() used its default comma delimiter, so the whole
        // tab-delimited row is still in $record[0]; split it on tabs
        $pieces = preg_split('/[\t]/', $record[0]);

        // My work here
    }

    $line++;
}
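For what it's worth, the second preg_split() on $record[0] is only there because str_getcsv() defaults to a comma delimiter; passing "\t" splits the columns in one step. A sketch of that variant, swapping in for the array_map() line above (not tested against the real file):

$f = array_map(function ($row) { return str_getcsv($row, "\t"); }, $f);

After that, $record itself already holds the columns, so $record[2].' '.$record[3] gives the name directly.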

Thank you everyone for your examples and suggestions!

Wally Kolcz