
Update:

While preparing a bug report for the great people who make PHP 7 possible, I went over my research once more and tried to boil it down to a few simple lines of code. In doing so I found that PHP itself is not the cause of the problem. I will share my results here when I'm done. Just so you know and don't waste your time :)


Synopsis: PHP7 now seems able to write UTF-8 filenames but is unable to access them?

Preamble: I read about 10-15 articles here touching on the subject, but they did not help me solve the problem and they are all older than the PHP 7 release. It seems to me that this is probably a new issue and I wonder if it might be a bug. I spent a lot of time experimenting with en-/decoding of the strings and trying to figure out a way to make it work - to no avail.

Good day everybody and greetings from Germany (insert shy not-my-native-language-remark here), I hope you can help me out with this new phenomenon I encountered. It seems to be "new" in the sense that it came with PHP 7.

I think most people working with PHP on a Windows system are very familiar with the problem of filenames and the transparent wrapper of PHP that manages access to files that have non-ASCII filenames (or windows-1252 or whatever is the system code page).

I'm not quite sure how to approach the subject and as you can see I'm not very experienced in composing questions so please don't rip my head off instantly. And yes I will strive to keep it short. Here we go:

First symptom: after updating to PHP7 I sometimes encountered problems with accessing files generated by my software. Sometimes it worked as usual, sometimes not. I found out the difference was that PHP7 now seems able to write UTF-8 filenames but is unable to access files with those names.

After generating said files on two separate "identical" systems (differing only in the PHP version) this is how the files are named on the hard drive:

PHP 5.5: Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip

PHP 7: Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip

Splendid, PHP 7 is capable of writing Unicode filenames to the hard drive (Windows uses UTF-16 for them, afaik). The downside is that when I try to access those files, for example with is_file(), PHP 5.5 works but PHP 7 does not.
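The garbled name that PHP 5.5 leaves on disk looks like what you get when the UTF-8 bytes of the name are read as a single-byte code page. A minimal sketch of that effect, assuming the system code page is windows-1252 and the mbstring extension is loaded (the variable names are only illustrative):

```php
<?php
// "ü" is written as an escape so the example does not depend on
// the encoding of this source file (needs PHP 7 or later).
$utf8 = "Kr\u{00FC}mhold";

// Treat the UTF-8 bytes as if they were windows-1252 text and
// convert that "text" to UTF-8 so it can be displayed.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

print bin2hex($utf8) . "\n"; // raw bytes of the UTF-8 name
print $misread . "\n";       // "ü" shows up as the two characters "Ã¼"
```

The same misreading applied to the CJK characters gives three garbled characters per original character, which matches the pattern in the PHP 5.5 filename.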

Consider this code snippet (note: I "hacked" into this function because it was the simplest way, it was not written for this purpose). This function gets called after a zip-file gets generated taking on the name of the customer and other values to determine a proper name. Those come out of the database. Database and internal encoding of PHP are both UTF-8. clearstatcache is per se not necessary but I included it to make things clearer. Important: Everything that happens is done with PHP7, no other entity is responsible for creating the zip-file. To be precise it is done with class ZipArchive. Actually it does not even matter that it is a zip-archive, the point is that the filename and the content of the file are created by PHP7 - successfully.

public static function downloadFileAsStream( $file )
{
    clearstatcache();
    print $file . "<br/>";
    var_dump(is_file($file));
    die();
}       

Output is:

D:/htdocs/otm/.data/_tmp/Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip
bool(false) 

So PHP7 is able to generate the files - they indeed DO exist on the hard drive and are legit and accessible and all - but it is incapable of accessing them. is_file() is not the only function that fails; file_exists() does too, for example.

A little experiment with encoding conversion to give you a taste of the things I tried:

public static function downloadFileAsStream( $file )
{
    clearstatcache();
    print $file . "<br/>";
    print mb_detect_encoding($file, 'ASCII,UTF-16,windows-1252,UTF-8', false) . "<br/>";
    print mb_detect_encoding($file, 'ASCII,UTF-16,windows-1252,UTF-8', true) . "<br/>";

    if (($detectedEncoding = mb_detect_encoding($file, 'ASCII,UTF-16,windows-1252,UTF-8', true)) != 'windows-1252')
    {
        $file = mb_convert_encoding($file, 'UTF-16', $detectedEncoding);
    }

    print $file . "<br/>";
    var_dump(is_file($file));
    die();
}       

Output is:

D:/htdocs/otm/.data/_tmp/Lokaltest_KG_漢字_汉字_Krümhold-DEZ1604-140081-complete.zip
UTF-8
UTF-8
D:/htdocs/otm/.data/_tmp/Lokaltest_KG_o"[W_lI[W_Kr�mhold-DEZ1604-140081-complete.zip
NULL 

So converting from UTF-8 (database/internal encoding) to UTF-16 (windows file system) does not seem to work either.
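One likely reason the UTF-16 attempt cannot work: PHP's filesystem functions take a plain byte string, and UTF-16 text is full of NUL bytes, which PHP does not accept in paths. That would also fit is_file() returning NULL instead of false above. A small sketch, assuming mbstring is available; the path is a made-up ASCII-only example:

```php
<?php
$path  = "D:/htdocs/test.zip"; // hypothetical path, for illustration only
$utf16 = mb_convert_encoding($path, 'UTF-16', 'UTF-8');

// Every ASCII character becomes two bytes, one of which is NUL.
var_dump(strlen($path), strlen($utf16));
var_dump(strpos($utf16, "\0") !== false);
```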

I am at the end of my rope here and sadly the issue is very important to us since we cannot update our systems with this problem looming in the background. I hope somebody can shed a little light on this. Sorry for the long post, I'm not sure how well I could get my point across.


Addition:

$file = utf8_decode($file);
var_dump(is_file($file));
die();

This delivers false for the filename with the Japanese characters. When I change the input used to create the filename so that the filename is now Lokaltest_KG_Krümhold-DEZ1604-140081-complete.zip, the above code delivers true. So utf8_decode helps, but only with a small part of Unicode, such as German umlauts?
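For context: utf8_decode() converts UTF-8 to ISO-8859-1, which contains the umlauts but none of the CJK characters, and it replaces anything it cannot represent with "?". A round-trip check shows which names survive such a conversion. This is only a sketch; `survivesCodePage` is a made-up helper name and mbstring is assumed:

```php
<?php
// Hypothetical helper: true if $name can be converted to $codePage
// and back to UTF-8 without losing any characters.
function survivesCodePage(string $name, string $codePage): bool
{
    $legacy = mb_convert_encoding($name, $codePage, 'UTF-8');
    return mb_convert_encoding($legacy, 'UTF-8', $codePage) === $name;
}

var_dump(survivesCodePage("Kr\u{00FC}mhold", 'ISO-8859-1'));     // umlaut only
var_dump(survivesCodePage("KG_\u{6F22}\u{5B57}", 'ISO-8859-1')); // CJK characters
```

The first call reports true and the second false, which mirrors the behaviour seen with is_file() after utf8_decode().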

  • Take a look at [this question](http://stackoverflow.com/questions/2685718/special-characters-in-file-exists-problem-php/2685818#2685818). Try with: `$winfilename= iconv('utf-8', 'cp1252', $utffilename);`. You may consider trying `SplFileInfo` as well – Mihai Matei May 10 '16 at 12:34
  • @MateiMihai `iconv('UTF-8', 'windows-1252', $file)` delivers a false and `iconv('UTF-8', 'cp1252', $file)` does as well. SplFileInfo($file) has nothing helpful to say either :( – thomasKberlin May 10 '16 at 12:54
  • @MateiMihai `utf8_decode` builds a proper SplFileInfo object, but `$info->isFile()` also gives me a false. If I remove the Japanese chars and do the same (utf8_decode and var_dump), I get a true for is_file. So it seems there is a further notch to the problem. Some chars can be dealt with by conversion but not all? – thomasKberlin May 10 '16 at 13:05
  • Are Windows filenames UTF8? http://stackoverflow.com/questions/2050973/what-encoding-are-filenames-in-ntfs-stored-as – GordonM May 10 '16 at 13:24
  • @GordonM No, since Windows 2000 / NTFS, filenames are stored in UTF-16. The thing is that PHP (or underlying system components) should take care of the micro-management. And they did so all those years, as mentioned. But PHP 7 seems to be broken there. It creates files with exotic filenames but does not access them. – thomasKberlin May 10 '16 at 13:30

1 Answer

Answering my own question here: The actual bad boy was the component ZipArchive which created files with incorrectly encoded filenames. I have written a hopefully helpful bug report: https://bugs.php.net/bug.php?id=72200

Consider this short script:

print "php default_charset: ".ini_get('default_charset')."\n"; // just 4 info (UTF-8)

$filename = "bugtest_müller-lüdenscheid.zip"; // just an example
$filename = utf8_encode($filename); // simulating my database delivering utf8-string

$zip = new ZipArchive();
if( $zip->open($filename, ZipArchive::CREATE | ZipArchive::OVERWRITE) === true )
{
    $zip->addFile('bugtest.php', 'bugtest.php'); // copy of script file itself
    $zip->close();
}

var_dump( is_file($filename) );  // delivers ?

output:

output PHP 5.5.35:
    php default_charset: UTF-8
    bool(true)

output PHP 7.0.6:
    php default_charset: UTF-8
    bool(false)
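
To see which name a given PHP version actually put on disk, it can help to list the directory and dump the raw bytes of each entry, since printing the names alone hides encoding differences. A minimal sketch; `dumpNames` is a hypothetical helper, and the bytes that scandir() returns on Windows depend on the PHP version and the system code page:

```php
<?php
// Hypothetical helper: map each directory entry to the hex dump
// of its name exactly as PHP receives it.
function dumpNames(string $dir): array
{
    $result = [];
    foreach (scandir($dir) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue; // skip the directory self/parent entries
        }
        $result[$entry] = bin2hex($entry);
    }
    return $result;
}

print_r(dumpNames(__DIR__));
```

Comparing the hex dump of the created file with `bin2hex($filename)` from the script shows whether the name on disk and the name passed to is_file() are the same byte sequence.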
  • Just in case anyone would need it, here is a PHP script which scans dirs and mass-converts file names into some other encoding: https://github.com/chang-zhao/encoding/ I used it to convert thousands of TXT files from "gibberish utf8" of old PHP into new readable ones. – chang zhao Nov 16 '18 at 20:43