I'm writing a file manager and need to scan directories and deal with renaming files that may have multibyte characters. I'm working on it locally on Windows/Apache PHP 5.3.8, with the following file names in a directory:

  • filename.jpg
  • имяфайла.jpg
  • file件name.jpg
  • פילענאַמע.jpg
  • 文件名.jpg

Testing on a live UNIX server worked fine. Testing locally on Windows, glob('./path/*') returns only the first one, filename.jpg.

Using scandir(), the correct number of files is returned at least, but I get names like ?????????.jpg (note: those are regular question marks, not the � character).

I'll end up needing to write a "search" feature to search recursively through the entire tree for filenames matching a pattern or with a certain file extension, and I assumed glob() would be the right tool for that, rather than scan all the files and do the pattern matching and array building in the application code. I'm open to alternate suggestions if need be.
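For the recursive search described above, one option that avoids hand-rolled recursion is SPL's RecursiveDirectoryIterator. The sketch below is only an illustration: the function name find_by_extension is hypothetical, and because it goes through the same PHP filesystem layer as scandir() and glob(), it should not be expected to handle multibyte names any better on an affected Windows setup.

```php
<?php
/**
 * Recursively collect files under $root whose extension equals $ext
 * (case-insensitive). Hypothetical helper for illustration only.
 *
 * @param string $root Directory to start from.
 * @param string $ext  Extension without the dot, e.g. "jpg".
 * @return array Sorted list of matching path names.
 */
function find_by_extension($root, $ext)
{
    $matches = array();

    // SKIP_DOTS leaves out the "." and ".." entries.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        // $file is an SplFileInfo; getExtension() is available from PHP 5.3.6.
        if ($file->isFile() && strcasecmp($file->getExtension(), $ext) === 0) {
            $matches[] = $file->getPathname();
        }
    }

    sort($matches);
    return $matches;
}
```

Calling find_by_extension('./uploads', 'jpg') would return every .jpg or .JPG file below ./uploads, including those in subdirectories.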

Assuming this was a common problem, I immediately searched Google and Stack Overflow and found nothing even related. Is this a Windows issue? PHP shortcoming? What's the solution: is there anything I can do?

Addendum: Not sure how related this is, but file_exists() is also returning FALSE for these files when I pass in the full absolute path (the PHP file itself is saved in Notepad++ as UTF-8 without a BOM). I'm certain the path is correct, as neighboring files without multibyte characters return TRUE.

EDIT: glob() can find a file named filename-äöü.jpg. Previously I had AddDefaultCharset utf-8 in my .htaccess file, which I hadn't considered, and filename-äöü.jpg was printing as filename-���.jpg. The only apparent effect of removing that line is that the file name now prints normally.

I've deleted the .htaccess file completely, and this is my actual test script in its entirety (I changed a couple of file names from the original post):

print_r(scandir('./uploads/')); 
print_r(glob('./uploads/*'));

Output locally on Windows:

Array
(
    [0] => .
    [1] => ..
    [2] => ??? ?????.jpg
    [3] => ???.jpg
    [4] => ?????????.jpg
    [5] => filename-äöü.jpg
    [6] => filename.jpg
    [7] => test?test.jpg
)
Array
(
    [0] => ./uploads/filename-äöü.jpg
    [1] => ./uploads/filename.jpg
)

Output on remote UNIX server:

Array
(
    [0] => .
    [1] => ..
    [2] => filename-äöü.jpg
    [3] => filename.jpg
    [4] => test이test.jpg
    [5] => имя файла.jpg
    [6] => פילענאַמע.jpg
    [7] => 文件名.jpg
)
Array
(
    [0] => ./uploads/filename-äöü.jpg
    [1] => ./uploads/filename.jpg
    [2] => ./uploads/test이test.jpg
    [3] => ./uploads/имя файла.jpg
    [4] => ./uploads/פילענאַמע.jpg
    [5] => ./uploads/文件名.jpg
)

Since this is a different server, the configuration could differ regardless of platform, so I'm not sure what to think, and I can't fully pin it on Windows yet (it could be my PHP installation, ini settings, or Apache config). Any ideas?

Wesley Murch
  • Are you doing a `glob()` with a `*` mask? Re the `???????`, are you sure that isn't just a character set mismatch (between the file system's charset and your output charset)? – Pekka Mar 11 '12 at 22:40
  • @Pekka: yes, added the pattern to the post, no flags. – Wesley Murch Mar 11 '12 at 22:41
  • Ugh, that's really surprising behaviour. :( are you 100% sure only 1 element is returned? Did you do a `print_r()` on the raw `glob()` result? Remember, functions like `json_encode()` tend to silently drop stuff with invalid characters in them – Pekka Mar 11 '12 at 22:42
  • Yes I did a `var_dump()` and there's only 1 item in the array. Straight raw PHP with no funny business. PHP 5.3.8 by the way. – Wesley Murch Mar 11 '12 at 22:44
  • Not being helpful here; it works in Linux. But when run via `wine php.exe` I only get two out of three multibyte filenames, with the UTF-8 bytes misdecoded as `��`. So I would bet on charset issues as well. But have you tried `GlobIterator` instead? – mario Mar 11 '12 at 22:47
  • Selfishly adding `utf-8` tag so I'm sure to find this again in the future. – Pekka Mar 11 '12 at 22:48
  • @mario: I'm trying `GlobIterator` but can't get it to run even with copy/paste examples from the manual. "Uncaught exception 'LogicException' with message 'The parent constructor was not called: the object is in an invalid state" Thanks for the suggestion, I'll work on that. – Wesley Murch Mar 11 '12 at 22:52
  • This doesn't answer your question, and I'm not sure it will even solve this particular issue, but if you're open to using external libraries, the [Symfony Finder Component](http://symfony.com/doc/current/components/finder.html) would probably be a good fit for your end goal. – prodigitalson Mar 11 '12 at 23:14
  • Check the doc: http://php.net/manual/en/function.glob.php – Jon Egeland Mar 12 '12 at 01:11
  • @Jon: Thanks but I already have and there's nothing there. It's starting to become more apparent that this is an OS or configuration related problem that probably runs deeper than just `glob()`. Maybe someone using Windows and Apache can confirm these results? – Wesley Murch Mar 12 '12 at 01:15
  • `GlobIterator` does not behave differently either... – Wesley Murch Mar 17 '12 at 05:03
  • What a fascinating issue. Have you tried out `DirectoryIterator`? I'll set up a test case when I get home from work and give it a shot if you haven't. – MetalFrog Mar 30 '12 at 19:53
  • @MetalFrog: I have not tried `DirectoryIterator`. Did you check out the article linked in the answer I just accepted? – Wesley Murch Mar 30 '12 at 20:04
  • +1 from me for the question. I have been looking for an answer to this for a long time and it is still not resolved for me. – Smile Jul 18 '13 at 05:31

4 Answers

It looks like the glob() function depends on how your copy of PHP was built and whether it was compiled against the Unicode-aware Win32 API (I don't believe the standard build is).

Cf. http://www.rooftopsolutions.nl/blog/filesystem-encoding-and-php

Excerpt from comments on the article:

Philippe Verdy 2010-09-26 8:53 am

The output from your PHP installation on Windows is easy to explain : you installed the wrong version of PHP, and used a version not compiled to use the Unicode version of the Win32 API. For this reason, the filesystem calls used by PHP will use the legacy "ANSI" API and so the C/C++ libraries linked with this version of PHP will first try to convert your UTF-8-encoded PHP string into the local "ANSI" codepage selected in the running environment (see the CHCP command before starting PHP from a command line window).

Your version of Windows is MOST PROBABLY NOT responsible of this weird thing. Actually, this is YOUR version of PHP which is not compiled correctly, and that uses the legacy ANSI version of the Win32 API (for compatibility with the legacy 16-bit versions of Windows 95/98 whose filesystem support in the kernel actually had no direct support for Unicode, but used an internal conversion layer to convert Unicode to the local ANSI codepage before using the actual ANSI version of the API).

Recompile PHP using the compiler option to use the UNICODE version of the Win32 API (which should be the default today, and anyway always the default for PHP installed on a server that will NEVER be Windows 95 or Windows 98...)

Then Windows will be able to store UTF-16 encoded filenames (including on FAT32 volumes, even if, on these volumes, it will also generate an aliased short name in 8.3 format using the filesystem's default codepage, something that can be avoided in NTFS volumes).

All what you describe are problems of PHP (incorrect porting to Windows, or incorrect system version identification at runtime) : reread the README files coming with PHP sources explaining the compilation flags. I really think that the makefile on Windows should be able to configure and autodetect if it really needs to use ONLY the ANSI version of the API. If you are compiling it for a server, make sure that the Configure script will effectively detect the full support of the UNICODE version of the Win32 API and will use it when compiling PHP and when selecting the runtime libraries to link.

I use PHP on Windows, correctly compiled, and I absolutely DON'T know the problems you cite in your article.

Let's forget now forever these non-UNICODE versions of the Win32 API (which are using inconsistently the local ANSI codepage for the Windows graphical UI, and the OEM codepage for the filesystem APIs, the DOS/BIOS-compatible APIs, the Console APIs) : these non-Unicode versions of the APIs are even MUCH slower and more costly than the Unicode versions of the APIs, because they are actually translating the codepage to Unicode before using the core Unicode APIs (the situation on Windows NT-based kernels is exactly the reverse from the situation on versions of Windows based on a virtual DOS extender, such as Windows 95/98/ME).

When you don't use the native version of the API, your API call will pass through a thunking layer that will transcode the strings between Unicode and one of the legacy ANSI or CHCP-selected OEM codepages, or the OEM codepage hinted on the filesystem: this requires additional temporary memory allocation within the non-native version of the Win32 API. This takes additional time to convert things before doing the actual work by calling the native API.

In summary: the PHP binary you install on Windows MUST be different depending on if you compiled it for Windows 95/98/SE (or the old Win16s emulation layer for Windows 3.x, which had a very minimum support of UTF-8, only to support the Unicode subsets of Unicode used by the ANSI and OEM codepages selected when starting Windows from a DOS extender) or if it was compiled for any other version of Windows based on the NT kernel.

The best proof that this is a problem of PHP and not Windows, is that your weird results will NOT occur in other languages like C#, Javascript, VB, Perl, Ruby... PHP has a very bad history in tracking versions (and too many historical source code quirks and wrong assumptions that should be disabled today, and an inconsistent library that has inherited all those quirks initially made in old versions of PHP for old versions of Windows that are even no longer officially supported, by Microsoft or even by PHP itself !).

In other words : RTM ! Or download and install a binary version of PHP for Windows precompiled with the correct settings : I really think that PHP should distribute Windows binaries already compiled by default for the Unicode version of the Win32 API, and using the Unicode version of the C/C++ libraries : internally the PHP code will convert its UTF-8 strings to UTF-16 before calling the Win32 API, and back from UTF-16 to UTF-8 when retrieving Win32 results, instead of converting PHP's internal UTF-8 strings back/to the local OEM codepage (for the filesystem calls) or the local ANSI codepage (for all other Win32 APIs, including the registry or process).

Funk Forty Niner
virmaior
  • I'm giving this the checkmark for now, the project has gone to the backburner and I don't have time to recompile PHP and test this yet - but it sounds correct. I did read this article as it was linked in another comment here, but I did not read all the comments on it. I'm going to add some context to your post. The comment does mention: *"...use the UNICODE version of the Win32 API (which should be the default today)"* (in 2010) – Wesley Murch Mar 30 '12 at 19:55
  • So for anyone else reading this, I cannot confirm this. I've accepted the answer because I *believe* it may be true, although frankly I have no idea how my local PHP version was compiled. After several manual re-installs, I just use XAMPP now. – Wesley Murch Mar 30 '12 at 20:01
  • @WesleyMurch, did you find a solution to your question? If yes, then please share it, because handling Unicode really hurts a lot :( – Smile Jul 18 '13 at 05:32
  • I am so sorry I lost the comments to the original article (from my blog), but this commenter was indeed extremely mistaken, and this was quickly confirmed by one of the core devs in that same thread. Basically your only chance is to use COM or accept the fact that PHP on Windows is not able to do this. – Evert May 07 '15 at 00:18

Try setting the encoding via the locale inside your function (or script):

setlocale(LC_ALL,'C.UTF-8');
  • "Try this" is not an answer to the question. If you have tested this code with the provided sample filenames from the question, and you know it works, please include that information in the answer. – miken32 Oct 29 '21 at 16:21
  • O.K. I used this call in a function that builds a folder-based tree menu: it finds all sub-folders and shows them as a tree menu, with Georgian, Japanese, and Russian names. It is too long and this page does not allow me to post the whole function, but the part is like this: – Zurab Jamagidze Nov 06 '21 at 20:27

Starting with PHP 7.1, long paths and UTF-8 paths on Windows are supported directly in the core.
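As a rough illustration, a script could check at runtime whether the limitation discussed in this thread is likely to apply. The helper name ansi_limit_likely and its parameters are made up for this sketch, and it only encodes the version claim above (Windows builds older than 7.1), nothing more.

```php
<?php
/**
 * Guess whether the ANSI-codepage filename limitation is likely to apply.
 * Hypothetical helper; it only checks "Windows and PHP older than 7.1".
 *
 * @param string $phpVersion Version string, defaults to the running PHP.
 * @param string $os         OS name as reported by PHP_OS, e.g. "WINNT".
 * @return bool
 */
function ansi_limit_likely($phpVersion = PHP_VERSION, $os = PHP_OS)
{
    // PHP_OS starts with "WIN" on Windows builds (e.g. "WINNT").
    $isWindows = strtoupper(substr($os, 0, 3)) === 'WIN';

    return $isWindows && version_compare($phpVersion, '7.1.0', '<');
}
```

A file manager could use such a check to warn the user, or to switch to a fallback, before listing a directory that may contain multibyte names.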

Anatol Belski
  • Please [don't post identical answers to multiple questions](http://meta.stackexchange.com/questions/104227/is-it-acceptable-to-add-a-duplicate-answer-to-several-questions). Post one good answer, then vote/flag to close the other questions as duplicates. If the question is not a duplicate, tailor your answers to the question. – miken32 Aug 16 '16 at 00:31
  • @miken32 yeah, thanks. Lesson learned already in another question. This issue is so many-faced that it's probably impossible to chase it down to just one ticket. There are quite a few tickets on Stack Overflow, but also on the PHP bug tracker and elsewhere. – Anatol Belski Aug 16 '16 at 23:41
  • Btw, not sure why it needs to be a copypasted text from a moderator, anyway :) – Anatol Belski Aug 17 '16 at 08:06

PHP on Windows does not use the Unicode API yet, so you have to use the runtime encoding (whatever it is) to deal with non-ASCII characters.
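A minimal sketch of what "using the runtime encoding" could look like, assuming the script holds names as UTF-8 and the Windows ANSI code page is CP1252 (both are assumptions; the real code page depends on the machine). The helper name to_codepage is hypothetical. It converts with iconv() and then converts back, so a name that the code page cannot represent is reported as null instead of being silently mangled.

```php
<?php
/**
 * Convert a UTF-8 file name to a single-byte code page, or return null
 * when the name cannot be represented in that code page.
 * Hypothetical helper; CP1252 is only an assumed default.
 *
 * @param string $utf8Name File name encoded as UTF-8.
 * @param string $codepage Target code page name understood by iconv.
 * @return string|null
 */
function to_codepage($utf8Name, $codepage = 'CP1252')
{
    // Some iconv implementations fail on unmappable characters,
    // others substitute a placeholder, so verify with a round trip.
    $converted = @iconv('UTF-8', $codepage, $utf8Name);
    if ($converted === false) {
        return null;
    }

    $roundTrip = @iconv($codepage, 'UTF-8', $converted);

    return $roundTrip === $utf8Name ? $converted : null;
}
```

Under these assumptions, a name like filename-äöü.jpg converts cleanly, while 文件名.jpg yields null, which would be consistent with glob() finding only the former in the question's output.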

Pierre