8

I need to save files with non-Latin filenames on a filesystem, using PHP.

I want to make this work cross-platform. How do I know what encoding I can use to write the file? I understand many modern filesystems are UTF-8 based (is this correct?), but I doubt Windows XP is (for instance).

So, is there a robust detection mechanism?

YakovL
Evert
  • I've always converted non-latin characters to the latin equivalent and stripped punctuation from the filename if I'm writing a file to disk. Can you guarantee your users will have the appropriate locales installed? – Greg K Mar 26 '10 at 11:09
  • NTFS (as used in WinXP etc) uses utf-16. php 5.x on windows uses the codepage of IUSR, eg, latin. I hear php 6 will use utf16 on windows – SteelBytes Mar 26 '10 at 11:11
  • @Greg K: The project I'm working on is a WebDAV server, so I need a clean mapping. – Evert Mar 26 '10 at 11:31
  • This question is related to NTFS/Windows: [file_exists() and file_get_contents() fail on a file which is named output‹ÕÍÕ¥.txt in PHP?](http://stackoverflow.com/questions/6634832/file-exists-and-file-get-contents-fail-on-a-file-which-is-named-output/6634924#6634924), see as well [What encoding are filenames in NTFS stored as?](http://stackoverflow.com/q/2050973/367456) – hakre Nov 15 '11 at 12:04

2 Answers

6

Not an answer to your question, but if you don't need to do extensive operations at the filesystem level (such as searching or sorting), there is a nice cross-platform workaround for the issue, outlined in this SO question: urlencode()ing the file names.

Hörensägen.txt 

gets turned into

H%C3%B6rens%C3%A4gen.txt

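The encoded name is plain ASCII, so it should be safe to use on any filesystem, and the scheme can represent any UTF-8 character. Note that each non-ASCII byte becomes three characters, so long names reach the filesystem's length limit sooner.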
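A minimal sketch of that mapping, assuming the incoming name is already UTF-8 and that this source file is saved as UTF-8. The helper names `encode_filename` and `decode_filename` are made up for illustration; only `urlencode()` and `urldecode()` are PHP built-ins. PHP emits the hex digits in uppercase.

```php
<?php
// Hypothetical helpers; the names are not from the original answer.

// Map a UTF-8 file name to an ASCII-only name for storage on disk.
function encode_filename(string $name): string
{
    // urlencode() percent-encodes every byte except A-Z, a-z, 0-9, "-", "_"
    // and "." (a space becomes "+"), so "/" and "\" never reach the stored name.
    return urlencode($name);
}

// Recover the original UTF-8 name from a stored name.
function decode_filename(string $stored): string
{
    return urldecode($stored);
}

$stored = encode_filename("Hörensägen.txt");
echo $stored, "\n";                  // H%C3%B6rens%C3%A4gen.txt
echo decode_filename($stored), "\n"; // Hörensägen.txt
```

Because "." is left untouched, the special names "." and ".." pass through unchanged and still need to be rejected separately.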

I find this much preferable to trying to deal "natively" with the host OS's capabilities, which is bound to be complicated and error-prone: in addition to operating system differences, I'm sure the various filesystem formats (FAT16, FAT32, NTFS, extFS versions 1/2/3, ...) bring their own sets of rules to be aware of.

Community
Pekka
  • Not a bad suggestion. I suppose I could provide the option. The question you linked to also mentions Windows uses ISO-8859-1. – Evert Mar 26 '10 at 11:25
  • @Evert not exactly, Windows's string handling has been UTF-16 based for a long time as far as I know, the answer claims *PHP's wrapper* to Windows' filesystem functions uses ISO-8859-1. I don't know for a fact whether that is true, but it is possible. – Pekka Mar 26 '10 at 11:28
0

PHP 7.1 supports UTF-8 filenames on Windows (I had a problem serving a file with Cyrillic characters in its name until I updated PHP and Apache), so if you can simply update PHP, that is the most robust and cross-platform solution these days.

I don't even need to ini_set('mbstring.internal_encoding','UTF-8'); for file_get_contents to work properly with non-latin paths.
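One way to check this on a given machine is a small round-trip test; the sketch below is only that, under a few assumptions. The function names are hypothetical, the Windows check via `DIRECTORY_SEPARATOR` is a common heuristic rather than an official API, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: true when this PHP build is expected to accept UTF-8 paths.
// Assumption: on non-Windows systems PHP passes path bytes through unchanged.
function supports_utf8_paths(): bool
{
    $onWindows = DIRECTORY_SEPARATOR === '\\';
    return !$onWindows || PHP_VERSION_ID >= 70100;
}

// Hypothetical helper: write, read back and list a file with a Cyrillic name.
function utf8_filename_round_trips(string $dir): bool
{
    $name = "привет.txt"; // UTF-8 literal
    $path = $dir . DIRECTORY_SEPARATOR . $name;

    if (file_put_contents($path, "hello") === false) {
        return false;
    }
    // The name must work for reading and must come back intact from a listing.
    $ok = file_get_contents($path) === "hello"
        && in_array($name, scandir($dir), true);
    unlink($path);
    return $ok;
}

if (supports_utf8_paths()) {
    echo utf8_filename_round_trips(sys_get_temp_dir())
        ? "UTF-8 file names round-trip\n"
        : "UTF-8 file names are mangled\n";
}
```

The directory listing check matters because a write can appear to succeed while the name is stored in a different encoding.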

YakovL