8

Our CMS accepts files with national characters in their names and stores them on the server without a problem. But how bad is such an approach in the long run? For example, is it possible to store files with filenames in Hebrew, Arabic, or any other language with a non-Latin alphabet? Is there an established, standard way to handle these?

jayarjo
  • Just found a possible problem discussed here: http://stackoverflow.com/questions/93551/how-to-encode-the-filename-parameter-of-content-disposition-header-in-http. It basically says that there is no established (that is, cross-browser) way to force-download files with non-US-ASCII filenames. – jayarjo Dec 02 '10 at 08:24

6 Answers

7

A standard way would be to generate unique names yourself and store the original file name somewhere else. Typically, even if your underlying OS and file system allow arbitrary Unicode characters in file names, you don't want users to decide about file names on your server. Doing so may introduce certain risks and lead to problems, e.g. those caused by overly long names or name collisions. Facebook, flickr, and many other sites do exactly this.

For generating the unique file name, GUID values would be a good choice.
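
A minimal sketch of this approach in Python (the helper name and storage layout here are illustrative, not from the answer):

    import os
    import uuid

    def store_upload(original_name: str, data: bytes, upload_dir: str) -> str:
        # Keep only the extension from the user-supplied name; the full
        # original name goes into the database for display and download.
        _, ext = os.path.splitext(original_name)
        unique_name = uuid.uuid4().hex + ext.lower()
        with open(os.path.join(upload_dir, unique_name), "wb") as f:
            f.write(data)
        return unique_name

Because the user never influences the on-disk name, odd characters or collisions in the original name cannot affect the file system.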

Dirk Vollmar
4

Store the original filename in a database of some sort, in case you ever need to use it.

Then, rename the file using a unique alphanumeric id, keeping the original file extension.

If you expect many files then you should create directories to group the files. Using the year, month, day, hour and minute is usually enough for most cases. For example:

.../2010/12/02/10/28/1a2b3c4d5e.mp3
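
A sketch of how such a path could be built in Python (the id generation and the root directory are illustrative assumptions):

    import os
    import secrets
    from datetime import datetime, timezone

    def build_storage_path(root: str, extension: str) -> str:
        # Group files by year/month/day/hour/minute so no single
        # directory grows too large.
        subdir = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H/%M")
        file_id = secrets.token_hex(5)  # e.g. '1a2b3c4d5e'
        directory = os.path.join(root, subdir)
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, file_id + extension)

    # build_storage_path("/var/uploads", ".mp3")
    # -> /var/uploads/2010/12/02/10/28/1a2b3c4d5e.mp3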

Yes, I've had experience with massive mp3 collections, which are notorious for being named in the language of the country where the song originates, and that can cause trouble in several places.

zaf
  • Could you possibly share examples of where you encountered problems with such files? – jayarjo Dec 02 '10 at 09:41
  • System calls that take filenames as parameters, and shell scripts, come to mind first. Also, after software upgrades, filenames that work in some ways now may break in new ways. – zaf Dec 02 '10 at 14:15
2

It's fine as long as you detect the charset from the request headers and use a consistent charset (such as UTF-8) internally.
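
For instance, a sketch of that normalization step in Python (assuming the filename arrives as raw bytes together with a charset declared in the request headers):

    import unicodedata

    def normalize_upload_name(raw: bytes, declared_charset: str = "utf-8") -> str:
        try:
            name = raw.decode(declared_charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset declaration: fall back to UTF-8,
            # replacing undecodable bytes rather than failing the upload.
            name = raw.decode("utf-8", errors="replace")
        # NFC normalization keeps e.g. "é" as one code point everywhere,
        # so the same visual name always maps to the same stored bytes.
        return unicodedata.normalize("NFC", name)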

Ignacio Vazquez-Abrams
1

On a Unix server, it's technically feasible and easy to accept any Unicode character in the filename and then convert filenames to UTF-8 before saving them. However, there might be bugs in the conversion (in the HTML templating engine or web framework you are using, or in the user's web browser), so some users may complain that files they have uploaded seem to have disappeared. The root cause might be buggy filename conversion. If all characters in the filename are non-Latin and you (as a software developer) don't speak that foreign language, then good luck figuring out what happened to the file.
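
A sketch of that conversion step in Python (the NFC normalization and the round-trip check are assumptions on top of what the answer says, meant to surface the "disappeared file" class of bugs early):

    import os
    import unicodedata

    def save_with_utf8_name(name: str, data: bytes, directory: str) -> str:
        canonical = unicodedata.normalize("NFC", name)
        try:
            encoded = canonical.encode("utf-8")  # bytes written to disk
        except UnicodeEncodeError:
            # Lone surrogates from a buggy upstream decode end up here;
            # reject loudly instead of letting the file "disappear".
            raise ValueError(f"filename is not valid Unicode: {name!r}")
        path = os.path.join(directory.encode("utf-8"), encoded)
        with open(path, "wb") as f:
            f.write(data)
        return canonical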

pts
0

It is an excellent idea. Being Hungarian, I'm pretty annoyed when I'm not allowed to use characters like áÉŰÖÜúÓÚŰÉÍí :)

neo2862
0

There is a lot of software out there that has bugs in handling such file names, especially on Windows.

Update: For example, I couldn't use the Android SDK without creating a new user, because I had an é in my user name. I also ran into a similar problem with the Intel C++ compiler.

Software usually isn't tested properly with such file names. The Windows API still offers "ANSI" encoded versions of its functions, and many developers don't seem to understand their potential problems. I also keep coming across webpages that mess up my name.
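
A small illustration of the loss such code paths cause (hypothetical; Python is used here to simulate an "ANSI" round trip through cp1252, the Western European code page):

    name = "árvíztűrő tükörfúrógép.txt"  # Hungarian phrase with ő and ű

    # cp1252 has no ő or ű, so an A-suffixed code path loses them:
    ansi_bytes = name.encode("cp1252", errors="replace")
    print(ansi_bytes.decode("cp1252"))
    # -> árvízt?r? tükörfúrógép.txt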

I'm not saying you shouldn't allow such file names; in fact, in the 21st century I would expect to be able to use such characters everywhere. But be prepared that you may run into problems.

darklon
  • Please elaborate. All _certified_ Windows software is tested for this specifically. – MSalters Dec 02 '10 at 08:17
  • @MSalters: not all software is Windows certified. I added some examples. – darklon Dec 02 '10 at 08:32
  • "The Windows API uses ANSI encoding" ? It did, back in the Windows 95 days. But NT has always been Unicode (UTF-16) based - see the A/W variants. On W95, the W varaint was mapped to A, but on NT it's reversed. COM API's have always been Unicode-only (BSTR) – MSalters Dec 02 '10 at 12:27
  • @MSalters: You are right, but it still offers the "ANSI" encoded versions of functions, and developers tend to use them. I corrected my post. – darklon Jul 20 '11 at 17:02