
Currently users can upload files with any name they like, so the uploaded file names contain spaces and characters like ß, ü and so on. Other users can then download these files (with the white space and so on in the URL). It works this way, but according to RFC 1738 - Uniform Resource Locators (URL), only alphanumeric characters [a-zA-Z0-9] and some special/reserved characters are allowed. I also think spaces should be avoided.

Currently a ß in an uploaded name shows up as a garbled ß in the file name on the server. The user who wants to download the file gets the correct character (ß) from the MySQL database (utf8_unicode_ci), and so the file can be found on the server.

  • What is the correct way to handle file names?
  • Should I make a filename check and disallow the upload?
  • Should I rename the files on the server after the user upload (e.g. str_replace(), urlencode(), ...)?
hakre
testing

2 Answers


As long as your webserver takes care of handling the file downloads, ensure that it knows about the encoding on the file-system and that the file-system is compatible with the charset you use for the file-names of the uploads you handle.

As long as everything is compatible here (it looks like you use UTF-8), you won't run into any problems. Just ensure the encoding is set correctly at every place you make use of (file-system, webserver, database server, database-client-connection, browser, upload POST request, file-link-offering HTTP HTML response etc.).

If you intend to serve the files by PHP with the Content-Disposition header, you should only allow the following characters within file-names:

a-z, A-Z, 0-9, _, - , .

That's because that header has no working specification for characters outside of the US-ASCII printable range.

Normally when a file is uploaded, its filename gets normalized. It's also wise to do some validation / sanitizing at the point of upload.
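
A minimal sketch of such a sanitizing step, written in Python for illustration (in PHP the same idea maps onto `preg_replace` with the same character class). The helper name `sanitize_filename` and the choice of `_` as replacement are assumptions, not part of the answer. It keeps only the characters listed above and rejects names that end up empty or consisting only of dots.

```python
import re
import unicodedata

# Whitelist from the answer: a-z, A-Z, 0-9, "_", "-" and ".".
_DISALLOWED = re.compile(r"[^A-Za-z0-9_.-]")


def sanitize_filename(name: str, replacement: str = "_") -> str:
    """Hypothetical helper: reduce an uploaded file name to the whitelist."""
    # Decompose accented letters ("ü" becomes "u" + combining mark) so the
    # base letter survives, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Everything still outside the whitelist (spaces, "ß", "/", ...) is replaced.
    cleaned = _DISALLOWED.sub(replacement, stripped)
    # Refuse names that are empty or consist only of dots ("." and "..").
    if not cleaned.strip("."):
        raise ValueError("file name has no usable characters")
    return cleaned
```

Note that this destroys information: "ß" and the space in `Straße über.txt` both become `_`, so the original name should be stored separately if it is still needed.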

hakre
  • How do I find out the "encoding on the file-system and the file-system is compatible with the charset you use for the file-names of the uploads"? Yes, UTF-8 should be used. Webserver is Apache, database server is MySQL 5.0.77, connection is set with `SET NAMES utf8`, browser could be any, upload is via `POST` and `multipart/form-data`, and encoding of HTML is `UTF-8`. In Firebug the file-link-offering HTTP HTML response has `Content-Type application/x-www-form-urlencoded`, but I never set that myself (default?). – testing Apr 17 '12 at 11:43
  • So `Content-Disposition` would not be the thing I want. How does a filename get normalized and validated? – testing Apr 17 '12 at 11:44
  • @testing: You do that on your own with string processing. – hakre Apr 17 '12 at 11:53
  • But which characters do I replace? Each language has its own characters and I don't want to replace all of them. So is there a standard set/function? Or is allowing `a-z, A-Z, 0-9, _, - , . ` the way to go? – testing Apr 17 '12 at 12:03
  • @testing: That depends on your needs and what you want to do. There is something called [*transliteration*](http://stackoverflow.com/questions/1284535/php-transliteration), but you can also *remove* (strip) or *encode* (like `rawurlencode`). Many things are possible; the first two will destroy information, while encoding preserves that information, which might be more future-proof. – hakre Apr 17 '12 at 12:06
  • Thanks for the information. I included `rawurlencode` as Jon mentioned but I can't see a difference. The function is applied to the output of the download link. – testing Apr 17 '12 at 12:10
  • @testing: You would first of all `rawurlencode` the filename when uploading and store it that way on disk and inside the db. Then when outputting, as it's a URL with special characters, you would need to `rawurlencode` it again. Take care. – hakre Apr 17 '12 at 12:11
  • I tried this and now I have cryptic file names (%C3%9). Is there a way to have nicer file names or is that the price for that? – testing Apr 17 '12 at 12:33
  • That is the price of the encoding. – hakre Apr 17 '12 at 12:34
  • Now I tried to delete such an encoded file via FTP. I get `Forbidden command argument`. Do you know why? Renaming is also not possible. I deleted the file with the management webtool from the provider. – testing Apr 17 '12 at 13:17
  • It looks like your FTP client is not able to handle file-names properly. Consider switching the FTP client. – hakre Apr 17 '12 at 13:18
  • are you sure? WinSCP normally rocks. – hakre Apr 17 '12 at 15:17
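
The double encoding described in the comments above can be sketched as follows. This is a Python illustration, not the commenter's code: `urllib.parse.quote` with `safe=""` leaves only letters, digits and `-_.~` unescaped, which matches what PHP's `rawurlencode` does. The file name and the helper `percent_encode` are made up for the example.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str) -> str:
    # safe="" also encodes "/", so only letters, digits and "-_.~" pass through.
    return quote(text, safe="")


original = "Straße 1.txt"            # name as uploaded (hypothetical)
stored = percent_encode(original)    # name used on disk and in the database
link = percent_encode(stored)        # path segment for the download URL

# The webserver decodes the URL once, which yields the stored name again.
assert unquote(link) == stored
```

Here `stored` is `Stra%C3%9Fe%201.txt` and `link` is `Stra%25C3%259Fe%25201.txt`: the second pass is needed because the `%` signs in the stored name are themselves special in a URL.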

What is the correct way to handle file names?

You are already handling them it seems; wrap the filenames in rawurlencode before putting them in URL parameters to be spec-compliant as well.
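
As a rough sketch of that step, assuming the file stays on disk under its original UTF-8 name and only the link is encoded: the Python below uses `urllib.parse.quote` with `safe=""` as a stand-in for PHP's `rawurlencode`. The helper `download_url` and the `example.com` address are placeholders.

```python
from urllib.parse import quote


def download_url(base_url: str, filename: str) -> str:
    """Hypothetical helper: build a link to a file stored under its original name."""
    # Percent-encode the file name as UTF-8 bytes; spaces become %20.
    return base_url.rstrip("/") + "/" + quote(filename, safe="")


url = download_url("https://example.com/files/", "Maß für alle.pdf")
```

The resulting `url` contains only ASCII characters; the browser and webserver decode it back to the original name.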

Should I make a filename check and disallow the upload?

No, that would only serve to annoy your users.

Should I rename the files on the server after the user upload?

This can be a good idea. You can generate a "random" name with the technique of your choice and save the "original" name in the database. Whenever the user wants to download the file, give it back to them with the name they used to upload it through the Content-Disposition HTTP header.

Advantages of doing this include making certain that you won't get bitten by subtle differences between the filesystem of each user and the filesystem of your server and avoiding duplicate file name issues.
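
A minimal sketch of this approach in Python, under several assumptions: the dictionary stands in for the database table, the helper names are invented, and the `filename*` parameter (defined in RFC 6266) is an addition that is not part of this answer and that older browsers may ignore, which is why an ASCII-only `filename` fallback is sent as well.

```python
import re
import secrets
from urllib.parse import quote


def new_storage_name() -> str:
    # 32 random hex characters; independent of whatever the user typed.
    return secrets.token_hex(16)


def content_disposition(original_name: str) -> str:
    """Hypothetical helper: header value that hands back the original name."""
    # ASCII fallback: replace non-printable/non-ASCII characters, quotes
    # and backslashes so the quoted string stays valid.
    fallback = re.sub(r'[^\x20-\x7E]|["\\]', "_", original_name)
    # UTF-8 percent-encoded form for clients that understand filename*.
    encoded = quote(original_name, safe="")
    return "attachment; filename=\"{}\"; filename*=UTF-8''{}".format(fallback, encoded)


uploads = {}                          # stand-in for the database table
stored_as = new_storage_name()        # name of the file on the server
uploads[stored_as] = "Straße 1.txt"   # original name kept for the download
header = content_disposition(uploads[stored_as])
```

The file system then only ever sees names like the 32-character hex string, while `header` carries the name the user originally uploaded.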

Jon
  • What do you mean by "You are already handling them it seems"? I only move them to another location and save the original filename in the database. Does Content-Disposition work in all browsers? Should the user be told to use only alphanumeric characters? Does generating a random name have disadvantages for SEO? When would the current situation lead to problems (it works, but why)? – testing Apr 17 '12 at 11:19