
We run a site where users upload image files. When these files are produced on a Mac, they sometimes have UTF-8 characters in their file names (since macOS uses UTF-8 as its file system character set).

When PHP7 code receives these files, we have to store them in the local file system, which is Debian Linux, and does not support UTF-8.

Also, while PHP7 can support UTF-8, it does not do it natively or automatically.

So, the question is: what's the current best practice for handling this?

Thought 1:

Save the original name in the database (collation = utf8mb4_unicode_ci?), and store the images on disk under a UUID. Then use the `download=""` attribute to have the file download under its original file name.

Pro: Seems to solve the problem.

Con: Multibyte support seems kludgy and clunky in PHP (even in 7.2.x+). Does this require a ton of checks in order to deal with it?
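A minimal sketch of Thought 1, to show how few checks it might need. The function names `buildUploadRecord()` and `contentDisposition()` are hypothetical, the mbstring extension is assumed to be loaded, and a random 32-character hex token stands in for a real UUID. The `Content-Disposition` header is shown as a server-side alternative to the `download=""` attribute, not as something the question itself proposes.

```php
<?php
declare(strict_types=1);

/**
 * Build the storage record for one upload (hypothetical helper).
 * $originalName is the client-supplied name, e.g. $_FILES['image']['name'].
 */
function buildUploadRecord(string $originalName): array
{
    // One check: refuse byte sequences that are not valid UTF-8
    // before they reach a utf8mb4 column. Requires ext-mbstring.
    if (!mb_check_encoding($originalName, 'UTF-8')) {
        throw new InvalidArgumentException('File name is not valid UTF-8');
    }

    // 16 random bytes as 32 hex characters: ASCII-only, so the name on
    // disk never depends on how the file system treats other characters.
    $storageName = bin2hex(random_bytes(16));

    return [
        'original_name' => $originalName, // goes into the database
        'storage_name'  => $storageName,  // used as the file name on disk
    ];
}

/**
 * Content-Disposition value that hands the original name back on download,
 * using the filename*=UTF-8'' form for non-ASCII names.
 */
function contentDisposition(string $originalName): string
{
    return "attachment; filename*=UTF-8''" . rawurlencode($originalName);
}
```

In an upload handler, the returned `storage_name` would be the target passed to `move_uploaded_file()`, and the original name is only ever treated as opaque data.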

Thought 2:

Sanitize / filter out the UTF-8 characters from the file name to avoid the problem altogether.

Pro: I can use latin collation in MySQL / MariaDB like we always have AND I don't have to worry about the file system charsets.

Con: This is lossy. A file named touché.pdf will get renamed touch.pdf, OR I have to create some equivalency tables to turn é into e.
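A rough sketch of what Thought 2 could look like, to make the lossiness concrete. `stripNonAscii()` and `toAscii()` are hypothetical names. The transliteration step assumes the intl extension is available and falls back to plain stripping when it is not, so the result depends on the server setup.

```php
<?php
declare(strict_types=1);

/** Delete every byte outside the ASCII range (the lossy variant). */
function stripNonAscii(string $name): string
{
    // No /u flag: the pattern works on bytes, so each byte of a
    // multibyte UTF-8 character is removed.
    return preg_replace('/[^\x00-\x7F]/', '', $name) ?? '';
}

/** Transliterate to ASCII where possible, then strip whatever is left. */
function toAscii(string $name): string
{
    // ext-intl ships transliteration rules, which avoids hand-written
    // equivalency tables. Assumption: intl may or may not be installed.
    if (function_exists('transliterator_transliterate')) {
        $ascii = transliterator_transliterate('Any-Latin; Latin-ASCII', $name);
        if ($ascii !== false) {
            return stripNonAscii($ascii);
        }
    }
    return stripNonAscii($name);
}
```

Either way, characters with no ASCII counterpart (emoji, for instance) simply disappear, and two different uploads can collapse to the same name.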

Thought 3:

I have over-thought this problem, or I am missing a simple solution.

What's the best way to deal with uploaded filenames that are UTF-8 / Multibyte?

DrDamnit
  • PHP has a rich set of [Multibyte String](http://php.net/manual/en/book.mbstring.php) handling options, and perhaps the [mb_convert_encoding](http://php.net/manual/en/function.mb-convert-encoding.php) function would be helpful. – Dave Aug 17 '18 at 18:28
  • For the database side, use `CHARSET utf8mb4`; the collation only governs ordering and comparison. Try to deal with it natively. There seems to be some discussion of UTF-8 file names on OS X here: https://stackoverflow.com/questions/6153345/different-utf8-encoding-in-filenames-os-x#6153713 – danblack Aug 18 '18 at 04:33
  • Seems I have an answer for both of my thoughts above. I can either use `mb_detect_encoding` to determine how to process the characters and then use `mb_convert_encoding` to transmute them, or go with the UUID solution and `utf8mb4`; both will work. The latter seems to give me the most "universal" solution (what if someone puts emoji in the filename?), so I'll go with that for now unless we have a compelling reason not to. I need an answer to accept / close this question if someone wants to put that down there... – DrDamnit Aug 20 '18 at 11:32
  • Solution 1 is the best one, unquestionably. Not sure what concerns you have regarding PHP's multibyte support. – deceze Aug 23 '18 at 17:18

1 Answer


Consider PHP's urlencode() to turn UTF-8 characters into % plus hex.

fn        'smiley-☺'
urlencode 'smiley-%E2%98%BA'
bin2hex   '736d696c65792de298ba'

I might prefer simply applying urlencode() to every name: names made up only of ASCII letters, digits, `-`, `_` and `.` come out unchanged, and I don't think the % will cause trouble. Other punctuation that may cause trouble in a file name (e.g. /) gets encoded as well.
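A small sketch of that idea as a round trip, assuming the encoded string is what gets used as the file name on disk. `encodeForDisk()` and `decodeFromDisk()` are hypothetical wrappers around PHP's own `urlencode()` and `urldecode()`.

```php
<?php
declare(strict_types=1);

/** Turn an uploaded name into an ASCII-only name for the file system. */
function encodeForDisk(string $name): string
{
    // urlencode() keeps ASCII letters, digits, '-', '_' and '.' as-is,
    // turns a space into '+', and every other byte into %XX.
    return urlencode($name);
}

/** Recover the original name from the stored one. */
function decodeFromDisk(string $stored): string
{
    return urldecode($stored);
}

// Caveat: '.' and '..' pass through unchanged, so they still need to be
// rejected separately before the name is used as a path component.
```

One thing to keep in mind with this approach is that every byte of a multibyte character becomes three ASCII characters, so long non-ASCII names grow quickly and may run into the file system's limit on name length.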

Rick James