0

This question is different from UTF-8 all the way through as it asks for how safe and is it a good practice to use the mb_convert_encoding function.

Lets say that a user can upload the files using the PHP API. Each filename and path gets stored in a PostgreSQL database table which has UTF-8 as default encoding.

Sometimes user uploads files which names aren't UTF-8 encoded and they get imported into the database. The problem is that the characters that are not UTF-8 encoded are scrambled and do not display as they should in the table columns.

I was thinking of adding the following to the PHP code before import:

if ( ! mb_check_encoding($output, 'UTF-8') {
    $output = mb_convert_encoding($content, 'UTF-8'); 
}

Does this look like a good practice and will it be displayed and converted by the user's client correctly if I return UTF-8 as the output? Is there a potential loss to the bytes by using mb_convert_encoding?

Thanks

Armin
  • 25
  • 5

1 Answers1

1

If you're going to convert an encoding, you need to know what you're converting from. You can check whether the encoding is or isn't valid UTF-8, but if it tells you it's not valid UTF-8 then you still have no clue what it is. Omitting the $from_encoding parameter from mb_convert_encoding just makes it assume some preset encoding for that parameter, but that doesn't mean that $content actually is in that encoding.

In other words: if you don't know what encoding a string is in, you cannot meaningfully convert it to anything else either, and just trying to convert it from ¯\_(ツ)_/¯ is a crapshoot with the result being equally likely to be something useful and utter garbage.

If you encounter unknown encodings, you only have a few choices:

  1. Reject the input value.
  2. Test whether it's one of a handful of other expected encodings and then explicitly convert from your best guess; but that is pretty much a crapshoot as well.
  3. Just use bin2hex or something similar on the value, essentially giving up on trying to interpret it correctly, but still leaving some semblance to the original value.
deceze
  • 510,633
  • 85
  • 743
  • 889