172

I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.

The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using `<form accept-charset="utf-8">` is only useful if the user actually submits the form), or it could be from an uploaded text file, so I really have no control over the input.

What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text); but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/

For file uploads, I like the idea of asking the end user to specify the encoding they use, and show them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).

I've read the other Stack Overflow questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").

But there must be something that at least has a good try!

Grim...
  • It's basically not possible by definition to get absolutely correct, in reality the success rate of guessing an unknown encoding is not terrific. It's possible to use heuristics, but it will be correct less than 100% of the time, depending on the material *far less* than 100%. You need to be aware of that. Maybe somebody here can at least suggest a library with good heuristics though. – deceze Nov 02 '11 at 11:30
  • Sure, I know there's no perfect solution - hence the desire for something that will at least have a good go. – Grim... Nov 02 '11 at 11:32
  • this might help: http://stackoverflow.com/q/505562/642173 – Melsi Nov 02 '11 at 11:40
  • Have you tried using `UTF-8//IGNORE` as the 2nd param in `iconv`? – fire Nov 02 '11 at 12:04
  • Yeah, that's what I ended up doing. Not perfect, obviously, as then 'fiancée' becomes 'fiance', but it's certainly better. How come TRANSLIT doesn't work? – Grim... Nov 02 '11 at 12:28
  • Isn't it easier to **ASK** clients the language source (aka localization)? Saves you the headache in long run. – Alvin K. Nov 20 '11 at 02:12
  • Of course, part of the problem is that non-English words will crop up in English text fairly frequently (e.g. 'fiancée'), and the same problem occurs with other languages too - I remember when I was at school, there was a movement in France to purge phrases like 'le weekend'. – Phil Lello Nov 20 '11 at 21:20
  • possible duplicate of [Detect encoding and make everything UTF-8](http://stackoverflow.com/questions/910793/detect-encoding-and-make-everything-utf-8) – That Brazilian Guy Aug 22 '13 at 15:26
  • @Grim... I made a contribution aimed at those that attempt to solve this primarily with `mb_*` functions. It is kind of wild, but hey, why not? :-) If there was a way to get rid of `utf8_decode` and `utf8_encode`, it might be better. Perhaps `iconv`??? – Anthony Rutledge Mar 15 '17 at 16:00
  • @Grim... I found this, http://stackoverflow.com/a/3521396/1429677 excellent answer to this issue, here is the lib https://github.com/neitanod/forceutf8 – Llewellyn Mar 15 '17 at 22:10
  • My comment as of 2019: validate and accept the input from a UTF-8-encoded page into a utf8mb4 DB as-is, with prepared statements, and take your precautions while printing it to the screen. This will be safe and always readable, without the need for what is being asked. – Andre Chenier Jun 26 '19 at 15:24

12 Answers

289

What you're asking for is extremely hard. If possible, getting the user to specify the encoding is best. Preventing an attack shouldn't be much easier or harder that way.

However, you could try doing this:

iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);

Setting the third parameter (strict detection) to true might help you get a better result.
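If mb_detect_encoding() cannot identify the encoding in strict mode, it returns false, which would make the iconv() call fail. A minimal defensive sketch (the wrapper function name is my own, not part of any library) could look like this:

    function convert_to_utf8($text)
    {
        // Strict mode (third parameter = true) makes mb_detect_encoding()
        // return false instead of a sloppy guess when it is not sure.
        $encoding = mb_detect_encoding($text, mb_detect_order(), true);

        if ($encoding === false) {
            // No confident guess - decide yourself whether to throw, log,
            // or strip invalid sequences as a last resort.
            throw new RuntimeException('Could not detect the source encoding.');
        }

        return iconv($encoding, 'UTF-8', $text);
    }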

Jeff Day
  • Please, take a look at the `mb_detect_encoding` source code in your PHP distro (somewhere here: ext/mbstring/libmbfl/mbfl/mbfl_ident.c). This function does not work properly at all. For some encodings it even has "return true", lol. Others are Ctrl+C Ctrl+V functions. That's because you cannot detect encoding without some kind of dictionary or statistical approach (like mine). – Oroboros102 Nov 18 '11 at 19:41
  • The way I understand it, `mb_detect_encoding` goes through the list of supplied encodings, and accepts the first one which has no invalid byte sequences in the string ... For encodings which have no invalid byte sequences such as ISO-8859-1, it's always true. No "smart" heuristics, and results vary greatly with the list (and order) of encodings you pass. – wutz Nov 20 '11 at 19:49
  • This seems to be working for me. My users were submitting text on a utf8 page with tinymce, yet for some unknown reason non utf8 characters sometimes ended up in the database. This fixed it, so thank you very much. – giorgio79 Oct 13 '12 at 14:27
  • @Jeff Day - Thanks for this. Pardon my ignorance, what do you mean 'Setting it to Strict'? – Ash501 Nov 26 '14 at 01:35
  • [Jeff Day] is sending `mb_detect_order()` even though it is the default value for this param, because he wanted to set strict encoding detection to true (the 3rd param) :) – jave.web Aug 18 '16 at 18:51
  • The ISO string `mb_detect_encoding('áéóú', 'UTF-8', true)` returns `false` and so does `iconv()`. I do not see a benefit compared to simply detecting if it is UTF-8: http://stackoverflow.com/a/4407996/318765 – mgutt Feb 10 '17 at 00:13
  • I propose `$encoding = mb_detect_encoding($text, 'ASCII, UTF-8, ISO-8859-1', true); $text = $encoding ? iconv($encoding, 'UTF-8//TRANSLIT', $text) : '';` instead. But finally it will not solve the problem as for example an input of UTF-16 will result an empty string as UTF-16 can not be detected. – mgutt Feb 10 '17 at 00:26
  • If I run the proposed `iconv()` command, and then run `mb_detect_encoding($encoded_text, mb_detect_order(), true)` on the encoded text, I still get `ASCII`, while the `iconv()` command should have supposedly encoded it to `UTF-8`... – kregus Dec 10 '19 at 09:29
30

In motherland Russia we have four popular encodings, so your question is in great demand here.

You cannot detect the encoding from character codes alone, because code pages intersect; some code pages in different languages even overlap completely. So we need another approach.

The only way to work with unknown encodings is to work with probabilities. So, we do not want to answer the question "what is the encoding of this text?"; we are trying to understand "what is the most likely encoding of this text?".

One guy here in a popular Russian tech blog invented this approach:

Build the probability range of character codes in every encoding you want to support. You can build it using some big texts in your language (e.g., some fiction, use Shakespeare for English and Tolstoy for Russian, LOL). You will get something like this:

    encoding_1:
    190 => 0.095249209893009,
    222 => 0.095249209893009,
    ...
    encoding_2:
    239 => 0.095249209893009,
    207 => 0.095249209893009,
    ...
    encoding_N:
    charcode => probability

Next, take the text in the unknown encoding and, for every encoding in your "probability dictionary", look up the frequency of every symbol of the unknown-encoded text and sum those probabilities. The encoding with the highest total is the likely winner. Results are better for bigger texts.
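A rough sketch of the scoring step only (my own illustration, not the blog author's code); it assumes you have already built $tables, mapping each encoding name to a byte => probability array derived from large reference texts:

    function guess_encoding($raw, array $tables)
    {
        $best = null;
        $bestScore = -1.0;

        foreach ($tables as $encoding => $frequencies) {
            $score = 0.0;

            // Sum the reference probability of every byte in the unknown text
            foreach (count_chars($raw, 1) as $byte => $count) {
                if (isset($frequencies[$byte])) {
                    $score += $frequencies[$byte] * $count;
                }
            }

            if ($score > $bestScore) {
                $bestScore = $score;
                $best = $encoding;
            }
        }

        return $best; // the most probable encoding, not a guarantee
    }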

By the way, mb_detect_encoding certainly does not work. Yes, at all. Please take a look at the mb_detect_encoding source code in "ext/mbstring/libmbfl/mbfl/mbfl_ident.c".

Oroboros102
16

Just use the mb_convert_encoding function. It will attempt to auto-detect the character set of the text provided, or you can pass it a list.

Also, I tried to run:

$text = "fiancée";
echo mb_convert_encoding($text, "UTF-8");
echo "<br/><br/>";
echo iconv(mb_detect_encoding($text), "UTF-8", $text);

and the results are the same for both.
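If you would rather not rely on auto-detection, mb_convert_encoding() also accepts an explicit list of candidate source encodings as the third argument (the candidates below are only an example; on older PHP versions pass a comma-separated string instead of an array):

    $text = "fiancée";

    // The first candidate whose byte sequences match the string is used as the source.
    echo mb_convert_encoding($text, "UTF-8", ["UTF-8", "ISO-8859-1", "Windows-1252"]);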

Alexey Gerasimov
5

There is no way to identify the character set of a string that is completely accurate.

There are ways to try to guess the character set. One of these ways, and probably/currently the best in PHP, is mb_detect_encoding. This will scan your string and look for occurrences of byte sequences unique to certain character sets. Depending on your string, there may not be any such distinguishable occurrences.

Take the ISO-8859-1 character set vs ISO-8859-15.

There's only a handful of different characters, and to make it worse, they're represented by the same bytes. There is no way to tell, given a string whose encoding you don't know, whether byte 0xA4 is supposed to signify ¤ or €, so there is no way to know its exact character set.

(Note: you could add a human factor, or an even more advanced scanning technique (e.g., what Oroboros102 suggests), to try to figure out based upon the surrounding context, if the character should be ¤ or €, though this seems like a bridge too far.)

There are more distinguishable differences between, e.g., UTF-8 and ISO-8859-1, so it's still worth trying to figure it out when you're unsure, though you cannot, and should never, rely on the result being correct.

Interesting read: How do I determine the charset/encoding of a string?

There are other ways of ensuring the correct character set though. Concerning forms, try to enforce UTF-8 as much as possible (check out snowman to make sure your submission will be UTF-8 in every browser: Rails and Snowmen)

That being done, you can at least be sure that every text submitted through your forms is UTF-8. Concerning uploaded files, try running the Unix `file -i` command on them, e.g., via exec() (if possible on your server), to aid the detection (using the document's BOM).
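For example, a rough sketch of shelling out to file (my own illustration; the output format varies between platforms, so treat the parsing as fragile):

    // Ask the OS to guess the MIME type and charset of an uploaded file.
    $path = '/tmp/upload.txt'; // hypothetical path to the uploaded file
    $output = [];
    exec('file -bi ' . escapeshellarg($path), $output);

    // Typical output: "text/plain; charset=utf-8"
    if (isset($output[0]) && preg_match('/charset=([\w-]+)/i', $output[0], $m)) {
        $charset = strtoupper($m[1]); // e.g. "UTF-8", "ISO-8859-1"
    }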

Concerning scraping data, you could read the HTTP headers, which usually specify the character set. When parsing XML files, see if the XML metadata contains a charset definition.
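A minimal sketch of pulling the charset out of a Content-Type header when scraping (the header and body values here are illustrative):

    // Header and body as fetched from the remote server
    $contentType = 'text/html; charset=ISO-8859-1';
    $rawBody     = '...'; // the response body

    if (preg_match('/charset=([\w.-]+)/i', $contentType, $m)) {
        $rawBody = iconv($m[1], 'UTF-8', $rawBody);
    }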

Rather than trying to automagically guess the character set, you should first try to ensure a certain character set yourself where possible, or try to grab a definition from the source you're getting it from (if applicable), before resorting to detection.

matthiasmullie
  • Forms and email registration links with encrypted data. That is where I am trying to make my input be UTF-8 or nothing. What do you think of my answer? Helpful comments are appreciated. Thanks. – Anthony Rutledge Mar 15 '17 at 16:23
3

There are some really good answers and attempts to answer your question here. I am not an encoding master, but I understand your desire to have a pure UTF-8 stack all the way through to your database. I have been using MySQL's utf8mb4 encoding for tables, fields, and connections.

My situation boiled down to "I just want my sanitizers, validators, business logic, and prepared statements to deal with UTF-8 when data comes from HTML forms, or e-mail registration links." So, in my simple way, I started off with this idea:

  1. Attempt to detect encoding: $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];

  2. If encoding cannot be detected, throw new RuntimeException

  3. If input is UTF-8, carry on.

  4. Else, if it is ISO-8859-1 or ASCII

    a. Attempt conversion to UTF-8 (wait, not finished)

    b. Detect the encoding of the converted value

    c. If the reported encoding and converted value are both UTF-8, carry on.

    d. Else, throw new RuntimeException

From my abstract class Sanitizer


    private function isUTF8($encoding, $value)
    {
        return (($encoding === 'UTF-8') && (utf8_encode(utf8_decode($value)) === $value));
    }

    private function utf8tify(&$value)
    {
        $encodings = ['UTF-8', 'ISO-8859-1', 'ASCII'];

        mb_internal_encoding('UTF-8');
        mb_substitute_character(0xfffd); //REPLACEMENT CHARACTER
        mb_detect_order($encodings);

        $stringEncoding = mb_detect_encoding($value, $encodings, true);

        if (!$stringEncoding) {
            $value = null;
            throw new \RuntimeException("Unable to identify character encoding in sanitizer.");
        }

        if ($this->isUTF8($stringEncoding, $value)) {
            return;
        } else {
            $value = mb_convert_encoding($value, 'UTF-8', $stringEncoding);
            $stringEncoding = mb_detect_encoding($value, $encodings, true);

            if ($this->isUTF8($stringEncoding, $value)) {
                return;
            } else {
                $value = null;
                throw new \RuntimeException("Unable to convert character encoding from ISO-8859-1, or ASCII, to UTF-8 in Sanitizer.");
            }
        }

        return;
    }

One could make an argument that I should separate encoding concerns from my abstract Sanitizer class and simply inject an Encoder object into a concrete child instance of Sanitizer. However, the main problem with my approach is that, without more knowledge, I simply reject encoding types that I do not want (and I am relying on PHP mb_* functions). Without further study, I cannot know if that hurts some populations or not (or, if I am losing out on important information). So, I need to learn more. I found this article.

What every programmer absolutely, positively needs to know about encodings and character sets to work with text

Moreover, what happens when encrypted data is added to my email registration links (using OpenSSL or mcrypt)? Could this interfere with decoding? What about Windows-1252? What about security implications? The use of utf8_decode() and utf8_encode() in Sanitizer::isUTF8 are dubious.

People have pointed out shortcomings in the PHP mb_* functions. I never took the time to investigate iconv, but if it works better than the mb_* functions, let me know.

Anthony Rutledge
  • I found this, http://stackoverflow.com/a/3521396/1429677 excellent answer to this issue, here is the lib https://github.com/neitanod/forceutf8 – Llewellyn Mar 15 '17 at 22:14
2

It seems that your question has been answered quite well, but I have an approach that may simplify your case:

I had a similar issue trying to return string data from MySQL, even after configuring both the database and PHP to return strings formatted as UTF-8. The only place I actually got the error was when returning them from the database.

Finally, sailing through the web, I found a really easy way to deal with it:

Given that you can save all those types of string data in your MySQL database in different formats and collations, you only need to set the connection character set to UTF-8, right in your PHP connection file, like this:

$connection = new mysqli($server, $user, $pass, $db);
$connection->set_charset("utf8");

This means that you can first save the data in any format or collation, and it is converted only when it is returned to your PHP file.

Quel Pino
2

The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using `<form accept-charset="utf-8">` is only useful if the user actually submits the form), or it could be from an uploaded text file, so I really have no control over the input.

I don't think it's a problem. An application knows the source of the input. If it's from a form, use UTF-8 encoding in your case. That works. Just verify that the data provided is correctly encoded (validation). Keep in mind that not all databases support UTF-8 in its full range.
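A minimal validation sketch for form input, assuming the form declares accept-charset="utf-8" (the field name is just an example):

    // Reject anything that is not well-formed UTF-8 instead of trying to guess.
    $name = isset($_POST['name']) ? $_POST['name'] : '';

    if (!mb_check_encoding($name, 'UTF-8')) {
        http_response_code(400);
        exit('Input must be UTF-8 encoded.');
    }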

If it's a file, you won't save it UTF-8-encoded in the database, but rather in binary form. When you output the file again, use binary output as well; then this is totally transparent.

Your idea that a user can tell you the encoding is nice, but he/she could tell the encoding anyway after downloading the file, as it's binary.

So I must admit I don't see a specific issue you raise with your question.

hakre
1

There are a couple of libraries out there. onnov/detect-encoding looks promising. It claims to do better than mb_detect_encoding.

Example usage for converting string in unknown character encoding to UTF-8:

use Onnov\DetectEncoding\EncodingDetector;

$detector = new EncodingDetector();
$utf8Text = $detector->iconvXtoEncoding('Проверяемый текст');

To simply detect encoding:

$encoding = $detector->getEncoding('Проверяемый текст');
rosell.dk
1

Because the usage of UTF-8 is widespread, you can assume it is the default, and when it is not, try to guess and convert the encoding. Here is the code:

function make_utf8(string $string)
{
    // Test it and see if it is UTF-8 or not
    $utf8 = \mb_detect_encoding($string, ["UTF-8"], true);

    if ($utf8 !== false) {
        return $string;
    }

    // From now on, it is a safe assumption that $string is NOT UTF-8-encoded

    // The detection strictness (i.e. third parameter) is up to you
    // You may set it to false to return the closest matching encoding
    $encoding = \mb_detect_encoding($string, mb_detect_order(), true);

    if ($encoding === false) {
        throw new \RuntimeException("String encoding cannot be detected");
    }

    return \mb_convert_encoding($string, "UTF-8", $encoding);
}

Simple, safe and fast.

MAChitgarha
  • Haha wow, I asked this question 11 years ago (and I honestly can't remember why)! Thanks for your answer though, it was still interesting to read. I have a question, but just because I'm interested - why `!== false` instead of `=== true`? – Grim... Jul 15 '22 at 11:23
  • @Grim..., because the return type of `\mb_detect_encoding()` in this case is `string|false` (either `string` or `false`). It can never equal `true`. Maybe you currently write code in a strongly-typed language. ;) – MAChitgarha Jul 15 '22 at 18:18
  • Hahaha it's been a while since I've used PHP for sure! Got to admit that I'd probably just use `if (!$utf8)` because I'm lazy :-) – Grim... Jul 18 '22 at 10:00
1

You could set up a set of metrics to try to guess which encoding is being used. Again, it is not perfect, but it could catch some of the misses from mb_detect_encoding().

Parris Varney
1

If you're willing to "take this to the console", I'd recommend enca. Unlike the rather simplistic mb_detect_encoding, it uses "a mixture of parsing, statistical analysis, guessing and black magic to determine their encodings" (lol - see the man page). However, you usually have to pass the language of the input file if you want to detect such country-specific encodings. (That said, mb_detect_encoding essentially has the same requirement, as the encoding would have to appear "in the right place" in the list of passed encodings for it to be detectable at all.)
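A rough sketch of calling enca from PHP (my own illustration; the -L language hint and the -i/--iconv-name output option may differ between enca versions, so check your man page; the file path is hypothetical):

    // Ask enca for an iconv-compatible name of the file's encoding,
    // hinting the expected language of the content.
    $file = '/path/to/upload.txt';
    $name = trim((string) shell_exec('enca -L russian -i ' . escapeshellarg($file)));

    if ($name !== '') {
        $utf8 = iconv($name, 'UTF-8', file_get_contents($file));
    }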

enca also came up here: How to find encoding of a file in Unix via script(s)

wutz
  • As [matthiasmullie said](https://stackoverflow.com/questions/7979567/php-convert-any-string-to-utf-8-without-knowing-the-original-character-set-or/8202819#8202819), statistical analysis may not be of much help. – Peter Mortensen Apr 20 '22 at 09:45
0

If the text is retrieved from a MySQL database, you may try adding this after the database connection.

mysqli_set_charset($con, "utf8");

mysqli::set_charset

Peter Mortensen