2

What is the best way to convert user input to UTF-8?

I have a simple form where a user will pass in HTML. The HTML can be in any language and in any character encoding.

My question is:

  • Is it possible to represent everything as UTF-8?

  • What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?

I am trying to work out how to best implement this - advice and links appreciated.

I am making use of Codeigniter and its input class to retrieve post data.

A few points I should make:

  • I need to convert HTML special characters to their respective entities
  • It might be a good idea to accept input in any encoding and return it in that same encoding. However, my web app uses:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

This might have an adverse effect on things.

Abs

6 Answers

4

Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:

<form action="foo" accept-charset="UTF-8">...</form>

See here for a complete guide on HOW TO Use UTF-8 Throughout Your Web Stack.

Joseph Silber
  • What would happen if a user pastes in HTML from their editor which is in the `windows-1252` or some sort of `iso` encoding? Would the browser have no trouble converting this? Thank you for the link, looks super useful/thorough. – Abs Aug 22 '11 at 18:45
  • The browser should automatically send the info with the correct character encoding... – Joseph Silber Aug 22 '11 at 18:47
  • This might not work in IE according to: http://www.w3schools.com/tags/att_form_accept_charset.asp - have you experienced any problems with IE? – Abs Aug 22 '11 at 18:47
  • I have personally never had any problems with IE on this. What you saw there is that *if `accept-charset="ISO-8859-1"`, IE will send data encoded as "Windows-1252"*. That's to say: IE has trouble with `ISO-8859-1`, not with the `accept-charset` attribute *per se*. – Joseph Silber Aug 22 '11 at 18:50
  • @Abs: That attribute is informative only. It does not technically prevent any kind of data from being sent to your PHP script. – hakre Aug 22 '11 at 18:55
  • @hakre Technically true, but then you're just sh*t-outta-luck. :) You can't really do more than specify what you expect, clients will need to comply or all bets are off. – deceze Aug 22 '11 at 23:40
  • @deceze: Well, you can actually look at which encoding is signalled by the browser for the actual request. That's another location to look into. However, that can be tainted as well, so never trust user data :) – hakre Aug 23 '11 at 07:29
  • This along with deceze has helped me solve my issue, thank you! – Abs Aug 26 '11 at 22:13
2

Is it possible to represent everything as UTF-8?

Yes, UTF-8 is a Unicode encoding, so you can use any character defined in Unicode. That's the best you can do with a computer to date.
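
As a small illustration of that point (a sketch that is not part of the original answer; it assumes PHP 7 or later for the `\u{...}` escape syntax, and the sample characters are arbitrary choices), UTF-8 spends between one and four bytes per code point, which is how it covers the whole Unicode range:

```php
<?php
// Sample code points chosen for illustration, one per UTF-8 length class.
$samples = [
    'U+0041 (A)'     => "\u{0041}",
    'U+00E9 (e-acute)' => "\u{00E9}",
    'U+20AC (euro)'  => "\u{20AC}",
    'U+1F600 (emoji)' => "\u{1F600}",
];

$lengths = [];
foreach ($samples as $name => $char) {
    // strlen() counts bytes, so it reports the encoded length of each character.
    $lengths[$name] = strlen($char);
    printf("%s: %d byte(s)\n", $name, $lengths[$name]);
}
```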

What can I use to effectively convert any character encoding to UTF-8

iconv lets you convert virtually any encoding to any other encoding. But, for that you have to know what encoding you're dealing with. You can't say "iconv, whatever this is, make it UTF-8!". That's unfortunately not how it works. You can only say "iconv, I have this string here in BIG5, please convert that to UTF-8.".
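
A minimal sketch of such a call, assuming the input is already known to be Windows-1252 (the sample bytes and variable names are made up for illustration):

```php
<?php
// Assumption: the source encoding is known in advance to be Windows-1252.
// 0x93 and 0x94 are the curly double quotes in that encoding.
$input = "\x93quoted\x94";

// iconv(from, to, string) returns the converted string, or false on failure.
$utf8 = iconv('Windows-1252', 'UTF-8', $input);

if ($utf8 === false) {
    // The bytes could not be converted from the declared source encoding.
    throw new RuntimeException('Conversion failed');
}

// Show the resulting UTF-8 bytes in hex.
echo bin2hex($utf8), "\n";
```

If the declared source encoding is wrong, iconv will usually still return something, just not the text the user typed, which is why the source encoding has to be known rather than guessed.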

If you're only dealing with form data in UTF-8 though, you'll probably never need to convert anything.

so that I can parse it with PHP string functions

"PHP string functions" work on bytes. They don't care about characters or encodings. Depending on what you want to do, working with naive PHP string functions on UTF-8 text will give you bad results. Use encoding-aware string functions in the MB extension for any multi-byte encoding string manipulation.

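To make the byte-versus-character difference concrete, here is a small sketch (assuming the mbstring extension is loaded and PHP 7+ for the `\u{...}` escape; the sample word is arbitrary):

```php
<?php
// "naive" with a diaeresis on the i: 5 characters, 6 bytes in UTF-8.
$text = "na\u{00EF}ve";

$byteCount = strlen($text);              // counts bytes
$charCount = mb_strlen($text, 'UTF-8');  // counts characters

// substr() cuts by byte offset, so it can split the two-byte character
// in half and leave an invalid UTF-8 sequence behind.
$byteCut = substr($text, 0, 3);
// mb_substr() cuts by character offset and keeps the string valid.
$charCut = mb_substr($text, 0, 3, 'UTF-8');

var_dump($byteCount, $charCount);
var_dump(mb_check_encoding($byteCut, 'UTF-8'));
var_dump(mb_check_encoding($charCut, 'UTF-8'));
```
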
save it to my database

Just make sure your database stores text in UTF-8 and you have set your database connection to UTF-8 (i.e. the database knows you're sending it UTF-8 data). You should be able to specify that in the CodeIgniter database connection settings.

subsequently echo out using htmlentities?

Just echo htmlentities($text), nothing more you need to do.
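
One hedged addition: the default encoding used by htmlentities has changed between PHP versions (it was ISO-8859-1 before PHP 5.4), so passing it explicitly is the safer habit. A sketch with a made-up sample string:

```php
<?php
$text = "<b>caf\u{00E9}</b> & \"more\"";

// Passing the flags and encoding explicitly avoids depending on
// version-specific defaults.
$escaped = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
```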

However, my web app is making use of : <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

This might have an adverse effect on things.

Not at all. It just signals to the browser that your page is encoded in UTF-8. Now you just need to make sure that's actually the case (as you're trying to do anyway). It also implies to the browser that it should send UTF-8 to the server. You can make that explicit with the accept-charset attribute on forms.

May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which might help you understand more.

deceze
  • @hakre I'd like to hear your objection to UTF-8 in the database. What would you prefer? – deceze Aug 23 '11 at 07:59
  • +1: Well done answer, some PHP functions (next to mb) have different encoding support however. And avoid having UTF-8 in the MySQL database when you don't need to. But well, defer the details :). – hakre Aug 23 '11 at 08:03
  • MySQL: There are two things: storage requirements and character support. MySQL uses three bytes per character for UTF-8, which can cause an `a` to consume more bytes than needed in some tables, e.g. temporary tables, and that can cause trouble or a performance drain. In addition, not all planes of Unicode are supported: MySQL supports the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0. – hakre Aug 23 '11 at 08:06
  • @hakre Interesting, I have never looked into that. MySQL 5.5+ supports Unicode 5.0 though. UTF-8 still is wasteful apparently. – deceze Aug 23 '11 at 08:11
  • Three bytes still in 5.5: *"The utf8 character set is the same in MySQL 5.5 as before 5.5 and has exactly the same characteristics: [...]"* [Ref](http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8.html). - Look out for varchar / char columns: char has to reserve the maximum number of bytes per character, even when they are not needed. – hakre Aug 23 '11 at 08:14
1

1) Is it possible to represent everything as UTF-8?

Yes, everything defined in Unicode. That's the most you can get nowadays, and Unicode still has room for future additions.

2) What can I use to effectively convert any character encoding to UTF-8 so that I can parse it with PHP string functions and save it to my database and subsequently echo out using htmlentities?

The only thing you need to know is the actual encoding of your data. If you want your web application to support UTF-8 for input and output, the frontend needs to signal that it supports UTF-8. See Character Encodings for a guide regarding your application's user interface.

Within PHP you need to feed each function data in an encoding it supports. Some functions need the encoding specified explicitly; for others you need to convert the data first. Always check a function's documentation to see whether it supports the encoding you need. Additionally, check your PHP configuration.
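
As one possible way to check incoming data before handing it to other functions, here is a sketch using mb_check_encoding (it assumes the mbstring extension and PHP 7.1+ for the nullable return type; `requireUtf8` is a hypothetical helper name, not a built-in):

```php
<?php
/**
 * Hypothetical helper: returns the input unchanged if it is valid UTF-8,
 * or null if it contains byte sequences that are not valid UTF-8.
 */
function requireUtf8(string $input): ?string
{
    return mb_check_encoding($input, 'UTF-8') ? $input : null;
}

// Valid UTF-8 input passes through unchanged.
$good = requireUtf8("Gr\u{00FC}\u{00DF}e");

// The same word as raw ISO-8859-1 bytes is not valid UTF-8 and is rejected.
$bad = requireUtf8("Gr\xFC\xDFe");

var_dump($good !== null, $bad === null);
```

Note that this only tells you whether the bytes are well-formed UTF-8; it says nothing about which other encoding rejected input might be in.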

Related:

  1. Preparing PHP application to use with UTF-8
  2. How to detect malformed utf-8 string in PHP?
hakre
  • I would like a `[citation-needed]` for the claim that UTF-8 cannot encode all Unicode code points! – deceze Aug 22 '11 at 23:43
  • @deceze: Is that enough for a starter? *"RFC 3629 UTF-8 November 2003 3. UTF-8 definition UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646] In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. "* - The UTF-16 range is just not the full range. So isn't UTF-8. – hakre Aug 23 '11 at 07:25
  • Then I'll ask you where it says that UTF-16 is not the full range. Every piece of documentation I am looking at says that the current Unicode range is 000000 - 10FFFF and that all UTF encodings can encode all of these points. UTF-8 was originally even designed to use up to six octets, which means it could encode many more points if necessary. – deceze Aug 23 '11 at 07:43
  • That was rather nitpicky indeed. :) – deceze Aug 23 '11 at 07:59
  • @deceze: Looks like I'm nitpicking too much here. I'll change the answer; the planes I'm referring to are not defined yet. Currently only 21 bits are in use, which is safe for UTF-8. The UTF-8 encoding excludes some surrogates but includes some non-character code points. – hakre Aug 23 '11 at 08:01
0

I found that the only thing that works for UTF-8 encoding is setting the following inside my config.php:

putenv('LC_ALL=en_US.utf8'); // or whatever language you need
setlocale(LC_ALL, 'en_US.utf8');  // or whatever language you need
bindtextdomain("mydomain", dirname(__FILE__) . "/../language");
textdomain("mydomain");
gmaliar
0

If you want to change the encoding of a string you can try

$utf8_string = mb_convert_encoding( $yourBadString , 'UTF-8' );
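
A sketch of the same call with the source encoding passed explicitly as the third argument (the byte and the three candidate encodings are picked purely for illustration). It shows why the result depends entirely on knowing what the input really is:

```php
<?php
// One and the same byte, valid in several single-byte encodings.
$bytes = "\xE4";

// Each call is told a different source encoding and yields a different character.
$asLatin1   = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'); // Western European
$asCyrillic = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-5'); // Cyrillic
$asGreek    = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-7'); // Greek

// Print the resulting UTF-8 bytes in hex to make the difference visible.
echo bin2hex($asLatin1), ' ', bin2hex($asCyrillic), ' ', bin2hex($asGreek), "\n";
```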
  • Convert *from what* is the question though. If you don't know that, you can't reasonably and reliably convert anything. – deceze Aug 22 '11 at 23:47
  • If you don't know then you can use mb_detect_encoding() to find out. Though I've never had to detect encoding to force it to UTF-8, the third param of mb_convert_encoding is optional and not needed. – George Velez Aug 23 '11 at 16:05
  • If you don't supply the third parameter, it just defaults to the internally set encoding. Auto-detecting an encoding is somewhere between very very tricky to impossible, at the very least it's not perfectly reliable. It's all just bits, and often a bit sequence is equally valid in many different encodings, so "auto-detecting" often just comes down to guessing. – deceze Aug 24 '11 at 01:10
  • Yes, if you don't supply the 3rd param it will default to the internal encoding. But I disagree when you say it is "very very tricky to impossible"; we do this all the time in our applications. Working with the DoD gives us the opportunity to deal with a wide range of different languages, currencies and encodings, since we obviously have troops all over the globe. We have never had a problem with this technique. – George Velez Aug 24 '11 at 19:22
  • Then apparently you're not really dealing with a lot of ambiguous encodings: http://www.ideone.com/q2Skp – deceze Aug 25 '11 at 01:35
-1

EDIT :

Is it possible to represent everything as UTF-8?

Yes, this is what you need to ensure:

  • html : headers/meta-header set to utf-8
  • all files saved as utf-8
  • database collation, tables and data encoding set to utf-8

What can I use to effectively convert any character encoding to UTF-8

You can use utf8_encode before saving the input into your database (on a system set up mainly for Western European languages, the input will generally be ISO-8859-1 or a close relation).

// eg
$name = utf8_encode($this->input->post('name'));

As I mentioned before, you need to make sure the database collation, tables and data encoding are set to utf-8. In CI, in your database connection config:

// Make sure you have these lines
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';
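
A hedged variation on those settings: MySQL's `utf8` charset stores at most three bytes per character, and MySQL 5.5.3 and later also offer `utf8mb4`, which covers four-byte characters. Assuming your MySQL server, tables and CodeIgniter database driver all accept it (an assumption to verify for your setup), the config might look like this:

```php
<?php
// Assumption: server, tables and driver all support utf8mb4.
$db['default']['char_set'] = 'utf8mb4';
$db['default']['dbcollat'] = 'utf8mb4_unicode_ci';
```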
toopay
  • `utf8_encode` only converts from latin-1 to UTF-8. If the user is not sending you latin-1, this function is useless. If the user *is* sending you latin-1, you can only support the 256 characters of the latin-1 encoding. If you can specify to the user to send you latin-1, you can as well specify that you want UTF-8 directly. – deceze Aug 22 '11 at 23:41
  • @deceze, thanks for reminding me not to oversimplify the question. I updated my answer in response to your downvote (yay). My previous answer did indeed oversimplify the question. Lazy is my virtue (lol) :) – toopay Aug 23 '11 at 11:02