I'm writing a small web app that will receive and parse tab-delimited text files from users. The files will arrive either pasted into a textarea or uploaded through a multipart/form-data form, and they will come in a variety of charsets, including those used for Asian languages. Consequently, I am trying to use UTF-8 throughout the app.

The site is entirely (as far as I know) in UTF-8:

  • Each php file is saved in utf-8 encoding;
  • I have added default_charset = "utf-8" in my php.ini file;
  • The HTTP header and the HTML head contain the required utf-8 declarations:

    header('Content-Type:text/html; charset=UTF-8');
    ...
    <?xml version="1.0" encoding="utf-8" ?>
    ...
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    
  • The forms containing the textareas carry the accept-charset="UTF-8" attribute.

  • The db is collated in utf-8;
  • Each connection to the db includes the option 1002 => 'SET NAMES utf8' (1002 is the value of PDO::MYSQL_ATTR_INIT_COMMAND).

Now, I just discovered that I needed to set mb_regex_encoding to utf-8 manually for one of my parsing functions to work (I use mb_split() to identify and replace tabs and newlines). So ...
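To illustrate the parsing step described above, here is a minimal sketch. It assumes the mbstring extension is loaded and that the submitted text has already been read into a UTF-8 string; the function name `parse_tsv` and the sample data are hypothetical and not taken from the original code.

```php
<?php
// Tell the mb_ereg family (mb_split, mb_ereg_replace, ...) that patterns
// and subjects are UTF-8.
mb_regex_encoding('UTF-8');

/**
 * Hypothetical helper: split tab-delimited text into rows of fields.
 */
function parse_tsv(string $text): array
{
    $rows = [];
    // Split on CRLF, LF or CR line endings.
    foreach (mb_split('\r\n|\n|\r', $text) as $line) {
        if ($line === '') {
            continue; // skip empty lines, e.g. after a trailing newline
        }
        // Split each line into fields on the tab character.
        $rows[] = mb_split('\t', $line);
    }
    return $rows;
}

// Sample input: two rows, two fields each.
$rows = parse_tsv("名前\t都市\n田中\t東京\n");
```

Calling mb_regex_encoding() once before the first mb_split() call should be enough, since the setting applies for the rest of the request.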

What else do I need to do to make sure my site is, once and for all, utf-8 throughout? In particular, are there any other encoding functions I should call, such as mb_internal_encoding(), and if so, where in the code should I do that (e.g., at the start of the index.php file)?

BenMorel
JDelage

1 Answer

I can think of two more things:

mb_internal_encoding('UTF-8');

...as early as possible in the PHP script, and

mysqli_set_charset($link, 'utf8');

...to set the connection charset, if you're using MySQL through mysqli. For PDO, you can specify it in the connection string (the DSN):

"mysql:host=$host;dbname=$db;charset=utf8"
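
Put together, the two suggestions might look like the sketch below, placed at the top of the entry script. This is an illustration under assumptions, not a definitive setup: the helper `utf8_dsn`, the host `localhost`, and the database name `app` are made up for the example, and the actual PDO connection is left commented out because it needs a reachable MySQL server and credentials.

```php
<?php
// Run this before any multibyte string handling.
mb_internal_encoding('UTF-8');   // default for mb_strlen(), mb_substr(), ...
mb_regex_encoding('UTF-8');      // default for mb_split(), mb_ereg(), ...

/**
 * Hypothetical helper: build a PDO MySQL DSN that pins the connection charset.
 */
function utf8_dsn(string $host, string $db, string $charset = 'utf8'): string
{
    return "mysql:host=$host;dbname=$db;charset=$charset";
}

$dsn = utf8_dsn('localhost', 'app');   // placeholder host and database name
// $pdo = new PDO($dsn, $user, $password); // needs a reachable MySQL server

// With the internal encoding set, mb_strlen() counts characters, while
// strlen() still counts bytes.
$chars = mb_strlen('東京');   // 2 characters
$bytes = strlen('東京');      // 6 bytes in UTF-8
```

Note that MySQL's `utf8` charset stores at most three bytes per character. If the data may contain characters outside the Basic Multilingual Plane, passing `'utf8mb4'` as the third argument is worth considering, provided the tables use that charset as well.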
jgivoni
  • Regarding `mb_internal_encoding()`, should I do that before or after `session_start()`? – JDelage Feb 23 '12 at 22:24
  • @JDelage: `mb_...` is not related to `session_start` unless you have serialized objects that call `mb_...` functions while being deserialized. – hakre Feb 23 '12 at 22:28
  • You should set the internal encoding for multibyte string manipulation functions **before** you try to manipulate any multibyte strings. I don't think session_start() implies any string manipulation, so I wouldn't use that as a reference point. – jgivoni Feb 23 '12 at 22:29
  • I'm only using that as the first thing that is read by the script. My code starts with `require_once(objects in session);` and then `session_start()`. – JDelage Feb 23 '12 at 22:36
  • Wow, both question and 'my' answer have been edited so much here that I hardly recognize them anymore... – jgivoni Feb 24 '12 at 22:39