Like many other PHP developers, I have had issues with character encoding. This question outlines the steps I go through to ensure that my data is saved and output as UTF-8. I would like advice on what else I should consider or change in my current approach.

I have a MySQL database whose default character set is UTF-8; my tables use the utf8_general_ci collation.

I am using a PHP script to read data from an RSS feed and then save that data to my database. Before I save the data, I check whether it is UTF-8 by doing the following:

protected function _convertToUTF8($content) {
    $enc = mb_detect_encoding($content);
    return mb_convert_encoding($content, "UTF-8", $enc);
}

When outputting this data to a web page, I set the header in PHP:

header("Content-type: text/html; charset=utf-8");

and I also set the Content-Type meta tag to UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

So far everything works as expected: I get no funny characters in the output and all is going smoothly. Should I be changing or considering anything else when dealing with this data?

The problem I am now having is with outputting this data to a text file (CSV). I am using fwrite(), which successfully creates the file, but the third party I am passing the file to says that it is not UTF-8. I am not sure the data is being output as UTF-8; how can I check this? When I am logged into the remote server over SSH and cat the file I get Itâs a, when I open the file in vim I get Itâ~@~Ys, and when I view it with less I get It<E2><80><99>s. What am I missing here?

Thanks in advance!

dynamic
Lizard
  • could there be a BOM in the file it's complaining about? or maybe it wants a BOM in the file? I know I've had trouble with that (not in this particular application) before and I've had to resave a file (in utf-8) without the BOM to get it to work correctly. – kinakuta Jun 13 '11 at 22:31
  • yes, but hopefully I have explained clearly and will get a good answer that will help people in the future, as past questions have been vague and generally related to output and not saving of data. – Lizard Jun 13 '11 at 22:35
  • for example, see this post about an editor that doesn't interpret the file correctly as utf-8 without a BOM: http://stackoverflow.com/questions/2558172/utf-8-bom-signature-in-php-files – kinakuta Jun 13 '11 at 22:38
  • This question is asking for opinions and facts on the overall process rather than the individual error. – Lizard Jun 13 '11 at 22:43
  • You've got a well known opinion about this in my answer here and a very general hint for the overall process ;) – hakre Jun 13 '11 at 22:54

2 Answers

You cannot detect the encoding of any data. Encoding is always meta-information that travels alongside the data itself.

Even though mb_detect_encoding() tries its best to do so, you should never use it to handle data automatically. Because it is not possible to detect the encoding from the data itself, this function cannot do it either.

Don't rely on it. Use it only for manual inspection when you need to debug a problem, or as a last-resort fallback, but never in standard data processing. And even then, do not trust that information too much.

How can I say so? Just one example: a text can be validly encoded as US-ASCII, and a detection routine for UTF-8 will report that the same text is validly encoded as UTF-8. And that's just one example. The truth is that this is much more complex.
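To make the ambiguity concrete, here is a minimal sketch (the sample strings are made up for illustration, and it assumes the mbstring extension is loaded). The same ASCII bytes pass validation under more than one encoding, and a single high byte converts to different characters depending on which source encoding you claim it has:

```php
<?php
// Plain ASCII bytes validate as ASCII and as UTF-8 at the same time,
// so validation alone cannot tell you which one the sender meant.
$ascii = "It's plain text";
$validAsAscii = mb_check_encoding($ascii, 'ASCII');
$validAsUtf8  = mb_check_encoding($ascii, 'UTF-8');

// One byte, two equally plausible readings: nothing in the byte itself
// says whether it came from ISO-8859-1 or ISO-8859-15.
$byte = "\xA4";
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');  // currency sign, U+00A4
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15'); // euro sign, U+20AC

var_dump($validAsAscii, $validAsUtf8);
var_dump(bin2hex($asLatin1), bin2hex($asLatin9));
```

Both conversions succeed without any error, which is why a guess made from the bytes alone can silently produce the wrong characters.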

So take it for granted that you cannot detect the encoding from the raw data.

Instead, look for the meta-information that specifies the encoding. If no encoding information is given, look up the default encoding in the specification documents for the transport of the data.

In your case of storing data from RSS feeds, look up the information in the response headers and/or the XML prologue. The prologue normally states the encoding of the document in ISO notation.

As your database expects data encoded as UTF-8, your processing must ensure that only UTF-8 data is put into the database. So check and acquire the encoding of the data, and then do the steps needed to change the encoding. But do not rely on mb_detect_encoding() to perform these steps.
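As a rough sketch of that approach, the code below reads the declared encoding and converts with it instead of guessing. The helper names declaredEncoding() and toUtf8(), the sample feed, and the order of precedence (header first, then prologue, then an assumed UTF-8 default) are my own illustration, not a fixed recipe:

```php
<?php
// Hypothetical helper: return the encoding the feed declares about itself.
function declaredEncoding(string $body, ?string $contentType = null): string
{
    // 1. charset parameter of the HTTP Content-Type header, if one was sent.
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // 2. encoding pseudo-attribute of the XML prologue.
    if (preg_match('/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([\w.:-]+)["\']/i', $body, $m)) {
        return strtoupper($m[1]);
    }
    // 3. Assumed fallback for XML that declares nothing.
    return 'UTF-8';
}

// Hypothetical helper: convert using the declared encoding, then validate.
function toUtf8(string $text, string $sourceEncoding): string
{
    $utf8 = $sourceEncoding === 'UTF-8'
        ? $text
        : mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
    if (!mb_check_encoding($utf8, 'UTF-8')) {
        throw new RuntimeException("Data does not match declared encoding $sourceEncoding");
    }
    return $utf8;
}

// Made-up feed whose prologue declares ISO-8859-1.
$feed = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><rss><title>Caf\xE9</title></rss>";
$encoding = declaredEncoding($feed);
$title = toUtf8("Caf\xE9", $encoding); // the raw title bytes taken from the feed
```

The final mb_check_encoding() call is only a sanity check that the result is well-formed UTF-8 before it goes into the database; it does not prove the declaration was correct.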

hakre

In the end it was a BOM that was required for the external application to read the file properly.
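For anyone who needs the same thing, here is a minimal sketch of writing the BOM with fwrite() and then checking the bytes that ended up on disk. The temporary path and the sample rows are placeholders. Note that the It<E2><80><99>s shown by less in the question is what the UTF-8 bytes of a right single quotation mark (U+2019) look like, so the data itself was consistent with UTF-8:

```php
<?php
// Placeholder output path; use your real CSV path instead.
$path = tempnam(sys_get_temp_dir(), 'csv');

$handle = fopen($path, 'wb');
fwrite($handle, "\xEF\xBB\xBF");      // UTF-8 byte order mark, written once at the start
fwrite($handle, "id,title\n");
fwrite($handle, "1,It\u{2019}s\n");   // U+2019 is stored as the bytes E2 80 99
fclose($handle);

// Check what was actually written.
$bytes  = file_get_contents($path);
$hasBom = strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');

var_dump($hasBom, $isUtf8);
echo bin2hex(substr($bytes, 0, 3)), "\n";
```

Whether a consumer wants the BOM or chokes on it depends on the consumer, so it is worth asking the third party which they expect.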

Lizard