
I'd like to be able to convert from any charset to clean UTF-8 in a single call (we're using PHP).

It's for Apache Solr indexing; the problem is that the XML parser Solr uses (written in Java) throws an exception whenever it encounters illegal UTF-8.

We tried iconv(), but it sometimes clips the string after emitting a Warning, losing data, even with //TRANSLIT and/or //IGNORE enabled.
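For reference, the kind of call we're making looks roughly like this (`$fromEncoding` is illustrative; it comes from our import metadata):

```php
// Roughly what we tried ($fromEncoding and $dirty are illustrative).
// On an illegal input sequence, iconv() emits a Warning and -- in our
// experience -- returns the string truncated at the offending byte.
$clean = iconv($fromEncoding, 'UTF-8//TRANSLIT//IGNORE', $dirty);
```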

utf8_encode() only converts from Latin-1 (ISO-8859-1).

We're importing many documents from many sources, in many different encodings, and we need perfectly clean UTF-8 output. We're not concerned about time or resource costs.

Thanks for your wise answers!

  • Maybe trying to fix the problems with `iconv` would be productive? Also, have you tried [`mb_convert_encoding`](http://php.net/manual/en/function.mb-convert-encoding.php)? – Jon Dec 08 '11 at 15:22
  • Whenever I had problems with both functions I tried htmlentities and afterwards html_entity_decode with UTF-8 encoding. It's a workaround, but it might just work for your case too (sketched after these comments). – Ciprian Mocanu Dec 08 '11 at 15:27
  • @Jon : we have a function doing that but it's dirty and we're quite sure it's incomplete. We need something 100% tested and approved. If we happen to alter our fixes in some way, we'll have to re-index our data. Also, mb_convert_encoding needs an input charset; we're searching for a universal converter. – user1087972 Dec 08 '11 at 15:47
  • @CiprianMocanu: so you're using both functions one after another? Can this technique silently eliminate illegal UTF-8 characters? – user1087972 Dec 08 '11 at 15:51
  • @user1087972 It should eliminate illegal UTF-8 chars and if it doesn't, then preg_replace the hell out of that string! – Ciprian Mocanu Dec 09 '11 at 06:35
  • Take a look at [my answer here](http://stackoverflow.com/questions/910793/detect-encoding-and-make-everything-utf-8/3479832#3479832). That function is also Latin-1-centric, but it takes care of some problems on its own. – Sebastián Grignoli Jan 08 '13 at 14:36
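For what it's worth, the round trip Ciprian describes might look like this (a sketch, assuming PHP 5.3+ for the ENT_IGNORE flag; `$dirty` is an illustrative variable):

```php
// Sketch of the htmlentities/html_entity_decode round trip.
// ENT_IGNORE (PHP 5.3+) makes htmlentities silently drop illegal
// sequences instead of returning an empty string on invalid UTF-8.
$entities = htmlentities($dirty, ENT_QUOTES | ENT_IGNORE, 'UTF-8');
$clean    = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');
```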

1 Answer

  • You can try using mb_convert_encoding and mb_detect_encoding instead (see the sketch after this list).
  • When importing those documents, you really want a declared content encoding. If you're indexing from the web, look at the Content-Type header and the contents of the actual HTML file (e.g. a <meta charset> tag). Always use this as your primary source; maybe fall back to detecting, but detecting is really just guessing.
  • If those two options don't help, I'd suggest writing your own code to detect invalid bytes in the stream, then replacing them and using iconv().
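A minimal sketch of how the first and third suggestions could fit together (the candidate encoding list and the ISO-8859-1 fallback are illustrative assumptions, not requirements; the final iconv() pass is a shortcut for the hand-written scrubbing step):

```php
<?php
// Sketch: normalize an arbitrarily-encoded string to clean UTF-8.
function toCleanUtf8($raw, $declaredEncoding = null)
{
    // 1. Prefer the encoding the source declared (HTTP header, <meta> tag).
    $from = $declaredEncoding;

    // 2. Fall back to detection -- which is really just guessing.
    //    The candidate list is illustrative; extend it with whatever
    //    encodings your sources actually use.
    if ($from === null) {
        $from = mb_detect_encoding(
            $raw,
            array('UTF-8', 'ISO-8859-1', 'Windows-1252', 'ISO-8859-15'),
            true // strict mode
        );
    }

    // 3. Convert; if detection failed entirely, assume Latin-1 so we
    //    at least get a deterministic result.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', $from ?: 'ISO-8859-1');

    // 4. Final scrub: a UTF-8 to UTF-8 pass with //IGNORE silently
    //    drops any byte sequence that is still not valid UTF-8.
    return iconv('UTF-8', 'UTF-8//IGNORE', $utf8);
}
```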

The reason iconv doesn't continue after an error is pretty simple: in some character encodings it's important that the bytes are read correctly, because a character may span multiple bytes. UTF-8 compensates for this by using a bitmask to detect where a character ends, but not all encodings have this. In such an encoding a single wrong byte means the rest of the string could be garbled, which isn't what you want. (I'm not entirely sure, but you should be able to replicate this by taking a UTF-16 string and removing the fifth byte in the file.)

Hey, I'll even illustrate the issue :-) Below is a (sort of) UTF-16 example which uses 2 bytes per character.

[74 00] [65 00] [73 00] [74 00] = test

Now let's remove a single byte - here it's the first 0x00

[74 65] [00 73] [00 74] [00] = ....

I have no idea what it would actually become, but as you can see it simply breaks up the rest of the string the moment one byte is missing. If you're lucky you'd be indexing in Chinese.
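If you want to replicate the experiment in PHP, something like the following should do it (a sketch; the exact output of the final conversion depends on the iconv implementation):

```php
<?php
// Build "test" as UTF-16LE: 74 00 65 00 73 00 74 00
$utf16 = mb_convert_encoding('test', 'UTF-16LE', 'UTF-8');

// Drop the second byte (the first 0x00), shifting everything after it.
$broken = substr($utf16, 0, 1) . substr($utf16, 2);

// The pairs now read [74 65] [00 73] [00 74] [00]: the first pair
// decodes to U+6574, a CJK character, and the string ends in half
// a code unit.
var_dump(@iconv('UTF-16LE', 'UTF-8', $broken));
// iconv stops at the truncated final unit; depending on the version
// it returns false or a partially-converted (garbled) string.
```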

– Tom van der Woerdt