I'm reading a CSV file in PHP, and as far as I understand, this kind of file can be in any encoding ever invented by humans. I guess I have a MacRoman-encoded CSV; I'm working on a Mac.

So far, so good (not good at all, but that's another topic). Now, while iterating through the lines, I have a value like:

Z�rich

Obviously, it should be "Zürich"; the "ü" is missing.

I have tried almost everything. mb_detect_encoding returns false, so it can't tell what the encoding is.

Then I found a great class by Sebastian Grignoli, described in "Detect encoding and make everything UTF-8".

It seems nice, but all I got is:

ZŸrich

Not really the "ü" I expected :D

Then I found out that utf8_encode sort of works; it generates:

Z\u009Frich

But what now? If I put this directly into the database, the final value is "Zrich", which means it is still not really UTF-8, or is the database just struggling with the escaped variant? When I run mb_detect_encoding on that value, it now says "UTF-8". Nice, but how do I go on from here? How can I get my "Zürich" into proper UTF-8?

jebbie
  • What's the encoding on the database column storing these values? – George Brighton Sep 11 '13 at 12:29
  • Change the character set to utf8 when creating the table. – rams0610 Sep 11 '13 at 12:32
  • In my application everything is UTF-8, from the table to the code to the browser. The problem occurs when I read a file uploaded by a user that was created with MS Excel on some client machine, so the file can be in any encoding and I have no control over that :/ (source: http://stackoverflow.com/questions/508558/what-charset-does-microsoft-excel-use-when-saving-files) – jebbie Sep 11 '13 at 18:05

2 Answers

You can probably use iconv for the conversion. On my installation, the MacRoman encoding is called simply "MAC":

// 0x9F is the MacRoman byte for "ü"
$city = "Z\x9frich";
// Source encoding first, then target encoding
$city = iconv("MAC", "UTF-8", $city);
echo $city; // Output: Zürich
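The two wrong results in the question look consistent with the byte 0x9F being decoded under the wrong single-byte encoding. Here is a small sketch to check that, assuming the raw value is Z\x9frich as above and that the local iconv build knows the names MACINTOSH, WINDOWS-1252 and ISO-8859-1 (these names are assumptions; they vary between installations, and "MAC" is what worked for me):

```php
<?php
// The raw bytes as read from the CSV: 0x9F is "ü" only in MacRoman.
$raw = "Z\x9frich";

// Candidate source encodings. The names are assumptions and may differ
// between iconv implementations.
$candidates = ["MACINTOSH", "WINDOWS-1252", "ISO-8859-1"];

$decoded = [];
foreach ($candidates as $encoding) {
    // iconv() returns false if the name is unknown or the conversion fails;
    // @ silences the warning it raises in that case.
    $decoded[$encoding] = @iconv($encoding, "UTF-8", $raw);
}

foreach ($decoded as $encoding => $text) {
    // json_encode() escapes non-ASCII characters, so an invisible control
    // character such as U+009F shows up as an escape sequence.
    echo $encoding, " => ", $text === false ? "(failed)" : json_encode($text), PHP_EOL;
}
```

In Windows-1252 the byte 0x9F is "Ÿ" (U+0178), which would explain the "ZŸrich" result. utf8_encode treats its input as ISO-8859-1, where 0x9F is the invisible control character U+009F; that matches "Z\u009Frich", and a database may well drop such a character on insert.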
Joni
  • iconv can even correct the text directly from Z�rich to Zürich, but you have to know the encoding, which is quite hard when mb_detect_encoding always returns false, so I wrote my own detection like the one described here: http://php.net/manual/de/function.mb-detect-encoding.php – jebbie Sep 15 '13 at 12:26
  • It's hard to distinguish between single-byte encodings because every byte sequence is valid, unlike in variable-byte encodings. If you don't have any information about the origin of the text, you'll have to make a guess based on letter or n-gram frequencies, for example. – Joni Sep 15 '13 at 12:37
  • Sounds interesting. I have already implemented a kind of wild guessing like the one described in the PHP doc comments, but I'll have to google the approach you describe ;) – jebbie Sep 15 '13 at 12:44
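The frequency-based guess mentioned in these comments could be sketched roughly as follows. This is only a crude illustration, not a real n-gram model: the helper guessEncoding, the candidate list and the set of expected letters are all assumptions made up for the example, and the encoding names depend on the local iconv build.

```php
<?php
/**
 * Hypothetical helper: guess the encoding of $raw by decoding it under each
 * candidate and scoring the non-ASCII characters that come out.
 */
function guessEncoding(string $raw, array $candidates, string $expectedLetters): ?string
{
    // preg_match() with the /u flag only succeeds on valid UTF-8.
    if (preg_match('//u', $raw) === 1) {
        return "UTF-8";
    }

    $best = null;
    $bestScore = null;
    foreach ($candidates as $encoding) {
        $text = @iconv($encoding, "UTF-8", $raw);
        if ($text === false) {
            continue; // unknown encoding name or undecodable bytes
        }
        $score = 0;
        foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            if (strlen($char) === 1) {
                continue; // ASCII is the same in the candidates assumed here
            }
            // strpos() is byte-based, which is safe because both strings
            // are valid UTF-8 at this point.
            $score += strpos($expectedLetters, $char) !== false ? 1 : -1;
        }
        // Keep the first candidate with the strictly highest score.
        if ($bestScore === null || $score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best;
}

// Letters one would expect in German or French place names (an assumption).
$letters = "äöüÄÖÜßéèàç";
echo guessEncoding("Z\x9frich", ["MACINTOSH", "WINDOWS-1252", "ISO-8859-1"], $letters), PHP_EOL;
```

Ties go to the earlier candidate, so the order of the list acts as a prior. With very short inputs or text outside the expected letter set the guess can easily be wrong, so asking the user for the encoding remains the safer option when that is possible.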
Try converting the whole file with iconv first and importing it afterwards, or iterate over every line and convert each one with iconv.

You must know the original encoding of your file.
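A minimal sketch of the "convert the whole file first" variant, assuming the source encoding is already known. The helper name readCsvAsUtf8 and the default delimiter are made up for this example, and the encoding name passed in has to be one the local iconv accepts:

```php
<?php
/**
 * Hypothetical helper: read a CSV file in a known source encoding and
 * return its rows as arrays of UTF-8 strings.
 */
function readCsvAsUtf8(string $path, string $sourceEncoding, string $delimiter = ","): array
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Convert the whole file in one go.
    $utf8 = iconv($sourceEncoding, "UTF-8", $bytes);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert from $sourceEncoding");
    }

    // Parse from an in-memory stream so that quoted fields spanning
    // several lines are still handled by fgetcsv().
    $stream = fopen("php://temp", "r+");
    fwrite($stream, $utf8);
    rewind($stream);

    $rows = [];
    while (($row = fgetcsv($stream, 0, $delimiter, '"', "\\")) !== false) {
        $rows[] = $row;
    }
    fclose($stream);

    return $rows;
}
```

It could be called as readCsvAsUtf8($uploadedPath, "MACINTOSH", ";"), where $uploadedPath is a placeholder for the uploaded file. Loading the whole file into memory is fine for typical spreadsheet exports; for very large files, converting line by line would use less memory.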

David