
I wrote a script that lets users import many users at once from a CSV file. I'm using MySQL's LOAD DATA LOCAL INFILE to make this work:

$query = "LOAD DATA LOCAL INFILE $file INTO TABLE my_table
FIELDS TERMINATED BY $delimiter
LINES TERMINATED BY '\\n'
(email, name, organization)";

But a user tried to import a document that contained the name Günther. It was saved to the database as "G" (cutting off the rest). The document turned out to be in latin1, which caused the problem. I don't want to bother my users with character sets and the like.

I know about the CHARACTER SET option that LOAD DATA LOCAL INFILE supports. But even though I don't get an error when I put CHARACTER SET latin1 in my query, I want everything to be UTF-8. And what happens if another user uploads a file that's in neither UTF-8 nor latin1?

So how do I find out which character set the uploaded document uses, and how do I convert it to UTF-8?

David Janssen
  • That's theoretically impossible ;) Fortunately, there are some heuristics that can help you guess the correct encoding with a high probability. See e.g. this: http://stackoverflow.com/questions/9824902/iconv-any-encoding-to-utf-8 – Piskvor left the building Jun 27 '14 at 11:41

1 Answer


You can detect the character encoding with mb_detect_encoding before running the $query. It returns the most likely encoding of the data, so you can check it before you load your file.

Suppose the file's contents (not its name) are in $str, e.g. read with file_get_contents.

Here is a basic example that might help.

<?php
/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);

/* "auto" is expanded according to mbstring.language */
echo mb_detect_encoding($str, "auto");

/* Specify encoding_list character encoding by comma separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

/* Use array to specify encoding_list  */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);
?>

See php.net's documentation for mb_detect_encoding.
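The question also asks how to convert the data once an encoding has been guessed. Below is a minimal sketch of that step, not a tested solution. The helper name toUtf8 and the candidate list (Windows-1252, ISO-8859-1) are assumptions for illustration; adjust the candidates to the encodings your users are likely to produce. It checks for valid UTF-8 first, then falls back to strict detection among the candidates.

```php
<?php
/**
 * Hypothetical helper (not part of the original answer): returns $raw as
 * UTF-8, or null when none of the candidate encodings fits.
 */
function toUtf8(string $raw, array $candidates = ['Windows-1252', 'ISO-8859-1']): ?string
{
    // Valid UTF-8 (which includes plain ASCII) is passed through unchanged.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // Strict mode rejects candidates for which $raw contains invalid bytes.
    $detected = mb_detect_encoding($raw, $candidates, true);
    if ($detected === false) {
        return null; // The caller should reject the upload or ask the user.
    }

    return mb_convert_encoding($raw, 'UTF-8', $detected);
}

// "Günther" as latin1 bytes: the ü is the single byte 0xFC.
$latin1 = "G\xFCnther";
$utf8 = toUtf8($latin1);
```

The converted contents could then be written to a temporary file and loaded with a CHARACTER SET utf8mb4 (or utf8) clause in the LOAD DATA statement, so that MySQL always receives UTF-8 regardless of what was uploaded.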

This is just a workaround, and it is a heuristic: the detected encoding can be wrong. Make sure you handle all the failure cases that might occur (which might be tedious, I guess).
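To illustrate why detection can only ever be a guess, here is a small sketch. The choice of byte and of the two encodings is mine, picked for illustration. The same single byte is valid in both ISO-8859-1 and ISO-8859-15 but stands for a different character in each, so the bytes alone cannot reveal which one the author meant.

```php
<?php
// The byte 0xA4 is a valid character in both encodings.
$byte = "\xA4";

// Interpreted as ISO-8859-1 it is the generic currency sign (U+00A4).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Interpreted as ISO-8859-15 it is the euro sign (U+20AC).
$asLatin9 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-15');

// Both conversions succeed, yet they produce different text.
var_dump($asLatin1 === $asLatin9);
```

If the candidate encodings overlap like this, showing users a preview of the imported names before committing the import is one way to catch a wrong guess.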

There is a class on phpclasses.org that might suit your requirements (I haven't tested the code).

Guns
  • You should at least mention that this is only a heuristic, i.e. can go horribly wrong. – Deduplicator Jun 27 '14 at 11:45
  • Yes @Deduplicator, you are right. should make an edit right away. Thanks! – Guns Jun 27 '14 at 11:46
  • @Guns Thank you very much for your answer. Really appreciate it. But I can't help thinking that I can't be the first person to stumble upon this. As you and 'Deduplicator' point out there's much more to it than this script. Isn't there an open-source class that's working quite well? – David Janssen Jun 27 '14 at 12:55
  • @DavidJanssen, modified the answer with a link. This might help you. – Guns Jun 28 '14 at 19:33