
I have a MySQL database with the client and server character sets all set to utf8. CSV files arrive in multiple encodings, and I have to load them into the database using JDBC. But when an incoming file is ANSI-encoded, LOAD DATA INFILE fails with:

java.sql.SQLException: Invalid utf8 character string: '1080'

I am creating a table table_abc based on the CSV headers and then using the query below to load the CSV file into the database:

LOAD DATA LOCAL INFILE 'XXX.csv' INTO TABLE table_abc
    CHARACTER SET UTF8
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES

Here is my DB definition

character_set_client    utf8
character_set_connection    utf8
character_set_database  utf8
character_set_filesystem    binary
character_set_results   utf8
character_set_server    utf8
character_set_system    utf8
character_sets_dir  C:\Program Files\MySQL\MySQL Server 5.7\share\charsets\

What should I do now?

  1. Should I convert all files to UTF-8 before uploading? If yes, how do I do that in Java? (See the sketch after this list.)
  2. Should I have a separately encoded table for each file encoding? If yes, how do I detect the encoding of an incoming file in Java?
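
To make option 1 concrete, here is a minimal sketch of the conversion I have in mind, assuming the source charset of each file is known; windows-1252 stands in for "ANSI" (its usual meaning on a Western-locale Windows machine), and the file names are placeholders:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvToUtf8 {

    // Rewrites a file from a known source charset to UTF-8.
    // Malformed/unmappable bytes are replaced instead of raising an exception,
    // which matches "no issue with losing non-UTF-8 characters".
    static void toUtf8(Path source, Path target, Charset sourceCharset) throws IOException {
        CharsetDecoder decoder = sourceCharset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(source), decoder));
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // hypothetical file names; windows-1252 is an assumption for "ANSI"
        toUtf8(Paths.get("XXX.csv"), Paths.get("XXX-utf8.csv"), Charset.forName("windows-1252"));
    }
}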

P.S. I have no issue with losing non-UTF-8 characters while loading into the table; my only goal is a successful upload of the file into the DB, without any error, irrespective of encoding.

Thanks

AngryLeo
  • Without seeing the code that reads the file, the JDBC code, and a definition of your database table, it is difficult to understand your problem. Are you uploading the entire CSV file as a single text value? A binary blob? – VGR Jan 31 '17 at 12:58
  • I have edited the question, please take a look. :) – AngryLeo Jan 31 '17 at 13:06
  • Are you asking whether the best practice would be to change the encoding of the file, or change the `CHARACTER SET` portion of your SQL? – VGR Jan 31 '17 at 13:10
  • Yes. If I can change the encoding of the CSV file, that would serve the purpose. But I don't know the file's encoding beforehand. – AngryLeo Jan 31 '17 at 13:27
  • The source of each file (person or program) knows what its encoding is; that is pretty much the only way to know a file's encoding. You cannot auto-detect a file's charset unless it starts with a byte order mark (which most files don't); see http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream . – VGR Jan 31 '17 at 14:51
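
To illustrate the BOM case mentioned in that last comment, a minimal sketch; it only recognizes UTF-8/UTF-16 marks and returns null for everything else, since no more can be inferred from the bytes alone:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomSniffer {

    // Returns the charset implied by a leading byte order mark, or null if there is none.
    static Charset fromBom(Path file) throws IOException {
        byte[] head = new byte[4];
        int n;
        try (InputStream in = Files.newInputStream(file)) {
            n = in.read(head);
        }
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            return StandardCharsets.UTF_8;      // EF BB BF
        }
        if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            return StandardCharsets.UTF_16BE;   // FE FF
        }
        if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            return StandardCharsets.UTF_16LE;   // FF FE
        }
        return null; // no BOM: the encoding cannot be detected from the content
    }
}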

1 Answer


If you mean that some columns are utf8 and some columns are, say, latin1, then it gets a bit complicated, but still possible.

Create a "staging" table to put the data into from the LOAD. But have all the VARCHAR columns be VARBINARY and TEXT be BLOB. This way the data bytes will be loaded unchanged.

Then ALTER that table to convert the binary/blob columns to the suitable varchar/text types:

ALTER TABLE ...
    MODIFY COLUMN col1 VARCHAR(111) CHARACTER SET ... COLLATE ...,
    MODIFY COLUMN col2 TEXT CHARACTER SET ... COLLATE ...,
    ...;

Then copy the data over to your 'real' table (unless this table is sufficient).
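
Continuing the sketch above: once the ALTER has reinterpreted the bytes (latin1 is assumed here as the file's real charset), a plain INSERT ... SELECT into the utf8 table performs the conversion; table names remain hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class StagingConvert {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "pass");
             Statement st = conn.createStatement()) {

            // Reinterpret the raw bytes as text in the charset the file actually had
            // (latin1 assumed here)
            st.execute("ALTER TABLE staging_abc"
                     + " MODIFY COLUMN col1 VARCHAR(255) CHARACTER SET latin1,"
                     + " MODIFY COLUMN col2 TEXT CHARACTER SET latin1");

            // Copying into the utf8 target table converts latin1 -> utf8 on the fly
            st.execute("INSERT INTO table_abc (col1, col2)"
                     + " SELECT col1, col2 FROM staging_abc");
        }
    }
}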

If one column has a mixture of encodings, you are SOOL.

Identifying a charset

Provide a sample or two of the HEX of non-English characters in the column; I can usually spot what it is from that, and the hex patterns themselves give clues for recognizing a charset.
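
For example, something like this (staging table name assumed) pulls out the hex for inspection; latin1 'é' shows up as E9, while utf8 'é' shows up as C3A9:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HexSample {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/mydb", "user", "pass");
             Statement st = conn.createStatement();
             // HEX() exposes the raw bytes stored in the column
             ResultSet rs = st.executeQuery(
                     "SELECT col1, HEX(col1) FROM staging_abc LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getString(2));
            }
        }
    }
}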

Rick James