
I am importing UTF-8 data from MySQL to HDFS using sqoop import. The import itself works fine, but I am facing an issue when the data is UTF-8. The source MySQL table is UTF-8 compatible, but it looks like sqoop is converting the data during import. Example: the source value л.с. is loaded to HDFS as л.Ñ.

Currently, the MySQL (v5.6.10) character set & collation settings are given below:

+--------------------------+-----------------------------------------+
| Variable_name            | Value                                   |
+--------------------------+-----------------------------------------+
| character_set_client     | latin1                                  |
| character_set_connection | latin1                                  |
| character_set_database   | latin1                                  |
| character_set_filesystem | binary                                  |
| character_set_results    | latin1                                  |
| character_set_server     | utf8                                    |
| character_set_system     | utf8                                    |
| collation_connection     | latin1_swedish_ci                       |
| collation_database       | latin1_swedish_ci                       |
| collation_server         | utf8_unicode_ci                         |
+--------------------------+-----------------------------------------+

-- Table Structure
CREATE TABLE utf_test_cases_ms 
(
  test_case varchar(50) NOT NULL,
  english_lang varchar(250) NOT NULL,
  language_name varchar(50) NOT NULL,
  utf8_lang varchar(300) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ;

-- Mysql
select * from utf_test_cases_ms;
+--------------------+--------------+---------------+-----------+
| test_case          | english_lang | language_name | utf8_lang |
+--------------------+--------------+---------------+-----------+
| Multiple Character | hp           | Russian       | л.с.    |
+--------------------+--------------+---------------+-----------+
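
To double-check that MySQL itself stores valid UTF-8 bytes, the hex of the column can be inspected; for correctly stored UTF-8, л.с. should come back as D0BB2ED1812E:

-- Check the raw bytes of the stored value
select utf8_lang, hex(utf8_lang) from utf_test_cases_ms;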

-- Sqoop Import Command
sqoop import --connect jdbc:mysql://<<IP_ADDRESS_WITH_PORT>>/<<DB_NAME>> \
--table utfmb_test_cases_ms --username sqoop_user --password sqoop_pwd \
--hive-import --hive-table utf_ms_db.utfmb_test_cases_ms \
--create-hive-table --null-string '\\N' --null-non-string '\\N' \
--fields-terminated-by '|' --lines-terminated-by '\n' -m 1

-- Hive (HDFS)
select * from utfmb_test_cases_ms;
Multiple Character  hp  Russian л.Ñ.

Do I need to change the character set & collation in the MySQL config file? Do I need to pass an extra unicode/UTF-8 parameter while importing data via sqoop?
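
For example, would adding connection properties to the JDBC URL help? Something along these lines (a sketch only, untested here; useUnicode and characterEncoding are standard MySQL Connector/J properties):

sqoop import --connect "jdbc:mysql://<<IP_ADDRESS_WITH_PORT>>/<<DB_NAME>>?useUnicode=true&characterEncoding=UTF-8" \
--table utfmb_test_cases_ms --username sqoop_user --password sqoop_pwd \
--hive-import --hive-table utf_ms_db.utfmb_test_cases_ms -m 1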

Please provide a solution for this. Thanks in advance!


2 Answers


Pass a character-set parameter in the sqoop command; with --direct, arguments after the bare -- are handed to the underlying MySQL tool, which accepts --default-character-set:

sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
marjun

Perhaps @marjun's suggestion should involve utf8?
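
That is, something along these lines (an untested sketch of the same command with utf8 substituted):

sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=utf8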

Anyway, л.с. coming out as л.Ñ. is "Mojibake" as discussed in http://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored

Please provide the output of SHOW CREATE TABLE so we can verify that the character set was set correctly.

If the connection parameters don't work, use SET NAMES utf8.
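
SET NAMES utf8 sets the three session variables that are latin1 in the question, character_set_client, character_set_connection and character_set_results; a quick way to confirm (run in the same session, after connecting):

-- After connecting, before reading or writing the data
SET NAMES utf8;
SHOW VARIABLES LIKE 'character_set_c%';  -- client & connection should now show utf8
SHOW VARIABLES LIKE 'character_set_results';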

Rick James
  • SHOW CREATE TABLE utf_test_cases_ms gives: CREATE TABLE utf_test_cases_ms ( test_case varchar(50) NOT NULL, english_lang varchar(250) NOT NULL, language_name varchar(50) NOT NULL, utf8_lang varchar(300) NOT NULL ) ENGINE=MyISAM DEFAULT CHARSET=utf8. Tried with SET NAMES utf8, but the issue still persists. – M S Sep 13 '17 at 08:39
  • `л.с.` cannot be `latin1`; you must tell MySQL that your data is utf8. The `SET NAMES` must be executed after connecting and before `INSERTing`. – Rick James Sep 13 '17 at 14:51