I am importing utf8 data from MySql to HDFS using sqoop import. It works fine but facing issue when the data is utf-8. The source MySql table is utf-8 compatible but looks like sqoop is converting the data during import. Example - The source value - л.с. is loaded as л.Ñ. to HDFS.
Currently, Mysql (v5.6.10) character set & collation given below :
+--------------------------+-----------------------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8_unicode_ci |
+--------------------------+-----------------------------------------+
-- Table Structure
CREATE TABLE utf_test_cases_ms
(
test_case varchar(50) NOT NULL,
english_lang varchar(250) NOT NULL,
language_name varchar(50) NOT NULL,
utf8_lang varchar(300) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ;
-- Mysql
select * from utf_test_cases_ms;
+--------------------+--------------+---------------+-----------+
| test_case | english_lang | language_name | utf8_lang |
+--------------------+--------------+---------------+-----------+
| Multiple Character | hp | Russian | л.с. |
+--------------------+--------------+---------------+-----------+
-- Sqoop Import Command
sqoop import --connect jdbc:mysql://<<IP_ADDRESS_WITH_PORT>>/<<DB_NAME>>
--table utfmb_test_cases_ms --username sqoop_user --password sqoop_pwd
--hive-import --hive-table utf_ms_db.utfmb_test_cases_ms
--create-hive-table --null-string '\\N' --null-non-string '\\N'
--fields-terminated-by '|' --lines-terminated-by '\n' -m 1
-- Hive (HDFS)
select * from utfmb_test_cases_ms;
Multiple Character hp Russian л.Ñ.
Do I need to change the character set & collation in Mysql Config file ? Do I need to pass an extra unicode / utf8 parameters while importing data via sqoop ?
Please provide a solution for this. Thanks in advance!
(from comment)
CREATE TABLE utf_test_cases_ms (
test_case varchar(50) NOT NULL,
english_lang varchar(250) NOT NULL,
language_name varchar(50) NOT NULL,
utf8_lang varchar(300) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ;