I have a table with ~300 GB of data in MySQL, and I want to add a new column to it. When I alter the table using the ALTER command, it never completes and the process dies. So I was planning to write a Sqoop job to pull all the data from the table, dump it into HDFS, then create a new database in MySQL with a table that has the additional column, and re-import the data from HDFS.
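
For the dump step, the planned Sqoop import would be something along these lines (a sketch only; it reuses the connection details and comma delimiter from the export command shown further down, and <dir-path> is just a placeholder):

sqoop import --connect jdbc:mysql://localhost:3306/graph1 --table nodes --username root --password password --target-dir <dir-path> --fields-terminated-by ','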

table structure:
CREATE TABLE `nodes` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `type` int(11) NOT NULL,
  `name` varchar(50) NOT NULL,
  `postcode` varchar(4) DEFAULT NULL,
  `updated` datetime DEFAULT NULL,
  `db_updated` datetime DEFAULT CURRENT_TIMESTAMP,
  `useragent` bigint(20) DEFAULT NULL,
  `last_seen` date DEFAULT NULL, -- newly added column
  PRIMARY KEY (`id`),
  UNIQUE KEY `akaid_index` (`type`,`name`),
  KEY `useragent_idx` (`useragent`),
  KEY `type` (`type`),
  CONSTRAINT `useragentfk` FOREIGN KEY (`useragent`) REFERENCES `useragents` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=1091725696 DEFAULT CHARSET=latin1;
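
For reference, the ALTER that never completes is a plain column add for the last_seen column marked above, roughly:

ALTER TABLE `nodes` ADD COLUMN `last_seen` date DEFAULT NULL;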

Sqoop command:

sqoop export --connect jdbc:mysql://localhost:3306/graph1 --table nodes --username root --password password --export-dir <dir-path> --input-fields-terminated-by ','

I'm getting the error below:

Error: java.io.IOException: Can't export data, please check failed map task logs
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.RuntimeException: Can't parse input data: 'NULL'
    at nodes.__loadFromFields(nodes.java:470)
    at nodes.parse(nodes.java:388)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
    ... 10 more
Caused by: java.lang.NumberFormatException: For input string: "NULL"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:441)
    at java.lang.Long.valueOf(Long.java:540)
    at nodes.__loadFromFields(nodes.java:467)
    ... 12 more

The file in HDFS contains the following records:

1289603991,1,fee1cee723bdb0bc499c443765b40e3d,,2016-04-13 10:19:59,2016-04-14 03:44:55,5296252
1289603992,1,edf65c2e7b89388fe9068cc3a898a3fd,,2016-04-13 10:20:00,2016-04-14 03:44:55,5411481
1289603993,1,5760fd1cca92a65ce6f2db43853fc118,,2016-04-13 10:19:59,2016-04-14 03:44:55,4441745
1289603994,1,65dd92c80df5581f55bc60f3e997ec05,,2016-04-13 10:19:59,2016-04-14 03:44:55,5332084
1289603995,1,7654a84428f3064828f5972cfce5f8e6,,2016-04-13 10:20:00,2016-04-14 03:44:55,5202243
1289603996,1,84c270212fe5f3a52cb2bd75403da058,,2016-04-13 10:20:00,2016-04-14 03:44:55,5398729
1289603997,1,a486382c4fc296a5e8d3c0491568c22c,,2016-04-13 10:19:57,2016-04-14 03:44:55,5289170
1289603998,112,2_3Nns7YXPmS_xv3imJBiw04BQf1sNc2tJrtFJ5TCx98,,2016-04-13 10:20:00,2016-04-14 03:44:55,NULL
1289603999,1,a3607df77e025b12c1728f62589857fa,,2016-04-13 10:19:59,2016-04-14 03:44:55,12
1289604000,113,570e1d4e6372cd9c,,2016-04-13 10:19:59,2016-04-14 03:44:55,NULL
1289604001,113,57023dd016258fbf,,2016-04-13 10:20:00,2016-04-14 03:44:55,NULL
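
Sqoop's --input-null-string and --input-null-non-string arguments handle the literal 'NULL' values in these records; with them added, the export command looks roughly like this (exact quoting may differ):

sqoop export --connect jdbc:mysql://localhost:3306/graph1 --table nodes --username root --password password --export-dir <dir-path> --input-fields-terminated-by ',' --input-null-string 'NULL' --input-null-non-string 'NULL'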

After adding --input-null-string and --input-null-non-string (as shown above), the Sqoop job parses the data, but now it fails for the following reason:

2016-06-07 18:11:37,750 ERROR [Thread-9] org.apache.sqoop.mapreduce.AsyncSqlOutputFormat: Got exception in update thread: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '1289603991' for key 'PRIMARY'
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
    at com.mysql.jdbc.Util.getInstance(Util.java:387)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3966)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3902)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2526)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2673)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
    at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1192)
    at org.apache.sqoop.mapreduce.AsyncSqlOutputFormat$AsyncSqlExecThread.run(AsyncSqlOutputFormat.java:233)

Since the table has an AUTO_INCREMENT column, I am getting the above error. Is there any other way to alter this table and make it fast? As per this thread (Changing Large MySQL InnoDB Tables), it seems altering large InnoDB tables can take a long time, even days.

Any alternative suggestions would be appreciated.

1 Answer

Your data violates the primary key constraint:

If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints.

You need to truncate your table before running the export command, or update the existing data using this argument:

--update-key id
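
For example, building on the command from the question (a sketch; note that with --update-key alone Sqoop generates UPDATE statements only, so you may also need --update-mode allowinsert if some rows are missing from the target table):

sqoop export --connect jdbc:mysql://localhost:3306/graph1 --table nodes --username root --password password --export-dir <dir-path> --input-fields-terminated-by ',' --input-null-string 'NULL' --input-null-non-string 'NULL' --update-key id
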
  • Thanks, I tried that and it worked, but since the table has an AUTO_INCREMENT column (id), I am not able to load the data back into the MySQL table. So I am getting a new error now. – hlagvankar Jun 07 '16 at 12:17
  • @hlagvankar So you have to replace the first column with `NULL` so your DBMS can increment that field. – 54l3d Jun 07 '16 at 12:19
  • Sorry, I didn't get you. This table dump is taken every week (including the id column), so how can I ingest the data into HDFS with id set to NULL and then reload it into the new database and new table (same structure as the old table)? – hlagvankar Jun 07 '16 at 12:30
  • You are right. Can you post the new error? We can discuss it separately. – 54l3d Jun 07 '16 at 12:37
  • I will be creating a new database in MySQL with a new table having the same structure as the old one, and it will be empty. So during loading, do I need to mention the --update-key parameter? – hlagvankar Jun 07 '16 at 13:11
  • @hlagvankar If your table is empty and your data respects the table constraints, there is no need to use `--update-key` (see the sketch below). – 54l3d Jun 07 '16 at 13:30
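
For the scenario settled on in the comments (loading into an empty table in a new database), the export is just the original command pointed at the new schema, without --update-key; the database name graph2 below is only a placeholder:

sqoop export --connect jdbc:mysql://localhost:3306/graph2 --table nodes --username root --password password --export-dir <dir-path> --input-fields-terminated-by ',' --input-null-string 'NULL' --input-null-non-string 'NULL'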