
Does the performance of LOAD DATA LOCAL INFILE come down when we use a UNIQUE key column to avoid duplicates?

When I test a .csv file import with and without a unique key column, I have the following observations:

  1. Without the unique key constraint, 1 million records insert into the table without any issue, but duplicates are accepted.
  2. With a unique key constraint on a column to avoid duplicates, 70k records imported fine in about a minute, whereas half a million records never finish inserting; the load runs for hours, whether started from the web or from MySQL Workbench. Sometimes I see the following error:

Error Code: 1205. Lock wait timeout exceeded; try restarting transaction

Query used:

LOAD DATA LOCAL INFILE 'file.csv' IGNORE
    INTO TABLE table1
    CHARACTER SET utf8mb4
    FIELDS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\r\n';

Table structure:

CREATE TABLE table1 (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `cUrl` varchar(255) DEFAULT NULL,
  `SUrl` varchar(255) DEFAULT NULL,
  `FirstName` varchar(100) DEFAULT NULL,
  `LastName` varchar(100) DEFAULT NULL,
  `EmailId` varchar(150) NOT NULL,
  `Filter` varchar(55) DEFAULT NULL,
  `Designation` varchar(255) DEFAULT NULL,
  `Company` varchar(255) DEFAULT NULL,
  `WebSite` varchar(255) DEFAULT NULL,
  `size` varchar(25) DEFAULT NULL,
  `Industry` varchar(255) DEFAULT NULL,
  `Type` varchar(25) DEFAULT NULL,
  `Founded` varchar(5) DEFAULT NULL,
  `date` varchar(55) NOT NULL,
  PRIMARY KEY (`Id`),
  UNIQUE KEY `EmailId` (`EmailId`)
) ENGINE=InnoDB AUTO_INCREMENT=16812 DEFAULT CHARSET=utf8mb4;

I applied the following settings too:

set names utf8mb4;
set unique_checks = 0;
set foreign_key_checks = 0;
set sql_log_bin=0;

mysql> select version();
+------------+
| version()  |
+------------+
| 5.7.18-log |
+------------+
1 row in set (0.00 sec)

SELECT @@innodb_buffer_pool_size;
+---------------------------+
| @@innodb_buffer_pool_size |
+---------------------------+
|                   8388608 |
+---------------------------+
1 row in set (0.00 sec)

I fixed the UTF8mb4 issue, but with the unique key constraint I am failing to import the half-million-record csv file in my test. I have to work with 40 million records, but of course I will do it in chunks.

Is it not efficient to use LOAD DATA LOCAL INFILE to import a large csv with a unique key constraint?

Suggestions and guidance would be appreciated.

– UM1979
  • I would remove UNIQUE KEY EmailId, and if your import could have duplicate ids, I would remove the primary key on the first field too. Having indexes dramatically slows down large imports. It is much faster to add indexes after the import is finished, and it's probably easier to manage duplicates once the import is done. – Antony Gibbs Jun 03 '17 at 19:43
  • @AntonyGibbs: Thank you for your guidance. I removed the unique key on EmailId; the data loads faster, but finding the duplicates afterwards takes a very long time. If I remove the primary key on the first field too, I may not have a reference for removing the duplicates, or I may need multiple temporary tables to do these operations before the final result is pushed to the master/main table. Not sure, but I need to try. – UM1979 Jun 04 '17 at 05:13
  • Once you have imported the data, then create the indexes. If you have duplicates on the primary key, you'll need to add an extra id field with a primary key and auto-increment. For the email, only add an index (not unique) and finding duplicates will be much faster (see the sketch below). – Antony Gibbs Jun 05 '17 at 13:50
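
A minimal sketch of that load-then-index workflow, using the table and column names from the question (the index name idx_emailid is an assumption):

-- After LOAD DATA into a table with no unique key on EmailId,
-- build the non-unique index in one batch:
ALTER TABLE table1 ADD INDEX idx_emailid (EmailId);

-- Then list the duplicate addresses:
SELECT EmailId, COUNT(*) AS cnt
FROM table1
GROUP BY EmailId
HAVING cnt > 1;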

2 Answers


Does the performance of LOAD DATA LOCAL INFILE come down when we use a UNIQUE key column to avoid duplicates?

Yes. LOAD DATA creates non-unique indexes in batches to increase performance, because updating the index for each INSERT/DELETE/UPDATE operation on a table adds extra overhead to those operations. That's why you find your inserts really slow when you have too many indexes.

Index updates for each row cannot be avoided for unique indexes because the next row you insert might be a duplicate.

Still, you have no choice here: loading the data with the unique index switched off and then deleting duplicates afterwards would be much, much slower. However, there are things you can do to speed things up.

Reduce size on disk

You have lots of varchar columns. I have a feeling that Founded, date and size are really integer columns; if so, they should be created as integer columns, which take up a lot less space. The biggest bottleneck in LOAD DATA is I/O, so why make things harder?

Is EmailId an email address or a numeric id? If it's numeric, it too should be stored as an integer.
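
As a rough sketch of those conversions (this assumes Founded is a 4-digit year, size is a count, and every date value actually parses as a date; if the CSV is messy, load the values as strings first and convert afterwards):

ALTER TABLE table1
  MODIFY `Founded` smallint unsigned DEFAULT NULL,  -- e.g. 1999
  MODIFY `size`    int unsigned DEFAULT NULL,       -- e.g. an employee count
  MODIFY `date`    date NOT NULL;                   -- only if the strings are valid dates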

Drop the primary key!

It seems redundant to have EmailId as unique while also having that auto-increment primary key. Drop the auto-increment column and make EmailId the primary key, all the more so if EmailId is numeric.
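
A minimal sketch of that change, to be run on the empty table before loading (it assumes nothing else references Id):

ALTER TABLE table1
  DROP COLUMN `Id`,             -- also removes the old auto-increment primary key
  DROP INDEX `EmailId`,         -- drops the now-redundant unique key
  ADD PRIMARY KEY (`EmailId`);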

Normalize your Data

Your fundamental problem seems to be that your data isn't normalized. If you normalize it, the CSV dump could be much smaller, and the data on disk could also be much smaller, giving you a big boost.
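
For illustration only, one way to start normalizing would be to factor the repeated company attributes out of table1 into a lookup table (the table and column layout here is an assumption based on the question):

CREATE TABLE company (
  company_id int unsigned NOT NULL AUTO_INCREMENT,
  Company    varchar(255) NOT NULL,
  WebSite    varchar(255) DEFAULT NULL,
  Industry   varchar(255) DEFAULT NULL,
  `size`     varchar(25)  DEFAULT NULL,
  PRIMARY KEY (company_id),
  UNIQUE KEY (Company)
) ENGINE=InnoDB;

table1 would then carry a single company_id column instead of four varchar columns, shrinking both the CSV and the rows on disk.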

– e4c5
  • Thank you for your response. EmailId is not numeric. Even if I drop EmailId as unique, identifying the duplicates takes a lot of time. Moreover, every day I get a new csv with different data that needs to be imported, which means the actual database grows every day, so I may need to do 3 operations: LOAD DATA, identify duplicates, remove them, which may be a cumbersome operation. I am not seeing how to make this efficient enough to get results in time. – UM1979 Jun 04 '17 at 05:10
  • Nowhere in my answer do I say you should drop the unique on EmailId. I am actually telling you the opposite. – e4c5 Jun 04 '17 at 05:18
  • Loading without the unique constraint and then searching for duplicates is going to be much, much slower. – e4c5 Jun 04 '17 at 05:19
  • I ran into an issue last week and came across this answer https://stackoverflow.com/questions/9819271/why-is-mysql-innodb-insert-so-slow which states that InnoDB has a really hard time with non-numeric unique keys. From my personal experience, updating an alphanumeric unique key took more than 20ms on InnoDB and less than 2ms with MyISAM. Flushing the unique alphanumeric key also resolved the issue. – Louis Loudog Trottier Jun 05 '17 at 01:19
  • Set innodb_buffer_pool_size to about 70% of available RAM; 8M is ridiculously small. Did you upgrade from a previous version where that was the default? I'm pretty sure 5.7.18 has a bigger default, but not big enough. (See the sketch after this list.)
  • If practical, sort the csv file by email before doing the LOAD. In *nix it is a pretty simple sort command. This avoids having to jump all over the index to validate uniqueness for each row.
  • Will you have any other indexes? Do you need id? If no to both, then make email the PRIMARY KEY and get rid of id.
  • Use reasonable datatypes for the various columns, else you will have SQL issues later. (Or perhaps you need to bring in the values as strings, then cleanse them?)
  • Normalize (I agree with @e4c5).
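
A minimal sketch of the first two points (the 11G figure assumes a machine with roughly 16 GB of RAM; the sort command assumes EmailId is the 6th comma-separated field and no field contains embedded commas, since plain sort is not CSV-aware):

-- MySQL 5.7 can resize the buffer pool at runtime:
SET GLOBAL innodb_buffer_pool_size = 11 * 1024 * 1024 * 1024;

# shell: pre-sort the csv on the email column before LOAD DATA
sort -t, -k6,6 file.csv > file_sorted.csv
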
– Rick James