
When using LOAD DATA INFILE, is there a way to either flag a duplicate row, or dump any/all duplicates into a separate table?

OMG Ponies
mike
    Related: http://stackoverflow.com/questions/1506511/is-it-possible-to-use-a-load-data-infile-type-command-to-update-rows-in-the-db – OMG Ponies Dec 27 '09 at 03:08

2 Answers

14

From the LOAD DATA INFILE documentation:

The REPLACE and IGNORE keywords control handling of input rows that duplicate existing rows on unique key values:

  • If you specify REPLACE, input rows replace existing rows that have the same value for a primary key or unique index. See Section 12.2.7, “REPLACE Syntax”.
  • If you specify IGNORE, input rows that duplicate an existing row on a unique key value are skipped.

If you do not specify either option, the behavior depends on whether the LOCAL keyword is specified. Without LOCAL, an error occurs when a duplicate key value is found, and the rest of the text file is ignored. With LOCAL, the default behavior is the same as if IGNORE were specified; this is because the server has no way to stop transmission of the file in the middle of the operation.
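For example, a load that silently skips any rows colliding on the primary key would use the IGNORE form (the path and table name here are placeholders, not from the question):

```sql
-- Skip input rows that duplicate an existing row on a unique key
LOAD DATA INFILE '/path/to/data.csv'
IGNORE INTO TABLE target
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```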

Effectively, there's no way to redirect the duplicate records to a different table. You'd have to load them all in, and then create another table to hold the non-duplicated records.
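One way to approximate that two-step approach, sketched below with illustrative names (`target`, `staging`, and `duplicates_table` are assumptions, not from the question), is to load everything into a key-less staging table and split the rows afterwards:

```sql
-- Sketch only: load into a staging table with no unique key,
-- then separate duplicates from fresh rows.
CREATE TABLE staging LIKE target;
ALTER TABLE staging DROP PRIMARY KEY;  -- allow duplicate keys in staging

LOAD DATA INFILE '/path/to/file.csv'
INTO TABLE staging
FIELDS TERMINATED BY ',';

-- Capture rows whose key already exists in the target table
INSERT INTO duplicates_table
SELECT s.* FROM staging s JOIN target t ON t.id = s.id;

-- Load the remaining rows; IGNORE skips any leftover collisions
INSERT IGNORE INTO target SELECT * FROM staging;
```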

OMG Ponies
  • I wonder if it would be possible to create a BEFORE INSERT trigger that queries the table to see if the ID of the insert exists, and then inserts it into another table if it does. It'd be slow as heck, but it could work ... (BEFORE / AFTER INSERT triggers get called for every row in a LOAD DATA query). – cmptrgeekken Feb 27 '10 at 04:29
  • Hello friends, I have a table with 7 columns, and the primary key doesn't help much for this query, since it's based on column values. I want to insert or update a single column from a CSV file, but the duplicate handling isn't working correctly. Is this scenario possible: if three of a row's columns A, B, C match an existing record's values, do an update; if there is no match, insert the row? –  Nov 09 '16 at 21:52
4

It looks as if there actually is something you can do when it comes to duplicate rows for LOAD DATA calls. However, the approach that I've found isn't perfect: it acts more as a log for all deletes on a table, instead of just for LOAD DATA calls. Here's my approach:

Table test:

CREATE TABLE test (
    id INTEGER PRIMARY KEY,
    text VARCHAR(255) DEFAULT NULL
);

Table test_log:

CREATE TABLE test_log (
    id INTEGER, -- not primary key, we want to accept duplicate rows
    text VARCHAR(255) DEFAULT NULL,
    time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Trigger del_chk:

drop trigger if exists del_chk;

delimiter //
CREATE TRIGGER del_chk AFTER DELETE ON test
FOR EACH ROW
BEGIN
    INSERT INTO test_log(id,text) VALUES (OLD.id, OLD.text);
END//
delimiter ;

Test import (/home/user/test.csv):

1,asdf
2,jkl
3,qwer
1,tyui
1,zxcv
2,bnm

Query:

LOAD DATA INFILE '/home/user/test.csv'
REPLACE INTO TABLE test 
FIELDS 
    TERMINATED BY ','
LINES
    TERMINATED BY '\n' (id,text);

Running the above query will result in 1,asdf, 1,tyui, and 2,jkl being added to the log table. Based on a timestamp, it could be possible to associate the rows with a particular LOAD DATA statement.
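For example, if you note roughly when the load ran, a simple window over the log's timestamp could pull out just that run's duplicates (the time values below are placeholders):

```sql
-- Sketch: fetch deletes logged during one LOAD DATA run
SELECT id, text, time
FROM test_log
WHERE time BETWEEN '2010-02-27 04:00:00'  -- approximate start of the load
               AND '2010-02-27 04:01:00'; -- approximate end of the load
```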

cmptrgeekken