
I have a CSV file that I am trying to load into my MySQL database. The file is >10MB and has over 100,000 rows. I need an efficient way to check whether a row is already in the database and to add the new entries. It is for updating the database with the most current data from another, third-party database.

It seems really inefficient to check line by line and add line by line. Also, if it helps, the CSV file has a date field.

j08691
Jacob Valenta

2 Answers


First, I would prepare the database table with indexes and a unique key.

Then I would parse the CSV line by line. If a row is a duplicate, the database will reject the insert and your script can simply continue.

If your CSV doesn't have a field that is unique, you can build a hash from all the data fields of that line and use that as the unique key. You could produce such a key with something like:

md5(field1 + ":" + field2 + ":" + ... + ":" + fieldN)
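A rough sketch of that idea in PHP with PDO follows; the records table, its columns, and the CSV layout are all made-up placeholders, so adapt them to your own schema:

// Assumed schema (names are placeholders):
//   CREATE TABLE records (
//     row_hash CHAR(32) NOT NULL,
//     field1 VARCHAR(255),
//     field2 VARCHAR(255),
//     record_date DATE,
//     UNIQUE KEY uq_row_hash (row_hash)
//   );
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare(
    'INSERT INTO records (row_hash, field1, field2, record_date)
     VALUES (:hash, :f1, :f2, :dt)'
);

$fh = fopen('data.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $hash = md5(implode(':', $row));    // unique key built from every field of the line
    try {
        $stmt->execute([
            ':hash' => $hash,
            ':f1'   => $row[0],
            ':f2'   => $row[1],
            ':dt'   => $row[2],
        ]);
    } catch (PDOException $e) {
        if ($e->errorInfo[1] == 1062) { // 1062 = MySQL duplicate-entry error
            continue;                   // row is already there, move on
        }
        throw $e;                       // any other error is a real problem
    }
}
fclose($fh);

The unique index does the duplicate detection for you, so the script never has to query before inserting.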

racribeiro

I can think of two simple options. The first is to query the DB once, store the entire result set in an array keyed by ID, and use that array to check each CSV row quickly:

$data[$queryResult['fieldID']] = $queryResult;
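For example, a sketch of that approach (assuming a PDO connection, a hypothetical records table whose fieldID column matches the CSV's first column, and made-up names for the other columns):

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// One query up front: index every existing row by its ID.
$data = [];
foreach ($pdo->query('SELECT * FROM records') as $queryResult) {
    $data[$queryResult['fieldID']] = $queryResult;
}

$insert = $pdo->prepare(
    'INSERT INTO records (fieldID, field1, record_date) VALUES (?, ?, ?)'
);

$fh = fopen('data.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    if (isset($data[$row[0]])) {
        continue;                       // row already in the database
    }
    $insert->execute([$row[0], $row[1], $row[2]]);
}
fclose($fh);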

The second option is INSERT IGNORE, which skips the duplicate-key warning when the row already exists and continues with the next one.
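A sketch of the INSERT IGNORE variant, under the same assumptions (hypothetical records table with a unique key, a three-column CSV, placeholder names):

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT IGNORE INTO records (fieldID, field1, record_date) VALUES (?, ?, ?)'
);

$fh = fopen('data.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $stmt->execute($row);               // rows that hit the unique key are skipped silently
}
fclose($fh);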

Manatax
  • The in-memory approach can be faster, but has terrible scaling characteristics and no consistency. (What if you read the whole table into memory, begin processing the file, but someone else has inserted a row?) INSERT IGNORE is one way to solve the problem, but that syntax is only available on MySQL and ignores many other errors besides duplicate key violations. – MikeB Feb 22 '13 at 18:32
  • The tag on his post includes MySQL, and it doesn't appear to be a concurrent-insert situation. – Manatax Feb 22 '13 at 18:34
  • Even in standalone situations, INSERT IGNORE ignores all errors (truncation and rounding, for example). – MikeB Feb 22 '13 at 20:45
  • @MikeB That is why I suggested two answers... it all depends on the situation. Maybe there is nothing to ignore but a duplicate field, and besides, the data should be filtered/processed before it is stored. I'm automatically counting on people using data sanitization... Or he can use a pre-built CSV module/library that generates extra overhead... or he can use the in-memory approach if this is the only method of data input. No method to answer the question as asked is flawless. – Manatax Feb 23 '13 at 04:51