
I have a CSV file that I am trying to load into my MySQL database. The file is >10MB and has over 100,000 rows. I need an efficient way to check whether a row is already in the database and to add the new entries. It is for updating the database with the most current data from another, third-party database.

It seems really inefficient to check line by line and add line by line. Also, if it helps, the CSV file has a date field.

j08691
Jacob Valenta

2 Answers


First, I would prepare the database table with indexes and a unique key.

Then I would parse the CSV line by line. If a row is a duplicate, the database will reject the insert and your script can simply continue.

If your CSV doesn't have a field that is unique, you can build a hash from all the data fields of that line and use that as the unique key. You could produce such a key with something like:

md5(field1 + ":" + field2 + ":" + ... + ":" + fieldN)
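A rough sketch of that idea in PHP with PDO follows; the records table, its columns, and the CSV layout are all made-up placeholders, so adapt them to your own schema:

// Assumed schema (names are placeholders):
//   CREATE TABLE records (
//     row_hash CHAR(32) NOT NULL,
//     field1 VARCHAR(255),
//     field2 VARCHAR(255),
//     record_date DATE,
//     UNIQUE KEY uq_row_hash (row_hash)
//   );
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare(
    'INSERT INTO records (row_hash, field1, field2, record_date)
     VALUES (:hash, :f1, :f2, :dt)'
);

$fh = fopen('data.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $hash = md5(implode(':', $row));    // unique key built from every field of the line
    try {
        $stmt->execute([
            ':hash' => $hash,
            ':f1'   => $row[0],
            ':f2'   => $row[1],
            ':dt'   => $row[2],
        ]);
    } catch (PDOException $e) {
        if ($e->errorInfo[1] == 1062) { // 1062 = MySQL duplicate-entry error
            continue;                   // row is already there, move on
        }
        throw $e;                       // any other error is a real problem
    }
}
fclose($fh);

The unique index does the duplicate detection for you, so the script never has to query before inserting.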

racribeiro

I can think of two simple options. The first is to query the DB once, store the entire result set in an array keyed by ID, and use that array to check each CSV row quickly:

$data[$queryResult['fieldID']] = $queryResult;
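For example, a sketch of that approach (assuming a PDO connection, a hypothetical records table whose fieldID column matches the CSV's first column, and made-up names for the other columns):

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// One query up front: index every existing row by its ID.
$data = [];
foreach ($pdo->query('SELECT * FROM records') as $queryResult) {
    $data[$queryResult['fieldID']] = $queryResult;
}

$insert = $pdo->prepare(
    'INSERT INTO records (fieldID, field1, record_date) VALUES (?, ?, ?)'
);

$fh = fopen('data.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    if (isset($data[$row[0]])) {
        continue;                       // row already in the database
    }
    $insert->execute([$row[0], $row[1], $row[2]]);
}
fclose($fh);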

The second option is INSERT IGNORE, which skips the duplicate-key warning when the row already exists and continues with the next one.
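A sketch of the INSERT IGNORE variant, under the same assumptions (hypothetical records table with a unique key, a three-column CSV, placeholder names):

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT IGNORE INTO records (fieldID, field1, record_date) VALUES (?, ?, ?)'
);

$fh = fopen('data.csv', 'r');
while (($row = fgetcsv($fh)) !== false) {
    $stmt->execute($row);               // rows that hit the unique key are skipped silently
}
fclose($fh);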

Manatax
  • The in-memory approach can be faster, but has terrible scaling characteristics and no consistency. (What if you read the whole table into memory, begin processing the file, but someone else has inserted a row?) INSERT IGNORE is one way to solve the problem, but that syntax is only available on MySQL and ignores many other errors besides duplicate key violations. – MikeB Feb 22 '13 at 18:32
  • The tag on his post includes MySQL, and it doesn't appear to be a concurrent-insert situation. – Manatax Feb 22 '13 at 18:34
  • Even in standalone situations, INSERT IGNORE ignores all errors (truncation and rounding, for example). – MikeB Feb 22 '13 at 20:45
  • @MikeB That is why I suggested two answers... it all depends on the situation. Maybe there is nothing to ignore but a duplicate field, and besides, the data should be filtered/processed before it is stored. I'm automatically counting on people using data sanitization... Or he can use a pre-built CSV module/library that generates extra overhead... or he can use the in-memory approach if this is the only method of data input. No method to answer the question as asked is flawless. – Manatax Feb 23 '13 at 04:51