
I'm facing a challenge that has never come up for me before, and I'm having trouble finding an efficient solution (likely because I'm not a trained programmer and don't know all the terminology).

The challenge:

I have a feed of data which I need to use to maintain a MySQL database each day. Doing this requires checking whether a record exists or not, then updating or inserting accordingly.

This is simple enough by itself, but when running it for thousands of records, it seems very inefficient to run a query for each record just to check whether it already exists in the database.


Is there a more efficient way than looping through my data feed and running an individual query for each record? Perhaps a way to prepare them into one larger query (assuming that is a more efficient approach)?

I'm not sure a code sample is needed here, but if there is any more information I can provide please just ask! I really appreciate any advice.


Edits:

@Sgt AJ - Each record in the data feed has a number of different columns, but they are indexed by an ID. I would check against that ID in the database to see whether a record exists. In this situation I'm only updating one table, albeit a large one (30+ columns, mostly text).

turnfire
    Without knowing more about the data or how your database is organized makes it tough to really give you any detailed answer. If you can group the data feed into logical groups that you could also use in a query, perhaps you could load portions of your database into an array, check all data against that array, and do whatever updates you need, then rewrite the array with the next portion of records from your database and repeat for the next logical group of data from your feed. – Sgt AJ Jun 05 '16 at 00:45
  • Thanks for the comment Sgt! I've updated the thread with a bit more info for you. In terms of preloading an array of ID's from the database to compare with the feed -- there could be hundreds of thousands of records, while the feed would only contain a smaller sample of that. – turnfire Jun 05 '16 at 00:58

2 Answers


If the problem is the performance of checking whether a record exists and then inserting or updating, MySQL can do all of that in a single statement with INSERT ... ON DUPLICATE KEY UPDATE:

insert into your_table (email, country, reach_time)
values ('mike@gmail.com', 'Italy', '2016-06-05 00:44:33')
on duplicate key update reach_time = '2016-06-05 00:44:33';

This assumes that your key is email (i.e., email has a UNIQUE or PRIMARY KEY index).
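The same statement also accepts multiple rows per batch, which cuts round-trips down further; values(col) in the update clause refers to the value that row would have inserted. A sketch with a few made-up rows, using the same hypothetical table as above:

```sql
insert into your_table (email, country, reach_time)
values
  ('mike@gmail.com', 'Italy',  '2016-06-05 00:44:33'),
  ('anna@gmail.com', 'Spain',  '2016-06-05 00:45:10'),
  ('piet@gmail.com', 'France', '2016-06-05 00:46:02')
on duplicate key update
  country    = values(country),
  reach_time = values(reach_time);
```

Any row whose email already exists is updated in place; the rest are inserted.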


Old style (don't use):

if email exists

    update your_table
       set reach_time = '2016-06-05 00:44:33'
     where email = 'mike@gmail.com';

else

    insert into your_table (email, country, reach_time)
    values ('mike@gmail.com', 'Italy', '2016-06-05 00:44:33');
  • thanks for your answer! I just did a bit of research on the 'duplicate key' and now it makes sense to me. That's exactly what I'm looking for =) Much more efficient than the if/else I would have used, and nicer in code. A few more things to learn then I'll likely check this off as the best answer. Cheers! – turnfire Jun 05 '16 at 01:12

It depends on how many 'feed' rows you have to load. If it's around 10, then doing them record by record (as shown by mustafayelmer) is probably not too bad. Once you get into the hundreds and above, I would highly suggest a set-based approach. There is some overhead in creating and loading the staging table, but it is (very) quickly offset by the reduction in the number of queries that need to be executed and the number of round-trips over the network.

In short, what you'd do is:

-- create new, empty staging table
SELECT * INTO stagingTable FROM myTable WHERE 1 = 2

-- adding a PK to make JOIN later on easier
ALTER TABLE stagingTable ADD PRIMARY KEY (key1)

-- load the data either using INSERTS or using some other method
-- [...] 

-- update existing records
UPDATE myTable
   SET field1 = s.field1,
       field2 = s.field2,
       field3 = s.field3
  FROM stagingTable s
 WHERE s.key1 = myTable.key1

-- insert new records
INSERT myTable (key1, field1, field2, field3)
SELECT key1, field1, field2, field3
  FROM stagingTable new
 WHERE NOT EXISTS ( SELECT * 
                      FROM myTable old
                     WHERE old.key1 = new.key1 )

-- get rid of staging table again
DROP TABLE stagingTable

to bring your data up to date.
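Since the question targets MySQL, a rough translation of the above into MySQL syntax might look like this (an untested sketch; the table and column names are illustrative, and CREATE TABLE ... LIKE copies the structure including the primary key, so the separate ALTER step is not needed):

```sql
-- create new, empty staging table with the same structure and keys
CREATE TABLE stagingTable LIKE myTable;

-- load the data (multi-row INSERTs, LOAD DATA INFILE, ...)
-- [...]

-- update existing records
UPDATE myTable m
  JOIN stagingTable s ON s.key1 = m.key1
   SET m.field1 = s.field1,
       m.field2 = s.field2,
       m.field3 = s.field3;

-- insert new records
INSERT INTO myTable (key1, field1, field2, field3)
SELECT s.key1, s.field1, s.field2, s.field3
  FROM stagingTable s
  LEFT JOIN myTable m ON m.key1 = s.key1
 WHERE m.key1 IS NULL;

-- get rid of staging table again
DROP TABLE stagingTable;
```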

Notes:

  • you might want to make the name of stagingTable 'random' to avoid a situation where two loads run in parallel and start re-using the same table, giving all kinds of weird results (and errors). Since all this code is generated in PHP anyway, you can simply add a timestamp or similar to the table name.

  • on MSSQL I would load all the data into the staging table using a bulk-insert mechanism; it can use bcp or BULK INSERT, and .NET actually has the SqlBulkCopy class for this. Some quick googling shows me MySQL has mysqlimport if you don't mind writing to a temp file first and then loading from there, or you can batch rows into big multi-row INSERT statements rather than inserting one by one. I'd avoid doing 10k inserts in one go, though; rather do them per 100 or 500 or so, and you'll need to test what's most efficient.
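For the MySQL loading step, a bulk load from a tab-separated temp file could look like this (a sketch; the file path and column list are assumptions, and the server must allow LOCAL loads):

```sql
-- load the feed file into the staging table in one statement
LOAD DATA LOCAL INFILE '/tmp/feed.tsv'
INTO TABLE stagingTable
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(key1, field1, field2, field3);
```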

PS: you'll need to adapt my syntax a bit here and there; like I said, I'm more familiar with MSSQL's T-SQL dialect. Also, you may be able to use the ON DUPLICATE KEY approach on the staging table directly, thus combining the UPDATE and INSERT into one command. (MSSQL uses MERGE for this, but it would look completely different, so I won't include it here.)

Good luck.

deroby