2

I have created a system using PHP/MySQL that downloads a large XML dataset, parses it and then inserts the parsed data into a MySQL database every week.

This system is made up of two databases with the same structure. One is a production database and one is a temporary database where the data is parsed and inserted into first.

When the data has been inserted into the temporary database I perform a merge by inserting/replacing the data in the production database. I have done all of the above so far. I then realised, data that might have been removed in a new dataset will be left to linger in the production database.

I need to perform a check to see if the new data is still in the production database, if it is then leave it, if it isn't delete the row from the production database so that the rows aren't left to linger.

For arguments sake, let's say the two databases are called database_temporary and database_production.

How can I go about doing this?

jskidd3
  • 4,609
  • 15
  • 63
  • 127
  • How are you merging the data? If via SQL, A simple SQL can do the delete (`delete from table where pk not in (select pk from temp)`) – Nivas May 27 '14 at 14:56
  • @Nivas Sorry, I should have said, yes it is using SQL. If it's as simple as that maybe you could post it as an answer? Thanks :D – jskidd3 May 27 '14 at 14:57
  • If you're going to use a NOT IN clause, probably you should use a timestamp column for performance. It should check only the rows that you didn't check before. – fmgonzalez May 27 '14 at 15:00

1 Answers1

4

If you are using SQL to merge, a simple SQL can do the delete as well:

delete from database_production.table
where pk not in (select pk from database_temporary.table)

Notes:

An example not exists:

delete from database_production.table p
where not exists (select 1 from database_temporary.table t where t.pk = p.pk)

Performance Notes:
As pointed out by @mgonzalez in the comments on the question, you may want to use a timestamp column (something like last modified) for comparing/merging in general so that you vompare only changed rows. This does not apply to the delete specifically, you cannot use timestamp for the delete because, well, the row would not exist.

Community
  • 1
  • 1
Nivas
  • 18,126
  • 4
  • 62
  • 76
  • Could you provide a `not exists` as example as well please? I suppose it depends how large, the databases in question both consist of around 20 million rows each – jskidd3 May 27 '14 at 15:02