There are two SQLite databases, A and B, each consisting of 10+ million rows. The task is to read all records in B and add to A those that do not already exist there.

There are no duplicates in A, but there are plenty of duplicates in B.

This turned out to be a tricky job. Back when A and B each held only about 1 million rows, all I did was this:

list_from_A = read all rows into a list from database A
list_from_B = read all rows into a list from database B

no_dupe_list = list(set(list_from_B) - set(list_from_A))

append no_dupe_list to database A
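
In concrete terms, that in-memory approach looks roughly like the following sketch, which assumes a table named records with a single column value (the actual schema isn't shown here):

import sqlite3

# The old all-in-memory approach (assumed schema: table "records", column "value")
db_a = sqlite3.connect("A.sqlite")
db_b = sqlite3.connect("B.sqlite")

list_from_A = [row[0] for row in db_a.execute("SELECT value FROM records")]
list_from_B = [row[0] for row in db_b.execute("SELECT value FROM records")]

# Rows of B that are not yet in A, with B's own duplicates removed as well
no_dupe_list = list(set(list_from_B) - set(list_from_A))

db_a.executemany("INSERT INTO records (value) VALUES (?)", ((v,) for v in no_dupe_list))
db_a.commit()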

Right now the two databases are far too big to be read entirely into memory, and I constantly run into MemoryError when I try. Oddly, only about 2 GB of memory is in use when that happens, while there is 16 GB of RAM in total on my Win 7 64-bit machine; as a side question, is there any way to let Python take full advantage of it?

Anyway, for now I basically have to cut database B into several parts to do the job. This is much less efficient, because B already contains lots of duplicates, and those duplicates end up scattered across different parts, each of which is processed separately against database A. So, is there a faster way to do this?
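
For reference, here is roughly what that chunked processing might look like with Python's sqlite3 module. The table name records, the column value, and the chunk size are assumptions for illustration (the real schema isn't shown), and an index on A's value column is assumed so the per-row lookups are not full table scans:

import sqlite3

# Rough sketch of the chunked approach (assumed schema: table "records", column "value")
db_a = sqlite3.connect("A.sqlite")
db_b = sqlite3.connect("B.sqlite")

cur_b = db_b.cursor()
cur_b.execute("SELECT value FROM records")

CHUNK_SIZE = 100000
while True:
    rows = cur_b.fetchmany(CHUNK_SIZE)
    if not rows:
        break
    # Duplicates are removed only within each chunk; duplicates spread across
    # chunks are still checked against A again, which is the inefficiency noted above.
    chunk_values = {row[0] for row in rows}
    cur_a = db_a.cursor()
    for value in chunk_values:
        cur_a.execute("SELECT 1 FROM records WHERE value = ? LIMIT 1", (value,))
        if cur_a.fetchone() is None:
            cur_a.execute("INSERT INTO records (value) VALUES (?)", (value,))
    db_a.commit()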

Shane
  • How do you check whether records are duplicated? Do you check more than one field? I would suggest reading only the key fields and hashing them: give listA two fields, id and a hash key of the fields you want to check against listB, do the same for listB, then compare the hash keys. – Azadeh Khojandi Apr 25 '13 at 03:47
  • @AzadehKhojandi: Actually there's only one field to check. And I don't think a hash key would help either, since it won't solve the "out of memory" problem, which means I'd still have to split database B into lots of parts to finish the job. – Shane Apr 25 '13 at 05:08
  • Since the DB cannot cross the 2 GB RAM boundary, you'd have to use the file method: create two files, file-A and file-B, each containing [value] and [key], and sort each file by [value]. The problem is then reduced to reading one line at a time from both files to find the new records. – Alvin K. Apr 25 '13 at 07:35
  • @AlvinK.: I guess the whole problem boils down to managing memory effectively. Even if you create two files, they are still far too large to be read fully into memory, so splitting the files into several pieces may still be necessary? There has to be a better way to do all this, either by taking full advantage of the large system RAM or with a better algorithm. – Shane Apr 25 '13 at 14:29
  • @Shane: Check the SO answer to [how-to-read-large-file-line-by-line-in-python](http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python), and google for more related answers. – Alvin K. Apr 25 '13 at 17:59
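
Below is a minimal sketch of the sorted-file merge Alvin K. suggests in the comments, again assuming a table records with a single compared column value, and assuming the values are text so that SQLite's default sort order matches Python's string comparison:

import sqlite3

# Dump each database's compared column to a sorted text file.
def dump_sorted(db_path, out_path):
    conn = sqlite3.connect(db_path)
    with open(out_path, "w", encoding="utf-8") as f:
        # SQLite does the sorting, spilling to temporary disk storage if needed.
        for (value,) in conn.execute("SELECT value FROM records ORDER BY value"):
            f.write(f"{value}\n")
    conn.close()

dump_sorted("A.sqlite", "file-A.txt")
dump_sorted("B.sqlite", "file-B.txt")

# Merge pass: collect values that appear in B but not in A, skipping B's duplicates.
with open("file-A.txt", encoding="utf-8") as fa, \
     open("file-B.txt", encoding="utf-8") as fb, \
     open("new-records.txt", "w", encoding="utf-8") as out:
    a, b = fa.readline(), fb.readline()
    last_written = None
    while b:
        if a and a < b:
            a = fa.readline()       # A is behind, advance it
        elif a and a == b:
            b = fb.readline()       # value already exists in A, skip it
        else:                       # b is new (a > b, or A is exhausted)
            if b != last_written:   # drop duplicates within B
                out.write(b)
                last_written = b
            b = fb.readline()

# new-records.txt now holds the rows to bulk-insert into A.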

1 Answer

You can open cursors over ordered SELECTs on both databases and walk through their records in parallel (as in a merge sort), inserting the rows missing from database A. It is best to insert them into a temporary table first, so the cursor on A is not disturbed; this way only two rows need to be held in memory at a time.

Like:

ca = db_a.cursor("SELECT ....")  # both SELECTs ordered by the compared field
cb = db_b.cursor("SELECT ....")
ra = ca.fetch_row()
rb = cb.fetch_row()
last_row_in_new_A = None
while ra and rb:
  if ra <= rb:
    last_row_in_new_A = ra
    ra = ca.fetch_row()
  else:  # rb < ra
    if rb != last_row_in_new_A:  # Removes possible duplicates
      insert row rb
      last_row_in_new_A = rb
    rb = cb.fetch_row()
# Insert remaining rows from B
while rb:
  if rb != last_row_in_new_A:
    insert row rb
    last_row_in_new_A = rb
  rb = cb.fetch_row()
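
Concretely, a runnable version of this merge with Python's sqlite3 module might look like the sketch below. The table name records, the single compared column value, and the file names are assumptions, since the actual schema is not given; the temporary table keeps the inserts away from the table the ordered cursor on A is still reading.

import sqlite3

# Parallel-cursor merge (assumed schema: table "records", compared column "value")
db_a = sqlite3.connect("A.sqlite")
db_b = sqlite3.connect("B.sqlite")

ca = db_a.cursor()
cb = db_b.cursor()
ca.execute("SELECT value FROM records ORDER BY value")
cb.execute("SELECT value FROM records ORDER BY value")

# New rows go into a temporary table first, so the ordered cursor on A is not disturbed.
ins = db_a.cursor()
ins.execute("CREATE TEMP TABLE new_rows (value)")

ra = ca.fetchone()
rb = cb.fetchone()
last_row_in_new_A = None

while ra and rb:
    if ra <= rb:
        last_row_in_new_A = ra
        ra = ca.fetchone()
    else:  # rb < ra, so rb is not in A
        if rb != last_row_in_new_A:  # removes possible duplicates from B
            ins.execute("INSERT INTO new_rows (value) VALUES (?)", rb)
            last_row_in_new_A = rb
        rb = cb.fetchone()

# Insert the remaining rows from B once A is exhausted
while rb:
    if rb != last_row_in_new_A:
        ins.execute("INSERT INTO new_rows (value) VALUES (?)", rb)
        last_row_in_new_A = rb
    rb = cb.fetchone()

# Close the read cursors before touching the main table, then copy the new rows over.
ca.close()
cb.close()
ins.execute("INSERT INTO records (value) SELECT value FROM new_rows")
db_a.commit()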
Ante