There are two SQLite databases, A and B, each with 10+ million rows. The task is to read every record in B and insert into A the ones that are not already present there.
There are no duplicates in A, but there are plenty of duplicates in B.
This has turned out to be a tricky job. Back when A and B each had only about 1 million rows, all I did was this:
    list_from_A = read all rows from database A into a list
    list_from_B = read all rows from database B into a list
    no_dupe_list = list(set(list_from_B) - set(list_from_A))
    append no_dupe_list to database A
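For concreteness, that in-memory approach looked roughly like this with sqlite3 (the file names, the table "records" and the column "value" are placeholders, not my real schema):

    import sqlite3

    # Placeholder schema: each database has a table "records" with one column "value".
    conn_a = sqlite3.connect("A.sqlite")
    conn_b = sqlite3.connect("B.sqlite")

    rows_a = set(conn_a.execute("SELECT value FROM records"))
    rows_b = set(conn_b.execute("SELECT value FROM records"))

    # Rows that exist in B but not in A; building the set also drops B's own duplicates.
    new_rows = rows_b - rows_a
    conn_a.executemany("INSERT INTO records (value) VALUES (?)", new_rows)
    conn_a.commit()

    conn_a.close()
    conn_b.close()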
Now that the two databases are far too big to read entirely into memory, I keep hitting MemoryError when I do this, even though only about 2 GB of memory is in use at that point, while my Windows 7 64-bit machine has 16 GB of RAM in total. So, as a side question, is there any way to let Python take full advantage of that?
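One thing I'm not sure about is whether my Python build itself is 32-bit, since a 32-bit process on Windows is capped at roughly 2 GB no matter how much RAM is installed, which would match the ceiling I'm seeing. A quick check:

    import struct
    import sys

    # Reports 32 or 64 depending on the interpreter build, not the OS.
    print("%d-bit Python" % (struct.calcsize("P") * 8))
    print(sys.version)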
Anyway, at the moment I have to cut database B into several parts to do the job, which is much less efficient: B already contains lots of duplicates, and they end up scattered across different parts, each of which is processed in its turn against database A. Is there a faster way to do this?
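For reference, the chunked version I'm using now looks more or less like this (same placeholder names as above):

    import sqlite3

    # Read B a slice at a time, drop duplicates within the slice,
    # then check each remaining row against A before inserting it.
    conn_a = sqlite3.connect("A.sqlite")
    conn_b = sqlite3.connect("B.sqlite")
    cur_b = conn_b.execute("SELECT value FROM records")

    while True:
        chunk = cur_b.fetchmany(500000)
        if not chunk:
            break
        for row in set(chunk):  # duplicates inside one chunk are cheap to drop...
            found = conn_a.execute(
                "SELECT 1 FROM records WHERE value = ? LIMIT 1", row
            ).fetchone()
            if found is None:
                conn_a.execute("INSERT INTO records (value) VALUES (?)", row)
        conn_a.commit()  # ...but duplicates spread across chunks get checked against A again and again

    conn_a.close()
    conn_b.close()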