I have an SQLite table containing metadata about some files, with a UNIQUE column on the file's hash. The client I'm writing can, to a limited extent, populate this database on its own just by parsing the files in question. However, the database is primarily used for metadata about files the client may want to download, and, more importantly, the authors of many of these files did not include metadata in them, in which case the relevant row must be populated by humans. The client therefore also queries several user-defined servers, each of which has its own possibly-incomplete copy of the database. Each server may have some blank (NULL) fields in a row that another server has populated. Every column has a single canonical "correct" value for any given row, so if two copies of the database populate the same field with different values, one of them is corrupt.
I download a copy of the remote database, which is supplied to me as a CSV file, and load it into my database (which I already have figured out -- this is Python, so it's really as simple as
self.cursor.execute('create temporary table newinfo(a,b,c,d,e,f)')
with open('remote.csv') as f:
    self.cursor.executemany('insert into newinfo values (?,?,?,?,?,?)', csv.reader(f))
The performance of that probably isn't great, but if I cared about performance I wouldn't be using Python. Suggestions to do it faster are welcome but not expected. I should also mention that I'm not actually downloading to a temporary file and parsing that -- I'm really using an io.TextIOWrapper around the HTTPResponse object -- I just wrote it as opening a local file to keep the question simple.)
Anyway, with that tangent out of the way: once I've downloaded the remote table, I need to merge it into my local one. The local database may have fields populated that the remote one does not, and vice versa. What I need to do is check whether each field in the remote copy (which I have just downloaded into a new local table) is not NULL, and if so, copy it into the corresponding field of my local table. As an optional bonus, I'd like to detect whether the two databases have any field that is populated in both but with different values, and if so, abort the transaction and signal the error to the user.
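To make the question concrete, here's the naive merge I could write by hand, against the same toy six-column schema as above (column a standing in for the hash; all names are placeholders):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table info(a unique, b, c, d, e, f)')  # a = file hash
cur.execute('create temporary table newinfo(a, b, c, d, e, f)')

# Toy data: local knows b for h1, remote additionally knows c, and
# remote has a row (h2) the local database has never seen.
cur.execute("insert into info values ('h1', 'x', NULL, NULL, NULL, NULL)")
cur.execute("insert into newinfo values ('h1', 'x', 'y', NULL, NULL, NULL)")
cur.execute("insert into newinfo values ('h2', 'p', NULL, NULL, NULL, NULL)")

# Bonus check: a field populated on both sides with different values
# means one copy is corrupt, so abort before touching anything.
conflicts = cur.execute('''
    select info.a from info join newinfo on info.a = newinfo.a
    where (info.b is not null and newinfo.b is not null and info.b != newinfo.b)
       or (info.c is not null and newinfo.c is not null and info.c != newinfo.c)
       or (info.d is not null and newinfo.d is not null and info.d != newinfo.d)
       or (info.e is not null and newinfo.e is not null and info.e != newinfo.e)
       or (info.f is not null and newinfo.f is not null and info.f != newinfo.f)
''').fetchall()
if conflicts:
    raise ValueError('conflicting rows: %r' % conflicts)

# Merge: keep the local value where present, otherwise take the remote one.
cur.execute('''
    update info set
        b = coalesce(b, (select b from newinfo where newinfo.a = info.a)),
        c = coalesce(c, (select c from newinfo where newinfo.a = info.a)),
        d = coalesce(d, (select d from newinfo where newinfo.a = info.a)),
        e = coalesce(e, (select e from newinfo where newinfo.a = info.a)),
        f = coalesce(f, (select f from newinfo where newinfo.a = info.a))
    where a in (select a from newinfo)
''')
# Rows the local database has never seen can be inserted wholesale.
cur.execute('insert into info select * from newinfo '
            'where a not in (select a from info)')
```

This works on the toy data, but the per-column correlated subqueries feel clumsy, and I don't know whether two passes (one SELECT for conflicts, one UPDATE) is the efficient way to do this in SQLite.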
My question is: what is the most efficient way of going about this?