
I have an SQLite table containing metadata about some files, with a UNIQUE column on the file's hash. The client I'm writing can, to a limited extent, populate this database on its own just by parsing the files in question. However, the database is primarily used for metadata about files the client may want to download, and more importantly, the authors of many of these files did not include metadata in them, in which case the relevant database row must be populated by humans. For those reasons, the client also queries several user-defined servers, each of which has its own possibly-incomplete copy of the database. Each server may have some blank (NULL) fields in a row that another server has populated. Every column of the database has a single canonical "correct" value for any given row, so if two different copies of the same database populate the same field with different values, then one of them is corrupt.

I download a copy of the remote database, which is supplied to me as a CSV file, and load it into my database (which I've already figured out -- this is Python, so it's really as simple as

import csv

# newinfo mirrors the six columns of the real table
self.cursor.execute('create temporary table newinfo(a,b,c,d,e,f)')
with open('remote.csv', newline='') as f:
    self.cursor.executemany('insert into newinfo values (?,?,?,?,?,?)', csv.reader(f))

The performance of that probably isn't great, but if I cared about performance I wouldn't be using Python. Any suggestions to make it faster would be welcome but aren't expected. Also, I should mention that I'm not actually downloading to a temporary file and parsing that -- I'm really using an io.TextIOWrapper around the HTTPResponse object -- I just wrote it as opening a local file for the sake of simplicity in this question.)
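In case it helps to picture it, the streaming version looks roughly like this (the URL and encoding are invented for the example):

import csv
import io
import urllib.request

# The HTTPResponse is a binary file-like object, so TextIOWrapper
# decodes it on the fly and csv.reader consumes it without ever
# touching the disk.
with urllib.request.urlopen('https://example.com/remote.csv') as resp:
    reader = csv.reader(io.TextIOWrapper(resp, encoding='utf-8', newline=''))
    self.cursor.executemany('insert into newinfo values (?,?,?,?,?,?)', reader)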

Anyway, with that tangent out of the way: once I've downloaded the remote database table, I need to merge it into my local one. It's entirely possible that the local database will have fields populated that the remote one does not, and vice versa. What I need to do is check whether the fields in the remote copy (which I've just downloaded into a new local table) are not NULL, and if so, copy them into the same fields of my local table. As an optional bonus, I'd like to detect whether the remote and local databases have any fields that are both populated and don't match, and if so, abort the transaction and signal the error to the user.

My question is: what is the most efficient way of going about this?

wallefan

1 Answer


You can merge the values by joining the old table and the new temporary table on the file hash and using coalesce.

The coalesce() function returns a copy of its first non-NULL argument, or NULL if all arguments are NULL. Coalesce() must have at least 2 arguments.
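For example, coalesce(NULL, 'remote value', 'local value') returns 'remote value', which is exactly the behaviour you want for preferring whichever copy has a field populated.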

Because you'd need old rows that are not in the new rows and vice versa, you'd need a FULL OUTER JOIN, which SQLite doesn't support. But you can emulate it with a UNION of LEFT OUTER JOINs (explained in this article).

Then you can create a new table out of the result set and drop the old one.

const sql = `
  CREATE TABLE oldinfo(hash UNIQUE, a, b, c);
  INSERT INTO oldinfo VALUES
    ('x', 10, NULL, 20),
    ('z', NULL, NULL, 30);

  CREATE TEMP TABLE newinfo(hash UNIQUE, a, b, c);
  INSERT INTO newinfo VALUES
    ('x', NULL, 1, NULL),
    ('y', 2, NULL, NULL),
    ('z', NULL, NULL, 3);

  -- left side of the emulated FULL OUTER JOIN: every old row,
  -- merged with its new counterpart when one exists
  SELECT
    coalesce(n.hash, o.hash) hash,
    coalesce(n.a, o.a) a,
    coalesce(n.b, o.b) b,
    coalesce(n.c, o.c) c
  FROM oldinfo o
  LEFT JOIN newinfo n USING(hash)
  UNION
  -- right side: every new row; UNION drops the duplicates
  SELECT
    coalesce(n.hash, o.hash) hash,
    coalesce(n.a, o.a) a,
    coalesce(n.b, o.b) b,
    coalesce(n.c, o.c) c
  FROM newinfo n
  LEFT JOIN oldinfo o USING(hash);
`

async function run() {
  // load the sql.js WASM build, execute the script above, and render
  // the merged result set with Knockout
  const wasmUrl = 'https://cdnjs.cloudflare.com/ajax/libs/sql.js/1.5.0/sql-wasm.wasm'
  const sqljs = await window.initSqlJs({locateFile: file => wasmUrl})
  const db = new sqljs.Database()
  const results = db.exec(sql)
  ko.applyBindings(results[0])
}
run()
table {border-collapse: collapse}
th, td {border: 1px solid black}
<script src="https://cdnjs.cloudflare.com/ajax/libs/knockout/3.4.2/knockout-min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/sql.js/1.5.0/sql-wasm.min.js"></script>
<table>
  <thead>
    <tr data-bind="foreach: columns"><th data-bind="text: $data"></th></tr>
  </thead>
  <tbody data-bind="foreach: values">
    <tr data-bind="foreach: $data"><td data-bind="text: $data"></td></tr>
  </tbody>
</table>
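As for the optional bonus (aborting when both copies populate a field with different values), the snippet above doesn't show it, but a sketch along these lines could work; it reuses the example schema and assumes the question's self.cursor:

# Hedged sketch: find rows where both copies populate the same field
# with different values; per the question, that means one copy is
# corrupt, so abort instead of merging.
conflicts = self.cursor.execute('''
    SELECT o.hash
    FROM oldinfo o JOIN newinfo n USING(hash)
    WHERE (o.a IS NOT NULL AND n.a IS NOT NULL AND o.a <> n.a)
       OR (o.b IS NOT NULL AND n.b IS NOT NULL AND o.b <> n.b)
       OR (o.c IS NOT NULL AND n.c IS NOT NULL AND o.c <> n.c)
''').fetchall()
if conflicts:
    raise ValueError('conflicting values for hashes: %r' % (conflicts,))

# Materialise the merged result and swap it in.  Note that
# CREATE TABLE ... AS SELECT doesn't copy the UNIQUE constraint,
# so it has to be recreated (e.g. with CREATE UNIQUE INDEX) afterwards.
self.cursor.executescript('''
    CREATE TABLE merged AS
      SELECT coalesce(n.hash, o.hash) hash,
             coalesce(n.a, o.a) a,
             coalesce(n.b, o.b) b,
             coalesce(n.c, o.c) c
      FROM oldinfo o LEFT JOIN newinfo n USING(hash)
      UNION
      SELECT coalesce(n.hash, o.hash),
             coalesce(n.a, o.a),
             coalesce(n.b, o.b),
             coalesce(n.c, o.c)
      FROM newinfo n LEFT JOIN oldinfo o USING(hash);
    DROP TABLE oldinfo;
    ALTER TABLE merged RENAME TO oldinfo;
''')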

Python performance bonus

The performance of that probably isn't great, but if I cared about performance I wouldn't be using Python. Any suggestions to do that faster would be welcome but are not expected.

CPython performance likely isn't to blame here. If you've worked with other databases' Python DB-API drivers, you probably expect executemany to do some sort of multi-value INSERT optimisation for you, but it doesn't (and if you're not in an explicit transaction, you're paying for a transaction per statement, which is very slow). I explained the alternatives in this answer.
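The single cheapest fix is usually to run the whole load in one explicit transaction; a minimal sketch, assuming a plain sqlite3 connection rather than the question's wrapper class:

import csv
import sqlite3

conn = sqlite3.connect('files.db')  # hypothetical path
conn.execute('create temporary table newinfo(a,b,c,d,e,f)')
with conn:  # one commit for the whole load instead of one per statement
    with open('remote.csv', newline='') as f:
        conn.executemany('insert into newinfo values (?,?,?,?,?,?)',
                         csv.reader(f))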

saaj
  • This does exactly what I want. Thank you. Also thanks for the bit at the end regarding my bonus question about performance. It's people like you that keep me from completely losing my faith in StackOverflow. – wallefan May 27 '21 at 06:58