
A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type spec has gained enhancements, such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decades-long evolution of these file types, it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.

I built a running counter such that every time I see a property (familiar or not) I add 1 to its tally. The sqlite tally board looks like this:

[screenshot of the tally_board table: a property column plus one count column per file type]

In the special event I see an unfamiliar property I add it to the tally. On a typeA file that looks like:

[screenshot: the tally_board after new rows are added for the unfamiliar properties]

I've got this system down! But it's slow: about 3M files per 36 hours in a single process. Originally I was using this trick to pass sqlite the list of properties needing an increment.

placeholder = '?'  # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder for dummy_var in properties)
sql = """UPDATE tally_board
    SET %s = %s + 1
    WHERE property IN (%s)""" % (type_name, type_name, placeholders)  # column name and IN-list placeholders interpolated
cursor.execute(sql, properties)

I learned that's a bad idea because

  1. sqlite string search is much slower than indexed search
  2. several hundreds of properties (some 160 characters long) make for really long sql queries
  3. using %s instead of ? is bad security practice... (not a concern ATM)

A "fix" was to maintain a script side property-rowid hash of the tally used in this loop:

  1. Read the file for new_properties
  2. Read tally_board for rowid, property
  3. Generate a script-side client_hash from the read in step 2
  4. Write rows to tally_board for every new_property not already in property (nothing incremented yet), and update client_hash with the new properties
  5. Look up the rowid for every property in new_properties using client_hash (steps 2-5 are sketched after this list)
  6. Write an increment to every rowid (now a proxy for property) in tally_board
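
Roughly, steps 2-5 look like the sketch below (my assumptions: cursor and new_properties come from the surrounding code, and the per-type count columns default to 0 on insert):

client_hash = dict(cursor.execute("SELECT property, rowid FROM tally_board"))  # steps 2-3

for prop in new_properties:  # step 4
    if prop not in client_hash:
        cursor.execute("INSERT INTO tally_board (property) VALUES (?)", (prop,))
        client_hash[prop] = cursor.lastrowid

target_rows = [client_hash[prop] for prop in new_properties]  # step 5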

Step 6 looks like:

sql = """UPDATE tally_board
SET %s = %s + 1
WHERE rowid IN %s""" %(type_name, type_name, tuple(target_rows))
cur.execute

The problems with this are:

  • It's still slow!
  • It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 just before threadB completes step 6.

A solution to the race condition is to give steps 2-6 an exclusive lock on the db, though it doesn't look like reads can acquire one (see Lock A Read).
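
For reference, wrapping steps 2-6 in an explicit exclusive transaction would look something like this sketch (it assumes a connection opened with isolation_level=None so BEGIN/COMMIT are issued by hand, and the default rollback-journal mode, where an EXCLUSIVE lock also blocks readers; "tally.db" is a placeholder path):

import sqlite3

conn = sqlite3.connect("tally.db", isolation_level=None)  # autocommit; transactions managed manually
cur = conn.cursor()

cur.execute("BEGIN EXCLUSIVE")  # no other connection can read or write until COMMIT
try:
    # steps 2-6 go here: read rowid/property, insert new properties, increment
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")
    raise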

Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.

There may be luck in something like this, but I'm unsure how to rewrite it to increment the tally.
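
For a single count column, the increment flavour of that upsert might look like the sketch below (it assumes a UNIQUE index on property; with a column per type, every other count column would have to be carried over the same way, since OR REPLACE rewrites the whole row):

sql = """INSERT OR REPLACE INTO tally_board (property, typeA)
    VALUES (?, COALESCE((SELECT typeA FROM tally_board WHERE property = ?), 0) + 1)"""
cursor.execute(sql, (prop, prop))

(Newer SQLite, 3.24+, also supports a native INSERT ... ON CONFLICT(property) DO UPDATE SET typeA = typeA + 1, which avoids the carry-over problem entirely.)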

zelusp

2 Answers


How about a change of table schema? Instead of a column per type, have a type column. Then you have unique rows identified by property and type, like this:

|rowid|prop    |type |count|
============================
|1    |prop_foo|typeA|215  |
|2    |prop_foo|typeB|456  |

This means you can enter a transaction for each and every property of each and every file separately and let sqlite worry about races. So for each property you encounter, immediately issue a complete transaction that computes the next total and upserts the record identified by the property name and file type.
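
For illustration, such a per-property transaction might look like this sketch (table and column names are placeholders, and it assumes a UNIQUE constraint on (prop, type)):

import sqlite3

# Assumed schema:
#   CREATE TABLE tally (prop TEXT, type TEXT, count INTEGER NOT NULL DEFAULT 0,
#                       UNIQUE (prop, type));
def bump(conn, prop, file_type):
    with conn:  # one transaction per property; sqlite serializes concurrent writers
        conn.execute("INSERT OR IGNORE INTO tally (prop, type) VALUES (?, ?)",
                     (prop, file_type))
        conn.execute("UPDATE tally SET count = count + 1 WHERE prop = ? AND type = ?",
                     (prop, file_type))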

user268396
  • Does that have the same effect as keeping my current schema but individually (not batch) updating the table as you suggest? – zelusp Oct 27 '16 at 22:23
  • Well with your current schema you inherently create race conditions between updates arising from unrelated file types. Since a file has (presumably) only a single file type, this means you avoid a whole set of race conditions by keying your records based on file type. However you could define a view that is recomputed periodically (possibly triggered via an explicit command after a batch import of a fresh fileset) which aggregates the results in your old format for convenience. – user268396 Oct 28 '16 at 22:10
  • In turn, each race condition that you avoid implies one more case in which transactions execute in the happy/fast path, instead of incurring potentially costly locking/rollbacks. – user268396 Oct 28 '16 at 22:11
  • Also, having a single `type` column is easier to scale than having N type specific columns. After all, each new file type you didn't foresee is now just another value instead of an expensive schema update. – user268396 Oct 28 '16 at 22:15
  • It turns out writing to the db _more often_ slows things down even more. Recipe: when using sqlite + multiprocessing, write to sqlite as infrequently as possible. Offloading most of what I'd been trying to do in sqlite onto memory cut time down 60%. – zelusp Nov 01 '16 at 01:58

The following sped things up immensely:

  • Wrote less often to SQLite. Holding most of my intermediate results in memory and then updating the DB with them every 50k files cut the execution time to roughly a third (35 hours to 11.5 hours); a sketch follows this list.
  • Moved the data onto my PC (for some reason my USB 3.0 port was transferring data well below USB 2.0 rates). This cut the execution time to roughly a fifth (11.5 hours to 2.5 hours).
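
A sketch of what the first bullet's batching boils down to (parse_file, all_files, conn, and the single-type-column tally table from the other answer are stand-ins, not my exact code):

from collections import Counter

FLUSH_EVERY = 50000   # push accumulated counts to sqlite every 50k files
pending = Counter()   # (property, file_type) -> pending increment

def flush(conn, pending):
    with conn:  # one big transaction per flush
        for (prop, file_type), n in pending.items():
            conn.execute("INSERT OR IGNORE INTO tally (prop, type) VALUES (?, ?)",
                         (prop, file_type))
            conn.execute("UPDATE tally SET count = count + ? WHERE prop = ? AND type = ?",
                         (n, prop, file_type))

for i, path in enumerate(all_files, 1):
    file_type, props = parse_file(path)   # placeholder parser
    for prop in props:
        pending[(prop, file_type)] += 1
    if i % FLUSH_EVERY == 0:
        flush(conn, pending)
        pending.clear()

flush(conn, pending)  # whatever is left at the end
pending.clear()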
zelusp