
I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns). To improve performance, in-process I can generate sorted/structured lists, maps, and trees once and hit those repeatedly, rather than hitting the file system each time.

However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.

Requirements:

  • Must be fast - any sort of per-execution processing should be minimized; this includes disk IO and object construction.
  • Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
  • Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough.
  • Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
  • Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.

Preferences:

  • I'd like to avoid any sort of long-running daemon process at all, if possible.
  • While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world; fast reads are much more important than fast writes.

In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.

Candidates:

  • SQLite in memory

    SQLite on its own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems like I'd need to persistently store the database in memory. SQLite allows DBs to use memory as storage, but those DBs are destroyed upon program exit and cannot be shared between instances.
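
    For example, I imagine I could pay one sequential read at startup by copying the disk database into a private in-memory copy, along these lines (just a sketch; "cache.db" and the "files" table are made-up names, and Connection.backup requires Python 3.7+):

        import sqlite3

        # Pay one sequential read of the cache file at startup, then serve
        # every query from the in-memory copy.
        disk = sqlite3.connect("cache.db")   # hypothetical on-disk cache
        mem = sqlite3.connect(":memory:")
        disk.backup(mem)                     # Connection.backup: Python 3.7+
        disk.close()

        rows = mem.execute(
            "SELECT path, size FROM files WHERE size > ? ORDER BY size DESC",
            (1_000_000,),
        ).fetchall()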

  • Flat file database loaded into memory with mmap

    On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, which can share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit, however. It's ok if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another, and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory mapped files get dropped from memory and need to be reloaded from disk.

    I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
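
    To make the idea concrete, here's roughly what I'm picturing (just a sketch; "index.bin" and the fixed-width record layout are made up). My understanding is that the mmap object itself dies with the process, but the pages it touched stay in the OS page cache until memory pressure evicts them, so a back-to-back run that maps the same file should mostly be served from RAM:

        import mmap
        import struct

        # Hypothetical fixed-width record: 8-byte key + 32-byte value.
        RECORD = struct.Struct("<Q32s")

        with open("index.bin", "rb") as f:   # hypothetical cache file
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                count = len(mm) // RECORD.size
                # Scan (or binary-search, if the records are sorted) without
                # copying the whole file into Python objects first.
                for i in range(count):
                    key, value = RECORD.unpack_from(mm, i * RECORD.size)
            finally:
                mm.close()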

  • Share Python objects between processes using Processing

    The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
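
    From what I can tell, the Processing package has since become the standard library's multiprocessing module, and newer Pythons (3.8+) expose named shared-memory blocks via multiprocessing.shared_memory. Here's a rough sketch of what I imagine (the block name "filecache" is made up). The caveat I see is that on Windows the block is released once the last handle to it closes, and Python's resource tracker may reclaim blocks it considers leaked, so this looks better suited to processes that are alive at the same time than to fully independent runs:

        import pickle
        import struct
        from multiprocessing import shared_memory

        # Length prefix, since the block may be padded beyond the pickle's size.
        HEADER = struct.Struct("<Q")

        def publish(cache, name="filecache"):
            """Pickle the cache into a named shared-memory block."""
            data = pickle.dumps(cache, protocol=pickle.HIGHEST_PROTOCOL)
            shm = shared_memory.SharedMemory(name=name, create=True,
                                             size=HEADER.size + len(data))
            shm.buf[:HEADER.size] = HEADER.pack(len(data))
            shm.buf[HEADER.size:HEADER.size + len(data)] = data
            return shm

        def attach(name="filecache"):
            """Unpickle the cache from an existing block (FileNotFoundError if absent)."""
            shm = shared_memory.SharedMemory(name=name)
            (length,) = HEADER.unpack(bytes(shm.buf[:HEADER.size]))
            return pickle.loads(bytes(shm.buf[HEADER.size:HEADER.size + length])), shm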

  • Store data on a RAM disk

    My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
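
    If I went this route, the most portable compromise I can think of (just a sketch; "querycache.db" is a made-up name) would be to use tmpfs where it exists and fall back to a normal temp directory elsewhere, since there's no standard RAM-disk path on Windows or macOS:

        import os
        import sqlite3
        import tempfile

        # Prefer a RAM-backed filesystem when one is available (Linux tmpfs),
        # otherwise degrade gracefully to an ordinary temp directory.
        cache_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
        db_path = os.path.join(cache_dir, "querycache.db")   # hypothetical filename

        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER)")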

I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?

I know there are a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.

I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.

Thank you so much for your help!

dimo414
    I wouldn't jump to conclusions like this too quickly: "SQLite on it's own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution." You only know if it will be fast enough if you try it. Moreover, the OS will do caching anyway. – Sven Marnach Jun 09 '12 at 17:42
  • Redis or Memcached may also be possible solutions. – Sven Marnach Jun 09 '12 at 17:44
  • Why *not* use a long-running process? This usecase is exactly what Redis and Memcached solve, with the added advantage that you can keep your cache across restarts of your own program. – Martijn Pieters Jun 09 '12 at 17:49
  • If you require querying beyond key/val, then how would pickle be a candidate unless you are pickling your own custom query-enabled class? Also pickle isn't a good solution for shared memory, unless you add locking around the file access. Do you need shared memory between multiple processes or is this only for a single process? Pickle works if they are all just reading though – jdi Jun 09 '12 at 18:03
  • @SvenMarnach, that's a good point about the OS caching, I hadn't taken that into consideration. But even if the file's cached, I'm concerned (probably unnecessarily) about the cost of SQLite parsing/loading the file every time. – dimo414 Jun 09 '12 at 19:01
  • @MartijnPieters, I agree, but I'm trying to minimize dependencies and barriers to setup. If I can have a tool whose only basic dependency is Python, that's much more desirable than one which also requires the user to set up Memcached or something similar. – dimo414 Jun 09 '12 at 19:03
  • @jdi Like I said, quick reads are more important than quick writes, and data doesn't have to stay in-sync. I'm not positive exactly how I'd do it, but I imagine I'd have a mmap/other memory object storing the pickled data, which is read in by as many processes as need it, then if one is making an update, it writes to a temp file (and therefore holds a lock) which it then replaces with the file being read atomically. – dimo414 Jun 09 '12 at 19:19

1 Answer


I would just do the simplest thing that might possibly work. ...which in your case would likely just be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). Donald Knuth says "Premature optimization is the root of all evil"!
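
Something along these lines (untested; the cache path and build_cache_from_files are placeholders for your own expensive build step), writing to a temp file and renaming so a concurrent reader never sees a half-written cache:

    import os
    import pickle

    CACHE_PATH = os.path.expanduser("~/.mytool_cache.pickle")   # placeholder location

    def load_cache():
        """Return the cached structures, rebuilding only if no usable cache exists."""
        try:
            with open(CACHE_PATH, "rb") as f:
                return pickle.load(f)
        except (OSError, pickle.PickleError):
            cache = build_cache_from_files()   # your existing build step
            save_cache(cache)
            return cache

    def save_cache(cache):
        """Write atomically so readers never see a partial file."""
        tmp = CACHE_PATH + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp, CACHE_PATH)            # atomic replace (Python 3.3+)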

Gerrat