I've been working on and off for a few months on a Python script that connects to the Twitter streaming API and hunts for anagrams.
Source is on GitHub. It's simple: when I receive a new tweet, I strip it down to its alphabetic characters and sort them; the sorted string serves as a hash, since every anagram of the same letters reduces to the same string.
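For concreteness, the hashing step amounts to something like this (a simplified sketch; the function name is mine, the real code is in the repo):

```python
import re

def anagram_hash(text):
    # Keep only letters, lowercase them, and sort: any two anagrams
    # of the same letters map to the same string.
    letters = re.sub(r'[^a-z]', '', text.lower())
    return ''.join(sorted(letters))

# anagram_hash("Listen up!") == anagram_hash("Silent! Up.")  -> True
```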
Currently the hashes are stored in a Python set, because checking the on-disk database was taking too long. But I also wasn't using a UNIQUE constraint on the hash column.
How much of a performance improvement would UNIQUE give me? Is there a way to check inclusion without a SELECT statement? Ideally, I guess, the hash should be the PRIMARY KEY. Inclusion checks are currently separated from fetches; fetches are performed periodically, in batches, for performance.
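If the hash were the primary key, I imagine the table and the inclusion check would look roughly like this (just a sketch; the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect("anagrams.db")
# A TEXT PRIMARY KEY gets an implicit unique index, so lookups are
# B-tree probes rather than full-table scans.
conn.execute("""
    CREATE TABLE IF NOT EXISTS tweets (
        hash     TEXT PRIMARY KEY,
        tweet_id INTEGER,
        body     TEXT
    )
""")

def seen(h):
    # EXISTS stops at the first matching row.
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM tweets WHERE hash = ?)", (h,)
    ).fetchone()
    return bool(row[0])
```

(One alternative to a SELECT-based check would be INSERT OR IGNORE followed by inspecting the cursor's rowcount, which tests and inserts in a single statement.)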
Basically, I need a solution that lets me do tons of inclusion checks (maybe up to 50/s, against a database that might reach 25 million rows) and regular batch fetches, but not much else. I don't, for instance, need to delete very often.
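The batch fetch could then be a single IN query over the accumulated hashes, something like (again, only a sketch):

```python
def fetch_batch(conn, hashes):
    # One round trip for the whole batch instead of a SELECT per hash.
    # Older SQLite builds cap bound parameters at 999, so very large
    # batches would need chunking.
    placeholders = ",".join("?" * len(hashes))
    query = f"SELECT hash, tweet_id, body FROM tweets WHERE hash IN ({placeholders})"
    return conn.execute(query, list(hashes)).fetchall()
```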
Does this seem feasible for an on-disk SQLite store? A :memory: SQLite store? Another DB entirely? Or am I just not going to get this kind of performance without native Python data structures? If so, I'd stick with my current general strategy and spend my energy on a more efficient hashing scheme.