performance improvements for python + very large sqlite database

Question

I've been working on and off for a few months on a python script that connects to the twitter streaming API and hunts for anagrams.

Source is on github. It's simple; when I receive a new tweet I strip it to alpha characters, and sort that string in alphabetical order. This serves as a hash.

Currently hashes are stored in a python set, because checking the (on disk) database was taking too long. But: I also wasn't using UNIQUE on the hashes key.

How much of a performance improvement would I get with UNIQUE? Is there a way for checking inclusion without using a SELECT statement? Ideally I guess the hash should be the PRIMARY KEY. Inclusion checking is currently separated from fetching; fetching is performed periodically in batches, to increase performance.

Basically I need a solution that lets me do tons of inclusion checks (maybe up to 50/s, on a database that might have 25m rows) and do regular batch fetches, but not much else. I do not, for instance, need to delete very often.

Does this seem feasible for an on-disk sqlite store? A :memory: sqlite store? Another DB solution? Am I just not going to be able to get this kind of performance without using native python data structures? If so, I would just sort of stick with my current general strategy, and spend my energy coming up with a more efficient hashing system.

SQLite uses BTree indexes, which are `O(log(n))` for searches. Whether that's faster or slower than your (presumably) `O(1)` hash is an open question. — Robert Harvey, Jul 08 '13 at 21:13
do you have an index on the "hash" column? also, why are you messing around stripping spaces and sorting the string? why not just use the tweet text? and what version of sqlite are you using? and what does sqlite `explain` show when you run a query by hand? some kind of index should make things faster. whether you are currently using one or not is unclear. — andrew cooke, Jul 08 '13 at 21:36
@andrewcooke By the definition of what an anagram is (http://en.wikipedia.org/wiki/Anagram), sorting the two words, and then comparing them for equality, is actually a clever way of determining if one word/phrase is an anagram of the other. — Robert Kajic, Jul 08 '13 at 21:47
@kajic oh sorry - i completely misse dthe anagram part; i thought this was looking for duplicates. makes much more sense now. — andrew cooke, Jul 08 '13 at 21:51

score 0 · Accepted Answer · edited May 23 '17 at 12:29

0

What is wrong with using a set? Is you application consuming too much memory?

You will never get as good performance using a database, as you get using in-memory python data structures, but an database index will definitely give you 50 lookups per second. You can expect thousands of selects per second, at the very least.

Read more about SQLite performance here:

Improve INSERT-per-second performance of SQLite?

If you decide to use an database you can do lookups using something like:

SELECT count(*) as exists FROM anagrams WHERE letters='abc' LIMIT 1;

You don't need an unique index. Just create a regular index (http://www.sqlite.org/lang_createindex.html):

CREATE INDEX letters_anagrams ON anagrams (letters);

edited May 23 '17 at 12:29

Community

1
1

answered Jul 08 '13 at 21:37

Robert Kajic

8,689
4
44
43

Yea, the problem with using a set is that it was really starting to consume a lot of memory that could be better used for things like caching. – cmyr Jul 08 '13 at 22:38
Then use a database with an index. – Robert Kajic Jul 08 '13 at 23:05
alright, thanks I'll do some more reading and try and develop some better formed questions. – cmyr Jul 09 '13 at 14:37

performance improvements for python + very large sqlite database

1 Answers1