
I am looking to maximize performance in the following situation:

  • I have about 1M ids stored in my database.
  • I have a daily export of ~1.05M ids that I do not control.
  • I need to see which items I have -- skip those -- see which items I don't have -- add those -- and see which items have been removed -- prune those.

When should I use a database for this and when should I use a set? For example, here are the two approaches, with the actual work stubbed out as `pass`:

# using a set
cursor.execute("SELECT id FROM mytable")
all_ids = set(row[0] for row in cursor)
for item in report:
    if item['id'] in all_ids:
        pass  # already have it

# using a db lookup for each item
for item in report:
    cursor.execute("SELECT 1 FROM mytable WHERE id = %s", (item['id'],))
    if cursor.fetchone():
        pass  # already have it

When should I use which? Obviously, if there were only ten items, an in-memory list or set would work fine; on the other hand, if there were 1 trillion ids, I would need to use a db lookup, because I wouldn't have enough memory locally. How would I make this determination?
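For concreteness, the set-based version of the add/skip/prune logic amounts to two set differences (pure Python here, with the database read stubbed out as a plain list):

```python
# Sketch of the set-based diff: compare the ids currently stored
# against the ids in today's report. The two input lists stand in
# for "SELECT id FROM mytable" and the parsed report.
def diff_ids(stored_ids, report_ids):
    stored = set(stored_ids)
    report = set(report_ids)
    to_add = report - stored    # in the report but not yet in the db
    to_prune = stored - report  # in the db but gone from the report
    # everything in (stored & report) is unchanged and gets skipped
    return to_add, to_prune

to_add, to_prune = diff_ids([1, 2, 3], [2, 3, 4])
```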

David542
  • I have a feeling there is no definitive answer to this, as it will depend primarily on the size of your database entries, the specific database you're considering, the capabilities of the machines you're targeting, etc. – jedwards Mar 09 '15 at 19:15
  • @jedwards How would I go about profiling the above then to find out? – David542 Mar 09 '15 at 19:17
  • You can [index](http://dev.mysql.com/doc/refman/5.5/en/optimization-indexes.html) the `id` column; that will improve the query performance significantly but will result in extra memory consumption and slower insert and update queries. – Ashwini Chaudhary Mar 09 '15 at 19:20
  • When you say "I need to see which items I have -- skip those -- see which items I don't have -- add those -- and see which items have been removed -- prune those", you are just saying that the ids will now be determined by the report. – Alan Mar 09 '15 at 19:21
  • @Alan -- that is correct, yes. – David542 Mar 09 '15 at 19:33

1 Answer


To avoid running out of memory in the application, let the database do the searching; that's what it's good at. I'm assuming you have the `id` column indexed.

Use a prepared (parameterized) query for the SELECT. A prepared query only needs to be parsed once; from then on, only the parameter is sent, via an efficient binary protocol. This saves a lot of overhead, since you're not sending the entire query text to the database server for each row.
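A parameterized per-row lookup looks roughly like this. The sketch uses the stdlib `sqlite3` module as a stand-in so it runs anywhere; with MySQLdb the placeholder is `%s` instead of `?`, and the table setup below is just test scaffolding:

```python
import sqlite3

# In-memory database standing in for the real MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO mytable (id) VALUES (?)", [(1,), (2,), (3,)])

def have_id(conn, item_id):
    # The query text never changes; only the bound parameter does,
    # so the database can reuse the parsed statement.
    row = conn.execute(
        "SELECT 1 FROM mytable WHERE id = ?", (item_id,)).fetchone()
    return row is not None
```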

The standard MySQL drivers for Python may not support true prepared statements, so you may have to use a third-party library such as oursql, mentioned in Does the MySQLdb module support prepared statements?.

MySQL has optimizations built in for reading ahead on disk when it reads many consecutive pages, which lets it take advantage of modern disk caches that also read ahead. In fact, MySQL might not hit disk at all if your server has enough memory to keep the entire index in memory. A single bulk query would also remove the network overhead and latency of sending a million commands and waiting for each response. So, while it would be efficient for MySQL to hand you all of the data at once, you'd still be stuck allocating memory for all of the set entries, searching through the set in the application, and possibly running out of memory.

Even if you read all of the data in at once, you're likely using buffered queries by default, so the result is sitting in memory until you read it all anyway. An unbuffered query would complicate matters even more.
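One way to push the whole comparison into the database, in the spirit of "let the database do the searching", is to bulk-load the report ids into a temporary table and let SQL compute both differences. This is a sketch with `sqlite3` (the table names and sample ids are made up); the same idea works in MySQL with a `TEMPORARY` table:

```python
import sqlite3

# In-memory database standing in for the real MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO mytable (id) VALUES (?)", [(1,), (2,), (3,)])

# Bulk-load the daily report into a temporary table.
conn.execute("CREATE TEMP TABLE report (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO report (id) VALUES (?)", [(2,), (3,), (4,)])

# Ids to add: in the report but not yet stored.
to_add = [r[0] for r in conn.execute(
    "SELECT id FROM report WHERE id NOT IN (SELECT id FROM mytable)")]

# Ids to prune: stored but missing from the report.
to_prune = [r[0] for r in conn.execute(
    "SELECT id FROM mytable WHERE id NOT IN (SELECT id FROM report)")]
```

Both differences are then computed server-side against the index, and only the (presumably small) add/prune lists travel back to the application.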

Marcus Adams
  • Thanks for this explanation, that helps a lot. How would I make a `prepared query` with mysql? Could you please give an example of that? – David542 Mar 09 '15 at 19:57
  • Doh! Python doesn't support prepared (parameterized) queries by default. But the [oursql](https://launchpad.net/oursql) library does. – Marcus Adams Mar 09 '15 at 20:04