I am looking to maximize performance in the following situation:
- I have about 1M ids stored in my database.
- I have a daily export of ~1.05M ids that I do not control.
- I need to see which items I have -- skip those -- see which items I don't have -- add those -- and see which items have been removed -- prune those.
When should I use a database to use this and when should I use a set? For example, the difference to 'pass':
# using a set
all_ids = select id from mytable
all_ids = set(all_ids)
for item in report:
id = item['id']
if id in all_ids:
pass
# using a db lookup for each item
for item in report:
id = item['id']
if (select 1 from mytable where id=id):
pass
When should I use which? Obviously, if there were only ten items, an in-memory list or set would work fine; on the other hand, if there were 1 trillion ids, I would need to use a db lookup, because I wouldn't have enough memory locally. How would I make this determination?