I have a large static text/csv file, which contains approx 100k rows (2MB). It's essentially a dictionary, and I need to perform regular lookups on this data in Python.
The format of the file is:

```
key value1 value2
alpha x1 x2
alpha beta y1 y2
gamma z1 z2
...
```
- The keys can be multi-word strings.
- The list is sorted in alphabetical order by the key.
- The values are strings.
This is part of a web application where every user will be looking up 100-300 keys at a time, and will expect to get both value1 and value2 for each of those keys. There will be up to 100 users on the application, each looking up their 100-300 keys against the same data.
I just need to return the first exact match. For example, if the user searched for the keys [alpha, gamma], I just need to return [('x1','x2'), ('z1','z2')], which represents the first exact match of 'alpha' and 'gamma'.
I've been reading about the options I have, and I'd really love your input on which of the following approaches is best for my use case.
1.) Read the file once into a dict keyed on the first column, and perform the 200 or so lookups against it. However, for every user using the application (~100), the file would be loaded into memory again.
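Something like this minimal sketch is what I have in mind for 1.), assuming the key is everything except the last two whitespace-separated fields (the file name is just a placeholder):

```python
def load_table(path):
    """Parse the whole file into a dict of key -> (value1, value2)."""
    table = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank/malformed lines
            key = ' '.join(parts[:-2])  # keys can be multi-word
            # keep only the first occurrence, so lookups return the first exact match
            table.setdefault(key, (parts[-2], parts[-1]))
    return table

table = load_table('dictionary.csv')
results = [table[k] for k in ['alpha', 'gamma'] if k in table]
```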
2.) Read the file once into a sorted list, and use binary search (e.g. the bisect module). Similar problem as 1.): the file will be loaded into memory for every user who needs to do a search.
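Roughly what I mean for 2.), with the same assumption about how each line is parsed:

```python
import bisect

def load_rows(path):
    """Return a sorted list of keys plus a parallel list of (value1, value2) tuples."""
    rows = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                rows.append((' '.join(parts[:-2]), (parts[-2], parts[-1])))
    # stable sort by key only, so the first occurrence of a duplicate key stays first
    rows.sort(key=lambda r: r[0])
    keys = [k for k, _ in rows]
    values = [v for _, v in rows]
    return keys, values

def lookup(keys, values, key):
    """Binary search for the first exact match of key; returns None if absent."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None
```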
3.) Don't read the entire file into memory, and just scan the file one line at a time for each batch of lookups. I could split the .csv into 26 files by first letter (a.csv, b.csv, ...) to speed this up a bit.
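For 3.), the single-pass scan I have in mind (leaving out the split-by-letter part) would be something like:

```python
def scan_lookup(path, wanted_keys):
    """One pass over the file, collecting the first exact match for each wanted key."""
    wanted = set(wanted_keys)
    found = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            key = ' '.join(parts[:-2])
            if key in wanted and key not in found:
                found[key] = (parts[-2], parts[-1])
                if len(found) == len(wanted):
                    break  # every requested key has been found; stop early
    return found
```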
4.) Whoosh is a search library that caught my eye, since it creates an index once and reuses it. However, I'm not sure it's applicable to my use case at all, as it looks like full-text search and I can't tell whether I can limit matching to just the first column. If this specific library is not an option, is there any other way I can create a reusable index in Python to support these kinds of lookups?
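For 4.), this is roughly what I've pieced together from skimming the Whoosh docs, using an ID field for the key so it isn't tokenized; I'm not at all sure this is the right (or intended) way to use the library:

```python
import os
from whoosh import index
from whoosh.fields import Schema, ID, STORED

schema = Schema(key=ID(stored=True), value1=STORED, value2=STORED)

def build_index(path, index_dir='indexdir'):
    """One-off step: index the file with the first column as an exact-match key."""
    if not os.path.exists(index_dir):
        os.mkdir(index_dir)
    ix = index.create_in(index_dir, schema)
    writer = ix.writer()
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                writer.add_document(key=' '.join(parts[:-2]),
                                    value1=parts[-2], value2=parts[-1])
    writer.commit()
    return ix

def lookup(ix, keys):
    """Return (value1, value2) for the first document matching each key exactly."""
    results = []
    with ix.searcher() as searcher:
        for k in keys:
            doc = searcher.document(key=k)  # stored fields of first matching doc, or None
            if doc is not None:
                results.append((doc['value1'], doc['value2']))
    return results
```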
I'm really open to ideas and I'm in no way restricted to the four options above!
Thank you :)