I have a large static text/csv file, which contains approx 100k rows (2MB). It's essentially a dictionary, and I need to perform regular lookups on this data in Python.
The format of the file is:

```
key value1 value2
alpha x1 x2
alpha beta y1 y2
gamma z1 z2
...
```
- The keys can be multi-word strings.
- The list is sorted in alphabetical order by the key.
- The values are strings.
This is part of a web application where every user will be looking up 100-300 keys at a time, and will expect to get both value1 and value2 for each of those keys. There will be up to 100 users on the application, each looking up their 100-300 keys against the same data.
I just need to return the first exact match. For example, if the user searched for the keys [alpha, gamma], I just need to return [('x1','x2'), ('z1','z2')], which represents the first exact match of 'alpha' and 'gamma'.
I've been reading about the options I have, and I'd really love your input on which of the following approaches is best for my use case.
1.) Read the file once into a dict keyed on the first column, and perform the 200 or so lookups against it. However, for every user using the application (~100), the file would be loaded into memory again.
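Something like this minimal sketch is what I have in mind for 1.), assuming the key is everything except the last two whitespace-separated fields (the file name is just a placeholder):

```python
def load_table(path):
    """Parse the whole file into a dict of key -> (value1, value2)."""
    table = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank/malformed lines
            key = ' '.join(parts[:-2])  # keys can be multi-word
            # keep only the first occurrence, so lookups return the first exact match
            table.setdefault(key, (parts[-2], parts[-1]))
    return table

table = load_table('dictionary.csv')
results = [table[k] for k in ['alpha', 'gamma'] if k in table]
```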
2.) Read the file once into a sorted list, and use binary search (e.g. the bisect module). Similar problem as 1.): the file will be loaded into memory for every user who needs to do a search.
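Roughly what I mean for 2.), with the same assumption about how each line is parsed:

```python
import bisect

def load_rows(path):
    """Return a sorted list of keys plus a parallel list of (value1, value2) tuples."""
    rows = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                rows.append((' '.join(parts[:-2]), (parts[-2], parts[-1])))
    # stable sort by key only, so the first occurrence of a duplicate key stays first
    rows.sort(key=lambda r: r[0])
    keys = [k for k, _ in rows]
    values = [v for _, v in rows]
    return keys, values

def lookup(keys, values, key):
    """Binary search for the first exact match of key; returns None if absent."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None
```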
3.) Don't read the entire file into memory, and just scan the file one line at a time for each batch of lookups. I could split the .csv into 26 files by first letter (a.csv, b.csv, ...) to speed this up a bit.
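For 3.), the single-pass scan I have in mind (leaving out the split-by-letter part) would be something like:

```python
def scan_lookup(path, wanted_keys):
    """One pass over the file, collecting the first exact match for each wanted key."""
    wanted = set(wanted_keys)
    found = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            key = ' '.join(parts[:-2])
            if key in wanted and key not in found:
                found[key] = (parts[-2], parts[-1])
                if len(found) == len(wanted):
                    break  # every requested key has been found; stop early
    return found
```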
4.) Whoosh is a search library that caught my eye, since it creates an index once and reuses it. However, I'm not sure it's applicable to my use case at all, as it looks like full-text search and I can't tell whether I can limit matching to just the first column. If this specific library is not an option, is there any other way I can create a reusable index in Python to support these kinds of lookups?
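For 4.), this is roughly what I've pieced together from skimming the Whoosh docs, using an ID field for the key so it isn't tokenized; I'm not at all sure this is the right (or intended) way to use the library:

```python
import os
from whoosh import index
from whoosh.fields import Schema, ID, STORED

schema = Schema(key=ID(stored=True), value1=STORED, value2=STORED)

def build_index(path, index_dir='indexdir'):
    """One-off step: index the file with the first column as an exact-match key."""
    if not os.path.exists(index_dir):
        os.mkdir(index_dir)
    ix = index.create_in(index_dir, schema)
    writer = ix.writer()
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                writer.add_document(key=' '.join(parts[:-2]),
                                    value1=parts[-2], value2=parts[-1])
    writer.commit()
    return ix

def lookup(ix, keys):
    """Return (value1, value2) for the first document matching each key exactly."""
    results = []
    with ix.searcher() as searcher:
        for k in keys:
            doc = searcher.document(key=k)  # stored fields of first matching doc, or None
            if doc is not None:
                results.append((doc['value1'], doc['value2']))
    return results
```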
I'm really open to ideas and I'm in no way restricted to the four options above!
Thank you :)