
I have a csv file of size 8-12 GB, and I would like to be able to search the first column of the file and retrieve the whole row whenever there is a match. Each time, I need to run the search for a set of more than 100K keys and retrieve the corresponding record for each of them.

There are a few approaches I could choose from:

1) Use a simple grep for each key in the file ==> 100K grep commands.

2) Build a SQL database, load the file into it, and index the first column; then either: a) run one SELECT query per key, or b) insert all the keys into a temporary table and do a set-membership join (a sketch of this approach follows below).

3) Build a hash table, e.g. a Python dictionary, and look it up by key. But then I need to load it into memory every time I run a batch of queries (I don't want it to occupy the memory permanently).

I'm not sure which method is most efficient, or whether there are better options I'm not aware of.
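
For option 2b, a minimal sketch using Python's built-in sqlite3 module could look like the following; the file name foo.csv, the separator ';', the database name lookup.db, and the example keys are assumptions:

import csv
import sqlite3

# Build the database once and index the key column.
conn = sqlite3.connect('lookup.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (key TEXT, line TEXT)')
with open('foo.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    conn.executemany('INSERT INTO records VALUES (?, ?)',
                     ((row[0], ';'.join(row)) for row in reader if row))
conn.execute('CREATE INDEX IF NOT EXISTS idx_key ON records (key)')
conn.commit()

# Bulk lookup: load the keys into a temporary table and join.
keys = ['k1', 'k2']  # hypothetical keys
conn.execute('CREATE TEMP TABLE wanted (key TEXT PRIMARY KEY)')
conn.executemany('INSERT OR IGNORE INTO wanted VALUES (?)', ((k,) for k in keys))
matches = conn.execute(
    'SELECT records.line FROM records JOIN wanted ON records.key = wanted.key'
).fetchall()

The upside of this approach is that the index is built once and lives on disk, so later batches of keys don't need to reload anything into memory.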

Alex

2 Answers


You can read the csv in chunks and iterate over them using pandas. Perhaps this solution can work for you: How to read a 6 GB csv file with pandas
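
A minimal sketch of that idea, assuming the separator is ';', the key column is the first one, and keys is a set of the keys you are looking for:

import pandas as pd

keys = {'k1', 'k2'}  # hypothetical set of keys
matches = []
# Read the large file in manageable chunks instead of all at once.
for chunk in pd.read_csv('foo.csv', sep=';', header=None, dtype=str,
                         chunksize=1_000_000):
    # Keep only the rows whose first column is one of the wanted keys.
    matches.append(chunk[chunk[0].isin(keys)])

result = pd.concat(matches, ignore_index=True)

Tune chunksize to whatever fits comfortably in your RAM.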

mwkrimson

The fastest solution (if you have plenty of RAM) would be to just mmap the whole file.
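
A minimal sketch of that idea; the file name, separator, and keys are assumptions, and note that the keys must be bytes because mmap yields bytes:

import mmap

keys = {b'k1', b'k2'}  # hypothetical keys, as bytes
sep = b';'             # assumed separator

with open('foo.csv', 'rb') as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    for line in iter(mm.readline, b''):
        # Compare only the first column against the key set.
        if line.split(sep, 1)[0] in keys:
            pass  # process the matching row in line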

What would certainly work is to read the file one line at a time:

# keys is an iterable of keys; use a set for fast membership tests.
sep = ';'  # The separator used in the CSV.
matches = []
with open('foo.csv') as f:
    for line in f:
        to = line.find(sep)
        if line[:to] in keys:
            matches.append(line)  # do something with the matching row
Roland Smith