
I have a csv file of size 8-12 GB, and I would like to be able to search the first column of the file and retrieve the whole row whenever there is a match. Each time, I need to run the search for a set of more than 100K keys and retrieve the corresponding record for each of them.

There are a few approaches I could choose from:

1) Use a simple grep for each key in the file ==> 100K grep commands.

2) Build a SQL database, load the file into it, and index the first column; then either: a) run one SELECT query per key, or b) insert all the keys into a temporary table and do a set-membership join (a sketch of this approach follows below).

3) Build a hash table, e.g. a Python dictionary, and look it up by key. But then I need to load it into memory every time I run a batch of queries (I don't want it to occupy the memory permanently).

I'm not sure which method is most efficient, or whether there are better options I'm not aware of.
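
For option 2b, a minimal sketch using Python's built-in sqlite3 module could look like the following; the file name foo.csv, the separator ';', the database name lookup.db, and the example keys are assumptions:

import csv
import sqlite3

# Build the database once and index the key column.
conn = sqlite3.connect('lookup.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (key TEXT, line TEXT)')
with open('foo.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    conn.executemany('INSERT INTO records VALUES (?, ?)',
                     ((row[0], ';'.join(row)) for row in reader if row))
conn.execute('CREATE INDEX IF NOT EXISTS idx_key ON records (key)')
conn.commit()

# Bulk lookup: load the keys into a temporary table and join.
keys = ['k1', 'k2']  # hypothetical keys
conn.execute('CREATE TEMP TABLE wanted (key TEXT PRIMARY KEY)')
conn.executemany('INSERT OR IGNORE INTO wanted VALUES (?)', ((k,) for k in keys))
matches = conn.execute(
    'SELECT records.line FROM records JOIN wanted ON records.key = wanted.key'
).fetchall()

The upside of this approach is that the index is built once and lives on disk, so later batches of keys don't need to reload anything into memory.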

Alex

2 Answers


You can read the csv in chunks and iterate over them using pandas. Perhaps this solution can work for you: How to read a 6 GB csv file with pandas
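
A minimal sketch of that idea, assuming the separator is ';', the key column is the first one, and keys is a set of the keys you are looking for:

import pandas as pd

keys = {'k1', 'k2'}  # hypothetical set of keys
matches = []
# Read the large file in manageable chunks instead of all at once.
for chunk in pd.read_csv('foo.csv', sep=';', header=None, dtype=str,
                         chunksize=1_000_000):
    # Keep only the rows whose first column is one of the wanted keys.
    matches.append(chunk[chunk[0].isin(keys)])

result = pd.concat(matches, ignore_index=True)

Tune chunksize to whatever fits comfortably in your RAM.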

mwkrimson

The fastest solution (if you have plenty of RAM) would be to just mmap the whole file.
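
A minimal sketch of that idea; the file name, separator, and keys are assumptions, and note that the keys must be bytes because mmap yields bytes:

import mmap

keys = {b'k1', b'k2'}  # hypothetical keys, as bytes
sep = b';'             # assumed separator

with open('foo.csv', 'rb') as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    for line in iter(mm.readline, b''):
        # Compare only the first column against the key set.
        if line.split(sep, 1)[0] in keys:
            pass  # process the matching row in line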

What would certainly work is to read the file one line at a time:

# keys is an iterable of keys; use a set for fast membership tests.
sep = ';'  # The separator used in the CSV.
matches = []
with open('foo.csv') as f:
    for line in f:
        to = line.find(sep)
        if line[:to] in keys:
            matches.append(line)  # do something with the matching row
Roland Smith