4

I have a large file (5 GB) called my_file and a list called my_list. What is the most efficient way to read each line in the file and, whenever an item from my_list matches an item in a line of my_file, append a new entry to a list called matches that combines fields from the matching line with fields from my_list? Here is what I am trying to do:

def calc(my_file, my_list):
    matches = []
    my_file.seek(0,0)
    for i in my_file:
        i = list(i.rstrip('\n').split('\t'))
        for v in my_list:
            if v[1] == i[2]:
                item = v[0], i[1], i[3]
                matches.append(item)
    return matches

here are some lines in my_file:

lion    4    blue    ch3
sheep   1    red     pq2
frog    9    green   xd7
donkey  2    aqua    zr8

here are some items in my_list

intel    yellow
amd      green
msi      aqua    

The desired output, a list of lists, in the above example would be:

[['amd', 9, 'xd7'], ['msi', 2, 'zr8']]

My code currently works, albeit really slowly. Would using a generator or serialization help? Thanks.

drbunsen
    "really slow"? Please post 2 things. The actual time it takes to run and the time required to do `open("my_file","r").read()`. – S.Lott Sep 08 '11 at 10:44
  • @S.Lott: You're right I/O can dominate in this case; though the file is 5G so `for _ in open('my_file'): pass` might be more appropriate here. – jfs Sep 08 '11 at 12:13
  • @J.F. Sebastian: Good point. Without numbers, however, this could simply be a standard case of premature optimization. – S.Lott Sep 08 '11 at 13:23
  • With my current code the program takes about 2 days to complete. I ran the numbers on a version of `my_file` which is about 1/100 of the size and the times were about 1h to run the program and less than a minute to open the file. – drbunsen Sep 08 '11 at 13:57

4 Answers

3

You could build a dictionary for looking up `v`. I added a few further small optimizations:

def calc(my_file, my_list):
    # map the lookup field to its partner once, so each line needs only one lookup
    vd = dict((v[1], v[0]) for v in my_list)

    my_file.seek(0, 0)
    for line in my_file:
        f0, f1, f2, f3 = line[:-1].split('\t')
        v0 = vd.get(f2)
        if v0 is not None:
            yield (v0, f1, f3)

This should be much faster for a large my_list.

Using `.get()` is faster than first checking `if i[2] in vd` and then accessing `vd[i[2]]`, because it performs a single dictionary lookup instead of two.
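A quick `timeit` comparison illustrates the difference (a sketch; the sample dictionary is made up from the question's data, and absolute timings will vary by machine):

```python
import timeit

vd = {"yellow": "intel", "green": "amd", "aqua": "msi"}

# one dictionary lookup per call
t_get = timeit.timeit(lambda: vd.get("green"), number=100000)

# two lookups on a hit: the membership test, then the access
def check_then_access():
    if "green" in vd:
        return vd["green"]

t_in = timeit.timeit(check_then_access, number=100000)

print("get: %.4fs  in+index: %.4fs" % (t_get, t_in))
```

On a hit, `.get()` does one hash lookup where the `in` test plus indexing does two; on a miss, both do one.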

For more speedup beyond these optimizations, I recommend http://www.cython.org

rocksportrocker
    Good point about using `.get()`. I noticed that and updated my answer accordingly, but +1 to you for explicitly mentioning it. – Shawn Chin Sep 08 '11 at 11:07
    btw, `.split()` returns a list, so no need to call `list()` on it. – Shawn Chin Sep 08 '11 at 11:09
    `f1,f2,f3 = line[:-1].split()` will raise a `ValueError` exceptions if the number of items to unpack does not match. – Shawn Chin Sep 08 '11 at 11:13
  • @Shawn: ok, I pasted it from the original and fix it now – rocksportrocker Sep 08 '11 at 11:26
  • @Shawn: you are right, but the authors example has three columns, and tuple unpacking should be a bit faster than access via indices. – rocksportrocker Sep 08 '11 at 11:27
  • The OP's example has 4. I'm not doubting your approach. I'm a fan of defensive programming and I try not to make too many assumptions. As an aside, when using tuple unpacking, specifying `maxsplit` with `split()` will help insure against extra columns. – Shawn Chin Sep 08 '11 at 11:35
  • ouch. counting up to 4 is difficult ;) I fix it. – rocksportrocker Sep 08 '11 at 11:47
  • Thanks rocksportrocker. My list is not super large, but I will try your suggestions and see if this improves my speed. – drbunsen Sep 08 '11 at 13:59
  • This helped a bit. Not a huge performance boost, but I will take what I can get. Since there doesn't appear to be an obvious way to get a huge performance boost I will probably look into multi-threading. Thanks for the help. – drbunsen Sep 08 '11 at 19:53
  • As others mentioned you should measure how fast the reading of the file is. just insert a "continue" below the for statement, and you will see how much time is spent in reading the 5gb large file. – rocksportrocker Sep 09 '11 at 08:05
0

Keep the items in a dictionary rather than a list (let's call it `items`). Now iterate through your file as you're doing, pick out the key to look for (`i[2]`), and check whether it's in `items`.

items would be:

dict(yellow="intel", green="amd", aqua="msi")

So the checking part would be:

if i[2] in items:
    yield [items[i[2]], i[1], i[3]]

Since you're just creating the list and returning it, using a generator might help the memory characteristics of the program, rather than building the whole thing into a list and returning it.
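Putting those pieces together, a minimal runnable sketch of this dict-plus-generator approach (field positions assumed from the question's example) might look like:

```python
def calc(my_file, my_list):
    # build the lookup dict once: second field -> first field
    items = dict((v[1], v[0]) for v in my_list)
    for line in my_file:
        i = line.rstrip('\n').split('\t')
        if i[2] in items:
            # yield one match at a time instead of accumulating a full list
            yield [items[i[2]], i[1], i[3]]

lines = ["frog\t9\tgreen\txd7", "donkey\t2\taqua\tzr8"]
my_list = [("amd", "green"), ("msi", "aqua")]
print(list(calc(lines, my_list)))  # [['amd', '9', 'xd7'], ['msi', '2', 'zr8']]
```

Note that the fields stay strings (`'9'` rather than `9`) unless converted explicitly.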

Noufal Ibrahim
0

There isn't much you can do about the overhead of reading the file in, but based on your example code, you can speed up the matching by storing your list as a dict (with the target field as the key).

Here's an example, with a few extra optimisation tweaks:

mylist = {
    "yellow" : "intel",
    "green" : "amd",
    # ....
}

matches = []
for line in my_file:
    i = line[:-1].split("\t")
    try:  # faster to ask for forgiveness than permission
        matches.append([mylist[i[2]], i[1], i[3]])
    except KeyError:  # a missing dict key raises KeyError
        pass

But again, do note that most of your performance bottleneck will be in the reading of the file, and optimisation at this level may not have a big enough impact on the runtime.
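One rough way to confirm that (a sketch; the file path and lookup dict are placeholders) is to time a bare scan of the file against the full matching loop:

```python
import time

def time_raw_scan(path):
    # lower bound: just iterating over every line, doing nothing
    start = time.time()
    with open(path) as f:
        for _ in f:
            pass
    return time.time() - start

def time_matching(path, mylist):
    # the same scan, plus the per-line split and dict lookup
    start = time.time()
    matches = []
    with open(path) as f:
        for line in f:
            i = line[:-1].split("\t")
            try:
                matches.append([mylist[i[2]], i[1], i[3]])
            except KeyError:
                pass
    return time.time() - start, matches
```

If the two timings come out close, the bottleneck really is I/O and tuning the matching code won't change much.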

Shawn Chin
0

Here's a variation on @rocksportrocker's answer using the `csv` module:

import csv

def calc_csv(lines, lst):
    d = dict((v[1], v[0]) for v in lst) # use dict to speed up membership test
    return ((d[f2], f1, f3)
            for _, f1, f2, f3 in csv.reader(lines, dialect='excel-tab')
            if f2 in d) # assume that intersection is much less than the file

Example:

def test():
    # tab-separated lines, matching the question's format
    my_file = [
        "lion\t4\tblue\tch3",
        "sheep\t1\tred\tpq2",
        "frog\t9\tgreen\txd7",
        "donkey\t2\taqua\tzr8",
    ]

    my_list = [
    ("intel",    "yellow"),
    ("amd",      "green"),
    ("msi",      "aqua"),
    ]    

    res = list(calc_csv(my_file, my_list))
    assert [('amd', '9', 'xd7'), ('msi', '2', 'zr8')] == res


if __name__=="__main__":
   test()    
jfs