The following code works for files of up to about 3 million records, but beyond that I run out of memory, because I read all the data into lists and then loop over the lists to find matches.

From previous postings, I have gathered that I should process one line at a time, but I cannot find any posts on how to take a line at a time from a CSV file and run it through two nested loops, as in my code below.

Any help would be greatly appreciated. Thank you in advance.

import csv

# open two csv files and read into lists lsts and lstl
with open('small.csv') as s:
    sml = csv.reader(s)
    lsts = [tuple(row) for row in sml]

with open('large.csv') as l:
    lrg = csv.reader(l)
    lstl = [tuple(row) for row in lrg] # can be too large for memory

# find a match and print 
for rows in lsts:
    for rowl in lstl:

        if rowl[7] != rows[0]: # if no match continue
            continue
        else: 
            print(rowl[7], rowl[2]) # when matched print data required from large file

1 Answer

Assuming that you are only interested in one column of the small CSV, you can turn that column into a set and compare row by row with the large CSV. The set membership test replaces the outer loop completely:

import csv

with open('small.csv') as s:
    sml = csv.reader(s)
    sml_set = set(row[0] for row in sml)

with open('large.csv') as l:
    lrg = csv.reader(l)
    for row in lrg:
        if row[7] in sml_set:
            print(row[7], row[2])

You could turn this into a generator:

def row_matches():
    with open('small.csv') as s:
        sml = csv.reader(s)
        sml_set = set(row[0] for row in sml)

    with open('large.csv') as l:
        lrg = csv.reader(l)
        for row in lrg:
            if row[7] in sml_set:
                yield row[7], row[2]
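
A generator keeps memory usage flat: rows from the large file are produced one at a time, and only the small file's key set stays resident. A minimal usage sketch:

for key, value in row_matches():
    print(key, value)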
– tdelaney
  • Can using [yield](https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python) also work for reading a large file? – May 09 '20 at 15:52
  • Yes, but it needs to be in a function. I'll post an example. – tdelaney May 09 '20 at 15:53
  • Thanks for this solution. It is a good workaround. I guess I could also do this with a single-column list or tuple (I'll try that). I was rather hoping for a doubly nested loop answer, since later I may have to deal with two large files. I realise this is my fault for stating my question as one small and one large file, so I got what I asked for, but if anyone knows how to do this for two large files without clogging up memory, please share. – David Cowburn May 10 '20 at 16:23
  • The larger the files, the worse double for loops scale. Sets are lookup-efficient, and they can be space-efficient if you only want a small part of each row or if there are duplicates in the data. If the small file grows large, you could process it in chunks at a time (see the sketch below). From your example, all I know is that you want `rows[0]`, so there is no reason to keep the rest of the row. Beyond that, a set instead of a list is really about lookup speed: the membership test replaces one of the for loops, because it is effectively a loop implemented with a hash table in the C code behind sets. – tdelaney May 10 '20 at 16:43
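
A minimal sketch of the chunking idea from the last comment, for the case where both files are too large to hold in memory. The function name, chunk size, and the repeated passes over the large file are illustrative assumptions, not part of the original answer:

import csv
from itertools import islice

def matches_chunked(small_path, large_path, chunk_size=100000):
    # Read the small file's keys in fixed-size chunks; each chunk
    # triggers one full pass over the large file, so memory stays
    # bounded at roughly chunk_size keys.
    with open(small_path) as s:
        sml = csv.reader(s)
        while True:
            chunk = set(row[0] for row in islice(sml, chunk_size))
            if not chunk:
                break
            with open(large_path) as l:
                for row in csv.reader(l):
                    if row[7] in chunk:
                        yield row[7], row[2]

Each chunk costs one extra pass over the large file, so this trades running time for memory.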