
I am attempting to write a script that joins two compressed files based on a match in the first column. I would like to do this in chunks, because the original code I work with reads CSV files and produces a MemoryError when used with these larger files.

The code that gives the memory error (but works with smaller files):

import csv

f1 = open('file1.csv', 'r')
f2 = open('file2.csv', 'r')
f3 = open('output.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

file2 = list(c2)  # read the whole second file into memory

for file1_row in c1:
    found = False
    results_row = file1_row  # Moved out from nested loop
    for file2_row in file2:
        x = file2_row[1:]
        if file1_row[0] == file2_row[0]:
            results_row.append(x)  # append the rest of the matching row (as a single element)
            found = True
            break
    if not found:
        results_row.append('Not found')
    c3.writerow(results_row)

f1.close()
f2.close()
f3.close()
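For what it's worth, I understand that keeping file2 in a dict keyed on its first column, rather than a list, would at least avoid re-scanning it for every row of file1. A rough sketch of that idea with the same file names (it flattens the matched columns into the output row and still holds all of file2 in memory, so it does not fix the MemoryError by itself):

import csv

# Sketch: the same join, but file2 is held as a dict keyed on its first
# column, so each file1 row needs a single lookup instead of a full scan.
with open('file2.csv', 'r', newline='') as f2:
    lookup = {row[0]: row[1:] for row in csv.reader(f2)}

with open('file1.csv', 'r', newline='') as f1, \
     open('output.csv', 'w', newline='') as f3:
    writer = csv.writer(f3)
    for row in csv.reader(f1):
        match = lookup.get(row[0])
        writer.writerow(row + (match if match is not None else ['Not found']))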

I have tried to adapt this so that it works on a chunk at a time, but I think I have it in the wrong format.

f1 = open('final1.gz', 'r')
f2 = open('final2.gz', 'r')
f3 = open('results.gz.DONE', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

file2 = list(c2)

fileList = ['final_balance.gz', 'final_service.gz']
for fileName in fileList:
    with open(fileName, 'rb') as sourceFile:
        chunk = True
        while chunk:
            chunk = sourceFile.read(bufferSize)
            #file2 = list(c2)  # MemoryError occurs on this line.
        for file1_row in c1:
            found = False
            results_row = file1_row  # Moved out from nested loop
            for file2_row in file2:
                x = file2_row[1:]
                if file1_row[0] == file2_row[0]:
                    results_row.append(x)
                    found = True
                    break
            if not found:
                results_row.append('Not found')
            c3.writerow(results_row)

At this point I am getting the error:

File "function.py", line 20
    file2 = list(c2)
MemoryError

I can't use pandas as I don't have access to it.

  • The code `file2 = list(c2)` doesn't appear anywhere in your second code sample… are you sure you aren't running a different version of your script? –  Feb 15 '18 at 01:19
  • Whoops, sorry duskwuff. I was playing around trying to make it work and had deleted it in my last attempt. I have now edited it, see above. Thanks – Pbree Feb 15 '18 at 01:25
  • It's unclear what your "chunk" reading code is trying to accomplish. For one thing, you need to uncompress the data before creating a `csv.reader` for it (or pass it something that will incrementally read and return whole _rows_, which are usually whole lines terminated with a newline). To do that with a `.gz` file, you need to use something like the [`gzip` module](https://docs.python.org/3/library/gzip.html#module-gzip). – martineau Feb 15 '18 at 01:40
  • Yes, I know it is unclear, but I am unsure how to do it. Say I `gzip.open` the files (or I am working with very large CSVs), then what? How do I read the data in chunks to avoid the memory error? Thanks @Martineau – Pbree Feb 15 '18 at 01:42
  • CSV files in general do not lend themselves to being read and/or processed in arbitrary "chunks" — you need to process them in terms of rows of data. I believe this means you'll need to create a wrapper that uncompresses the gzipped data and only yields whole rows from it. – martineau Feb 15 '18 at 01:46
  • Okay, I'll look into this (I've added a rough sketch of that below). Thanks – Pbree Feb 15 '18 at 01:51
  • My answer to the question [Splitting a CSV file into equal parts?](https://stackoverflow.com/questions/30947682/splitting-a-csv-file-into-equal-parts) contains a useful example of splitting a CSV file into valid "chunks". This general idea would also apply to data from a compressed source (assuming it's first uncompressed with something like `gzip`, of course). – martineau Feb 15 '18 at 02:24
  • Thanks for your assistance @Martineau – Pbree Feb 15 '18 at 02:49
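Update: following martineau's suggestion in the comments, here is a rough sketch of the row-by-row approach I am looking at (untested). `gzip.open` in text mode decompresses on the fly and yields whole lines, so `csv.reader` can stream rows from it without loading either file into a list. It still assumes the rows of `final2.gz` fit in memory as a dict keyed on the first column; if they don't, something like a sort-merge join would be needed instead.

import csv
import gzip

def rows(path):
    # Stream CSV rows out of a gzipped file, one at a time.
    with gzip.open(path, 'rt', newline='') as f:
        for row in csv.reader(f):
            yield row

# Build the lookup table from the second file (must still fit in memory).
lookup = {row[0]: row[1:] for row in rows('final2.gz')}

with open('results.gz.DONE', 'w', newline='') as f3:
    writer = csv.writer(f3)
    for row in rows('final1.gz'):  # one row at a time, never the whole file
        match = lookup.get(row[0])
        writer.writerow(row + (match if match is not None else ['Not found']))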
