
I have two CSV files, about 3 GB each, that I need to compare, storing the differences in a third file.

Python Code:

with open('JUN-01.csv', 'r') as f1:
    file1 = f1.readlines()

with open('JUN-02.csv', 'r') as f2:
    file2 = f2.readlines()

with open('JUN_Updates.csv', 'w') as outFile:
    outFile.write(file1[0])
    for line in file2:
        if line not in file1:
            outFile.write(line)

Time taken to execute: 45 min and still running...

MAK
  • What are you doing next with these files? Wrap the reading in an iterator or work with file chunks; there are plenty of answers on this on SO (see the sketch after these comments). – Evgeny Jun 04 '18 at 10:51
  • Does the order matter? Are the files sorted to begin with? Have you looked at for example https://stackoverflow.com/questions/4717250/extracting-unique-values-between-2-sets-files? – Ilja Everilä Jun 04 '18 at 10:53
  • @IljaEverilä, No! Order does not matter. – MAK Jun 04 '18 at 11:01
  • I know you've tried using Python and tagged as Python, but this might be a job better performed using other tools, such as `tail`, `sort`, and `comm`, if using some posixy OS. – Ilja Everilä Jun 04 '18 at 11:08
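
A minimal sketch of the streaming, set-based approach suggested in the comments above, assuming order does not matter and whole-line matches are enough (note the set still holds every distinct line of the first file in memory):

with open('JUN-01.csv', 'r') as f1:
    header = f1.readline()
    june1_lines = set(f1)          # set membership is O(1), unlike `in` on a list

with open('JUN-02.csv', 'r') as f2, open('JUN_Updates.csv', 'w') as outFile:
    outFile.write(header)
    next(f2)                       # skip the second file's header
    for line in f2:                # iterate lazily instead of calling readlines()
        if line not in june1_lines:
            outFile.write(line)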

2 Answers

Not sure if it's too late already, but here it comes.

I see you are loading two full lists into memory, one per file. If the files are about 3 GB each, that means trying to hold roughly 6 GB in RAM, which will probably push the machine into swap.

Furthermore, even if you succeed in loading both files, `line not in file1` is a linear scan of a list, so you end up attempting roughly L1 × L2 string comparisons (L1 and L2 being the line counts of the two files).

I have run the following code on a 1.2 GB file (3.3 million lines) and it completes in seconds. It hashes each line, so it only keeps a set of L1 32-bit integers in RAM.

The trick is this line, which builds a set() by applying the hashstring function to every line in the file (except the header, which you seem to be adding to the output anyway):

file1 = set(map(hashstring, f1))

Please note I am comparing the file against itself here (f2 opens the same file as f1); substitute your two CSV files accordingly. Let me know if it helps.

from zlib import adler32

def hashstring(s):
    # adler32 gives a 32-bit checksum; collisions are possible but rare
    return adler32(s.encode('utf-8'))

with open('haproxy.log.1', 'r') as f1:
    heading = f1.readline()
    print(f'Heading: {heading}')
    print('Hashing')
    # Keep one 32-bit hash per line instead of the lines themselves
    file1 = set(map(hashstring, f1))
    print(f'Hashed: {len(file1)}')

with open('updates.log', 'w') as outFile:
    count = 0
    outFile.write(heading)
    with open('haproxy.log.1', 'r') as f2:
        for line in f2:
            # Set lookup is O(1), so each line of the second file is checked in constant time
            if hashstring(line) not in file1:
                outFile.write(line)
            count += 1
            if 0 == count % 10000:
                print(f'Checked: {count}')
Lufo0
  • Getting an error: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 234: ordinal not in range(128)` – MAK Jul 26 '18 at 04:16
  • You can try treating the files as binary instead of text, with open(..., 'rb'), and doing the hashing on the raw bytes:

    with open('haproxy.log.1', 'rb') as f1:
        heading = f1.readline()  # read the header as bytes as well
        file1 = set(map(adler32, f1))

    with open('updates.log', 'wb') as outFile:
        outFile.write(heading)
        with open('haproxy.log.1', 'rb') as f2:
            for line in f2:
                if adler32(line) not in file1:
                    outFile.write(line)

    – Lufo0 Jul 26 '18 at 07:24
  • Can we do the same for specified 2 or 3 columns? – MAK Jul 27 '18 at 06:21
  • Sure, that's a great use case for the CSV module included in the [Python Standard Library](https://docs.python.org/3.7/library/csv.html). You can easily modify the script to get just a few columns. As the CSV reader keeps them as strings, you can just concatenate them and keep hashing with Adler32. – Lufo0 Jul 27 '18 at 16:52
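
A minimal sketch of the column-based variant discussed in the comments above; the column indices, the separator, and the hashrow helper are illustrative assumptions, while the file names come from the question:

import csv
from zlib import adler32

KEY_COLUMNS = (0, 2)  # hypothetical indices of the columns to compare on

def hashrow(row, columns=KEY_COLUMNS):
    # Concatenate the selected fields with a separator and hash the result
    key = '\x1f'.join(row[i] for i in columns)
    return adler32(key.encode('utf-8'))

with open('JUN-01.csv', newline='') as f1:
    reader = csv.reader(f1)
    next(reader)                                 # skip the header row
    seen = set(hashrow(row) for row in reader)   # one 32-bit hash per row

with open('JUN-02.csv', newline='') as f2, open('JUN_Updates.csv', 'w', newline='') as out:
    reader = csv.reader(f2)
    writer = csv.writer(out)
    writer.writerow(next(reader))                # copy the header
    for row in reader:
        if hashrow(row) not in seen:
            writer.writerow(row)

Comparing on parsed fields rather than raw lines also makes the check insensitive to quoting differences between the two files.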

Try the following and see if difflib helps with efficiency:

import difflib

with open('JUN_Updates.csv', 'w') as differenceFile:
    with open('JUN-01.csv', 'r') as june1File:
        with open('JUN-02.csv', 'r') as june2File:
            diff = difflib.unified_diff(
                june1File.readlines(),
                june2File.readlines(),
                fromfile='june1File',
                tofile='june2File',
            )

            # Drop the '---'/'+++' header lines, then split the hunks into
            # lines added in JUN-02 and lines removed from JUN-01
            lines = list(diff)[2:]
            added = [line[1:] for line in lines if line[0] == '+']
            removed = [line[1:] for line in lines if line[0] == '-']

            # Only the added lines are written to the difference file
            for line in added:
                differenceFile.write(line)
nandal