
I have two CSV files, about 3 GB each, that I need to compare, storing the differences in a third file.

Python Code:

with open('JUN-01.csv', 'r') as f1:
    file1 = f1.readlines()

with open('JUN-02.csv', 'r') as f2:
    file2 = f2.readlines()

with open('JUN_Updates.csv', 'w') as outFile:
    outFile.write(file1[0])
    for line in file2:
        if line not in file1:
            outFile.write(line)

Time taken to execute: 45 min and still running...

MAK
  • What are you doing next with these files? Wrap the reading in an iterator or work with file chunks; there are plenty of answers on this on SO (see the sketch after these comments). – Evgeny Jun 04 '18 at 10:51
  • Does the order matter? Are the files sorted to begin with? Have you looked at for example https://stackoverflow.com/questions/4717250/extracting-unique-values-between-2-sets-files? – Ilja Everilä Jun 04 '18 at 10:53
  • @IljaEverilä, No! Order does not matter. – MAK Jun 04 '18 at 11:01
  • I know you've tried using Python and tagged as Python, but this might be a job better performed using other tools, such as `tail`, `sort`, and `comm`, if using some posixy OS. – Ilja Everilä Jun 04 '18 at 11:08
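
A minimal sketch of the streaming, set-based approach suggested in the comments above, assuming order does not matter and whole-line matches are enough (note the set still holds every distinct line of the first file in memory):

with open('JUN-01.csv', 'r') as f1:
    header = f1.readline()
    june1_lines = set(f1)          # set membership is O(1), unlike `in` on a list

with open('JUN-02.csv', 'r') as f2, open('JUN_Updates.csv', 'w') as outFile:
    outFile.write(header)
    next(f2)                       # skip the second file's header
    for line in f2:                # iterate lazily instead of calling readlines()
        if line not in june1_lines:
            outFile.write(line)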

2 Answers

Not sure if it's too late already, but here it comes.

I see you are loading two full lists into memory, one per file. If the files are about 3 GB each, that means trying to hold roughly 6 GB in RAM, which will probably push the machine into swap.

Furthermore, even if you succeed in loading both files, `line not in file1` is a linear scan of a list, so you end up attempting roughly L1 × L2 string comparisons (L1 and L2 being the line counts of the two files).

I have run the following code on a 1.2 GB file (3.3 million lines) and it completes in seconds. It hashes each line, so it only keeps a set of L1 32-bit integers in RAM.

The trick is this line, which builds a set() by applying the hashstring function to every line in the file (except the header, which you seem to be adding to the output anyway):

file1 = set(map(hashstring, f1))

Please note I am comparing the file against itself here (f2 opens the same file as f1); substitute your two CSV files accordingly. Let me know if it helps.

from zlib import adler32

def hashstring(s):
    # adler32 gives a 32-bit checksum; collisions are possible but rare
    return adler32(s.encode('utf-8'))

with open('haproxy.log.1', 'r') as f1:
    heading = f1.readline()
    print(f'Heading: {heading}')
    print('Hashing')
    # Keep one 32-bit hash per line instead of the lines themselves
    file1 = set(map(hashstring, f1))
    print(f'Hashed: {len(file1)}')

with open('updates.log', 'w') as outFile:
    count = 0
    outFile.write(heading)
    with open('haproxy.log.1', 'r') as f2:
        for line in f2:
            # Set lookup is O(1), so each line of the second file is checked in constant time
            if hashstring(line) not in file1:
                outFile.write(line)
            count += 1
            if 0 == count % 10000:
                print(f'Checked: {count}')
Lufo0
  • Getting an error: `UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 234: ordinal not in range(128)` – MAK Jul 26 '18 at 04:16
  • You can try treating the files as binary instead of text, with open(..., 'rb'), and doing the hashing on the raw bytes:

    with open('haproxy.log.1', 'rb') as f1:
        heading = f1.readline()  # read the header as bytes as well
        file1 = set(map(adler32, f1))

    with open('updates.log', 'wb') as outFile:
        outFile.write(heading)
        with open('haproxy.log.1', 'rb') as f2:
            for line in f2:
                if adler32(line) not in file1:
                    outFile.write(line)

    – Lufo0 Jul 26 '18 at 07:24
  • Can we do the same for specified 2 or 3 columns? – MAK Jul 27 '18 at 06:21
  • Sure, that's a great use case for the CSV module included in the [Python Standard Library](https://docs.python.org/3.7/library/csv.html). You can easily modify the script to get just a few columns. As the CSV reader keeps them as strings, you can just concatenate them and keep hashing with Adler32. – Lufo0 Jul 27 '18 at 16:52
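
A minimal sketch of the column-based variant discussed in the comments above; the column indices, the separator, and the hashrow helper are illustrative assumptions, while the file names come from the question:

import csv
from zlib import adler32

KEY_COLUMNS = (0, 2)  # hypothetical indices of the columns to compare on

def hashrow(row, columns=KEY_COLUMNS):
    # Concatenate the selected fields with a separator and hash the result
    key = '\x1f'.join(row[i] for i in columns)
    return adler32(key.encode('utf-8'))

with open('JUN-01.csv', newline='') as f1:
    reader = csv.reader(f1)
    next(reader)                                 # skip the header row
    seen = set(hashrow(row) for row in reader)   # one 32-bit hash per row

with open('JUN-02.csv', newline='') as f2, open('JUN_Updates.csv', 'w', newline='') as out:
    reader = csv.reader(f2)
    writer = csv.writer(out)
    writer.writerow(next(reader))                # copy the header
    for row in reader:
        if hashrow(row) not in seen:
            writer.writerow(row)

Comparing on parsed fields rather than raw lines also makes the check insensitive to quoting differences between the two files.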

Try the following and see if difflib helps with efficiency:

import difflib

with open('JUN_Updates.csv', 'w') as differenceFile:
    with open('JUN-01.csv', 'r') as june1File:
        with open('JUN-02.csv', 'r') as june2File:
            diff = difflib.unified_diff(
                june1File.readlines(),
                june2File.readlines(),
                fromfile='june1File',
                tofile='june2File',
            )

            # Drop the '---'/'+++' header lines, then split the hunks into
            # lines added in JUN-02 and lines removed from JUN-01
            lines = list(diff)[2:]
            added = [line[1:] for line in lines if line[0] == '+']
            removed = [line[1:] for line in lines if line[0] == '-']

            # Only the added lines are written to the difference file
            for line in added:
                differenceFile.write(line)
nandal