
I'm struggling with comparing (diff'ing) 2 large CSV files.

  • The order of rows doesn't matter
  • I don't need to print the differences or anything, just a True or False.

For example:

File 1

a,b,c,d
e,f,g,h
i,j,k,l

File 2

a,b,c,d
i,j,k,l
e,f,g,h

The above should pass the comparison: even though the rows are not in the same order, the contents are identical.

Comparison should fail if the contents differ in any way: column values don't match, a line exists in one file but not the other, etc.

The biggest issue I have is that the files are very large and there is no key column to sort on. The files have 14 to 30 million rows and about 10 to 15 columns; a raw, unsorted dump of the data comes to around a 1 GB CSV file.

Right now I'm trying to sort the files and "diff" them using the code below. The problem is that the sort does not always work: with smaller files and fewer rows, the sort & diff works, but it doesn't seem to work with very large files.

Also, sorting adds significant processing time; ideally I would like to avoid sorting and just compare while ignoring sort order, but I don't know how to do that.

filecmp, difflib and some other functions I tried all require pre-sorted files.

I'm performing a Python merge sort right now, but as I said, the sorting doesn't necessarily work on a very large number of rows, and I'm hoping for a better way to compare.

This is the Python merge sort function:

# Module-level imports this method relies on
import os
from heapq import merge
from itertools import cycle, islice
from tempfile import gettempdir

def batch_sort(self, input, output, key=None, buffer_size=32000, tempdirs=None):
    # Accept the temp dirs as a comma-separated string, a list, or None
    if isinstance(tempdirs, str):
        tempdirs = tempdirs.split(",")
    if not tempdirs:
        tempdirs = [gettempdir()]

    chunks = []
    try:
        # Read buffer_size lines at a time, sort each chunk in memory,
        # and spill it to a temp file (round-robin over the temp dirs)
        with open(input, 'rb', 64 * 1024) as input_file:
            input_iterator = iter(input_file)
            for tempdir in cycle(tempdirs):
                current_chunk = list(islice(input_iterator, buffer_size))
                if not current_chunk:
                    break
                current_chunk.sort(key=key)
                output_chunk = open(os.path.join(tempdir, '%06i' % len(chunks)), 'w+b', 64 * 1024)
                chunks.append(output_chunk)
                output_chunk.writelines(current_chunk)
                output_chunk.flush()
                output_chunk.seek(0)
        # k-way merge of the sorted chunk files into the output file.
        # The original called a custom self.merge(key, *chunks) helper
        # (not shown); heapq.merge is the standard-library equivalent.
        with open(output, 'wb', 64 * 1024) as output_file:
            output_file.writelines(merge(*chunks, key=key))
    finally:
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except Exception:
                pass

I can call batch_sort(), giving it an input file, an output file, the chunk size, and the temporary directory to use.

Once I perform batch_sort() on both files, I can just "diff file1 file2".
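A minimal sketch of that workflow (the `sorter` object and the file names here are hypothetical; `batch_sort` is the method above):

    # Hypothetical usage: `sorter` is an instance of the class holding batch_sort
    sorter.batch_sort('file1.csv', 'file1.sorted.csv', buffer_size=32000, tempdirs='/tmp')
    sorter.batch_sort('file2.csv', 'file2.sorted.csv', buffer_size=32000, tempdirs='/tmp')

    # then, from the shell:
    #   diff file1.sorted.csv file2.sorted.csv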

This approach works with 25,000 to 75,000 rows, but not when we're talking about files in excess of 14 million rows.

luci5r

3 Answers


Just use a set and add each row. Compare the sets at the end:

def compare(file1, file2):
    with open(file1) as fh1, open(file2) as fh2:
        left = {line for line in fh1}
        right = {line for line in fh2}

    return left == right

If you're really concerned about size, you can consume the first file into a set and then short-circuit the moment a line from the second file isn't found in it:

def compare(file1, file2):
    with open(file1) as fh:
        left = {line for line in fh}

    right = set()

    with open(file2) as fh:
        for line in fh:
            if line not in left:
                # file2 has a line file1 doesn't -- bail out early
                return False

            right.add(line)

    # still needed: file1 may have lines that never appeared in file2
    return left == right

Edit

Since you don't care about showing the differences, only about checking whether the files are the same, you can use a numeric hash of each row and compare the sums for each file:

def are_equivalent(a, b):
    with open(a) as fh, open(b) as gh:
        x = sum(hash(line) for line in fh)
        y = sum(hash(line) for line in gh)

    return x == y

This way you aren't paying for the storage of the lines in any data structure.

C.Nivs
  • I might be tempted to use `hashlib.sha256()` rather than `hash()` as I believe the latter is *only* 64bit – JonSG Aug 28 '23 at 17:08
  • @JonSG I've modified this to just add the lines themselves to the set. You could use whatever hashing algorithm you want here, but python's internals should be fine – C.Nivs Aug 28 '23 at 17:11
  • Agreed, I don't mean to suggest this is a bad solution. I think it is exactly right. Only that with 14 million rows the chance of an inadvertent collision might be unacceptably high. Not knowing what the chance might be, I found this [Probability of 64bit Hash Code Collisions](https://stackoverflow.com/a/22029380/218663) suggesting my fears might be misguided. – JonSG Aug 28 '23 at 17:20
  • For files with size 1.6G (one is the `shuf` version of the other) your version takes ~10s (using \time) while `\time diff -q <(sort file1.csv) <(sort file2.csv)` is between 3s and 4s. That was using python3.11.4. – rochard4u Aug 28 '23 at 17:28
  • @C.Nivs - This works really well!! I think this is a much better solution for my use case. I'm going to try a few different volumes of data and see how that goes, Thanks! – luci5r Aug 28 '23 at 18:24
  • @C.Nivs - Quick question, using one of the set methods, is it possible to print the differences? – luci5r Sep 01 '23 at 00:28
  • @luci5r yes, you could use `set.symmetric_difference` – C.Nivs Sep 01 '23 at 15:05
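For reference, a minimal sketch of that `set.symmetric_difference` suggestion (the function and file names are hypothetical); it yields the lines that appear in exactly one of the two files:

    def show_differences(file1, file2):
        with open(file1) as fh1, open(file2) as fh2:
            left, right = set(fh1), set(fh2)

        # lines present in exactly one of the two files
        for line in left.symmetric_difference(right):
            print(line, end='')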

It's unlikely to ever work using lowest-common-denominator algorithms. Big-O analysis tells you that these operations are at least O(n) to read and compare, and O(n log n) to sort, where n is the number of rows in the file.

"There is no key column to sort on" – perhaps there's no natural key in the data, but you should be able to generate a hash string for each row and sort on that. You don't need all the data if you can guarantee that the hash is unique for each row; you can then operate on the list of hashes.
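A minimal sketch of that idea, using `hashlib.sha256` as the row hash (the function and file names are hypothetical); the sorted digest lists compare equal exactly when the two files contain the same multiset of rows, barring a hash collision:

    import hashlib

    def row_hashes(path):
        # one fixed-size 32-byte digest per row instead of the full row data
        with open(path, 'rb') as fh:
            return sorted(hashlib.sha256(line).digest() for line in fh)

    def files_match(file1, file2):
        return row_hashes(file1) == row_hashes(file2)

This is still an in-memory sort, but the working set shrinks to a fixed 32 bytes per row rather than the full row data.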

The biggest problem that I see is the requirement to have both in memory at the same time. Physics says you can't fit X into a Y-capacity container when X > Y.

duffymo

If you're open to using something like numpy:

import numpy as np

a1 = np.genfromtxt(file1, delimiter=',', dtype=str)
a2 = np.genfromtxt(file2, delimiter=',', dtype=str)

# Sort whole rows lexicographically. Note: np.sort(a, axis=0) would sort
# each column independently and scramble the rows, so sort by row instead.
a1 = a1[np.lexsort(a1.T[::-1])]
a2 = a2[np.lexsort(a2.T[::-1])]

isequal = np.array_equal(a1, a2)
print(isequal)
True
BeRT2me