I'm struggling to compare (diff) two large CSV files.
- The order of rows doesn't matter
- I don't need to print the differences or anything, just a True or False.
For example:
File 1
a,b,c,d
e,f,g,h
i,j,k,l
File 2
a,b,c,d
i,j,k,l
e,f,g,h
The above should pass the comparison: even though the rows are not in the same order, the contents are identical.
The comparison should fail if the contents differ in any way: a column value doesn't match, a line exists in one file but not the other, and so on.
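In other words, for files small enough to fit in memory, the check I want is just multiset equality of the lines. A minimal sketch (the function name is mine, and this obviously won't scale to my file sizes):

from collections import Counter

def rows_equal_small(path1, path2):
    # Multiset equality of raw lines: row order is ignored,
    # but duplicate rows still have to match in count.
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        return Counter(f1) == Counter(f2)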
The biggest issue is that the files are very large and there is no key column to sort on. The files have 14 to 30 million rows and about 10 to 15 columns; a raw, unsorted dump of the data comes to around a 1 GB CSV file.
Right now I'm sorting both files and then running "diff" on them, using the code below. The problem is that the sort does not always work: with smaller files and fewer rows, the sort-and-diff approach is fine, but it doesn't seem to hold up on very large files.
Sorting also adds significant run time. Ideally I would like to avoid sorting entirely and just compare while ignoring row order, but I don't know how to do that reliably; the multiset-hash idea sketched below is the closest I've come up with, and I'm not sure it's sound.
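The sketch: hash every line and combine the per-line digests with a commutative operation (modular addition), so the result is independent of row order but still sensitive to duplicate counts, and memory use stays constant no matter how large the files are. In theory two different files could produce the same sum, though with a 256-bit digest that seems vanishingly unlikely on non-adversarial data:

import hashlib

def files_equal_unordered(path1, path2):
    def multiset_digest(path):
        # Sum of per-line SHA-256 digests, modulo 2**256.
        # Addition is commutative, so row order doesn't matter,
        # but every duplicate line still contributes once per copy.
        total = 0
        with open(path, 'rb') as f:
            for line in f:
                digest = hashlib.sha256(line).digest()
                total = (total + int.from_bytes(digest, 'big')) % (1 << 256)
        return total

    return multiset_digest(path1) == multiset_digest(path2)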
filecmp, difflib and some other functions I tried all require the files to be pre-sorted, since they compare line by line and so row order matters.
I'm performing an external merge sort in Python right now, but as I said, the sort doesn't reliably work on a very large number of rows, and I'm hoping for a better way to compare.
This is the Python merge sort function:
import os
import heapq
from itertools import islice, cycle
from tempfile import gettempdir

def batch_sort(input, output, key=None, buffer_size=32000, tempdirs=None):
    # Accept either a comma-separated string or a list of temp directories;
    # fall back to the system temp dir if none are given.
    if isinstance(tempdirs, str):
        tempdirs = tempdirs.split(",")
    if not tempdirs:
        tempdirs = [gettempdir()]
    chunks = []
    try:
        with open(input, 'rb', 64 * 1024) as input_file:
            input_iterator = iter(input_file)
            # Read buffer_size lines at a time, sort each chunk in memory,
            # and spill it to a temp file (round-robin across tempdirs).
            for tempdir in cycle(tempdirs):
                current_chunk = list(islice(input_iterator, buffer_size))
                if not current_chunk:
                    break
                current_chunk.sort(key=key)
                output_chunk = open(
                    os.path.join(tempdir, '%06i' % len(chunks)), 'w+b', 64 * 1024)
                chunks.append(output_chunk)
                output_chunk.writelines(current_chunk)
                output_chunk.flush()
                output_chunk.seek(0)
        # k-way merge of the sorted chunk files into the final output;
        # heapq.merge streams the chunks, so memory use stays flat.
        with open(output, 'wb', 64 * 1024) as output_file:
            output_file.writelines(heapq.merge(*chunks, key=key))
    finally:
        # Close and delete every temp chunk file, even on failure.
        for chunk in chunks:
            try:
                chunk.close()
                os.remove(chunk.name)
            except Exception:
                pass
I can call batch_sort() with an input file, an output file, the chunk size, and the temporary directories to use.
Once I've run batch_sort() on both files, I can just "diff file1 file2" on the sorted outputs; the calls are sketched below.
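Concretely (file names are placeholders; filecmp.cmp with shallow=False stands in for running diff and checking for empty output):

import filecmp

batch_sort('file1.csv', 'file1.sorted.csv', buffer_size=32000)
batch_sort('file2.csv', 'file2.sorted.csv', buffer_size=32000)

# Byte-for-byte comparison of the two sorted files, equivalent to
# "diff file1.sorted.csv file2.sorted.csv" reporting no differences.
same = filecmp.cmp('file1.sorted.csv', 'file2.sorted.csv', shallow=False)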
The above works with 25,000 to 75,000 rows, but falls over once we're in excess of 14 million rows.