I am writing a program to compare all files and directories between two file paths: the files' metadata, content, and internal directories should all match.
File content is compared row by row. The dimensions of the two csv files may or may not be the same, but the approaches below generally handle scenarios where the dimensions differ.
The problem is that processing is too slow.
Some context:
- The two files are identified to be different using filecmp
- This particular problematic csv has ~11k columns and 800 rows.
- My program does not know the data types within the csv beforehand, so defining the dtype for pandas is not an option
- Difflib does an excellent job if the csv file is small, but not for this particular use case
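For reference, the filecmp check mentioned in the first bullet can be sketched as below. This is a minimal sketch, not the exact code from my program: the helper name `files_differ` is hypothetical, and it assumes a byte-for-byte comparison (`shallow=False`) is what is wanted.

```python
import filecmp

def files_differ(path_a, path_b):
    """Return True when the two files do not have identical contents."""
    # shallow=False makes filecmp read and compare the file contents
    # instead of relying only on the os.stat() signature (type, size, mtime).
    return not filecmp.cmp(path_a, path_b, shallow=False)
```

Only files for which this returns True need to go on to the slower row-by-row comparison.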
I've looked at all the related questions on SO and tried the approaches below, but the processing time was terrible. Approach 3 also gives weird results.
Approach 1 (Pandas) - Terrible wait, and I keep getting this warning:
UserWarning: You are merging on int and float columns where the float values are not equal to their int representation.
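import pandas as pd

# f1 and f2 are the paths of the two csv files
df1 = pd.read_csv(f1)
df2 = pd.read_csv(f2)
# Outer-merge on all shared columns; rows found in only one file are flagged
# as 'left_only' or 'right_only' in the 'exists' indicator column
diff = df1.merge(df2, how='outer', indicator='exists').query("exists!='both'")
print(diff)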
Approach 2 (Difflib) - Terrible wait for this huge csv
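import difflib

def CompareUsingDiffLib(file1_lines, file2_lines):
    # Build an HTML side-by-side diff that shows only the changed lines
    h = difflib.HtmlDiff()
    html = h.make_file(file1_lines, file2_lines, context=True, numlines=0)
    # filePath is the output directory, defined elsewhere in my program
    htmlfilepath = filePath + "\\htmlFiles"
    with open(htmlfilepath, 'w') as fh:
        fh.write(html)

# file1 and file2 are the paths of the two csv files
with open(file1) as f, open(file2) as z:
    f1 = f.readlines()
    f2 = z.readlines()
CompareUsingDiffLib(f1, f2)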
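One variant of Approach 2 that may be worth timing is `difflib.unified_diff`, which compares whole lines only and skips the character-level comparison within changed lines that `HtmlDiff` performs. The sketch below is untested on the 11k-column file, so any speed-up is an assumption; the helper name `changed_lines` is hypothetical.

```python
import difflib

def changed_lines(lines_a, lines_b):
    """Return the rows that differ, prefixed with '-' (file 1) or '+' (file 2)."""
    # n=0 suppresses unchanged context lines around each change.
    diff = list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile="file1", tofile="file2",
                                     n=0, lineterm=""))
    # The first two entries are the '--- file1' / '+++ file2' headers;
    # lines starting with '@@' are hunk headers, not data.
    return [line for line in diff[2:] if not line.startswith("@@")]
```

It takes the same `readlines()` lists that are passed to `CompareUsingDiffLib` and returns an empty list when the two inputs are identical.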
Approach 3 (Pure python) - Incorrect results
with open(f1) as f, open(f2) as z:
    file1 = f.readlines()
    file2 = z.readlines()

# print the row number of each line in file 1 that is not in file 2
for line in file1:
    if line not in file2:
        print(file1.index(line))
# This reports every row from row 278 to the last row
# as not being in file 2, which is incorrect.
# I checked using difflib, and using Excel as well;
# no idea why the results are like that.

# running the code below shows the same result as the first loop
for line in file2:
    if line not in file1:
        print(file2.index(line))
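To narrow down Approach 3, a positional comparison that reports the first differing cell of each mismatched row might help show why whole lines fail to match (for example trailing whitespace or number formatting; these are guesses, not confirmed causes). This is a minimal sketch using only the standard library; the function name `first_cell_diffs` is hypothetical, and it assumes row N of file 1 should be compared with row N of file 2.

```python
import csv
from itertools import zip_longest

def first_cell_diffs(path_a, path_b):
    """Yield (row_no, col_no, cell_a, cell_b) for the first mismatch in each differing row."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows = zip_longest(csv.reader(fa), csv.reader(fb))
        for row_no, (row_a, row_b) in enumerate(rows, start=1):
            if row_a == row_b:
                continue
            # A row missing from the shorter file is treated as having no cells.
            cells = zip_longest(row_a or [], row_b or [])
            for col_no, (cell_a, cell_b) in enumerate(cells, start=1):
                if cell_a != cell_b:
                    yield row_no, col_no, cell_a, cell_b
                    break
            else:
                # No differing cell: a blank line on one side, no row on the other.
                yield row_no, None, row_a, row_b
```

Calling `repr()` on the reported cells makes invisible differences such as stray spaces visible. Row and column numbers are 1-based, and a value of `None` means that side has no cell at that position.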
Approach 4 (csv-diff) - Terrible wait
from csv_diff import load_csv, compare

diff = compare(
    load_csv(open("one.csv")),
    load_csv(open("two.csv"))
)
Can anybody please help with either of the following:
- An approach with less processing time
- Debugging Approach 3