There are a number of files that need to be compared for differences in their rows; difference not as in subtraction, but as in which values differ for each row of a column. Each file is named with its timestamp and client information, e.g. Timecard-MC1010-19-20220507140122-Reported. All the files share a group of similar columns and fall into two groups, labeled OlderVersion and NewVersion, each containing a large number of files. Because the files are named by their timestamp, when the output shows that a record differs I need to display the timestamp of the file along with the rows that differ. For example, say I have two folders, Folder A and Folder B, with two files in A and one in B. After running the program it should display the following:

Folder A had a record in file "FolderA_Timecard-MC1010-19-20220507140122-Reported" with row number 2 that was not present in Folder B. Record is: MC1010-19 21044174 58.55 12341
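One way to pull the timestamp out of a filename of this shape is a small regex helper (a sketch only; the function name is mine, and it assumes the timestamp is always the 14-digit run between hyphens):

```python
import re

def extract_timestamp(filename):
    """Return the 14-digit timestamp embedded in names like
    'FolderA_Timecard-MC1010-19-20220507140122-Reported', or None."""
    match = re.search(r'-(\d{14})-', filename)
    return match.group(1) if match else None
```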
I have managed to show the differences between the files in a concatenated dataframe using the code below, but I don't know how to extract the highlighted differences, other than the NaN fields and the timestamp fields, from the dataframe.
import pandas as pd
import os
path1 = "C:\\Users\\Bilal\\Python\\Task1\\NewVersionFiles\\"
path2 = "C:\\Users\\Bilal\\Python\\Task1\\OlderVersionFiles\\"
files1 = [os.path.join(path1, x) for x in os.listdir(path1) if x.endswith('.csv')]
files2 = [os.path.join(path2, x) for x in os.listdir(path2) if x.endswith('.csv')]
li1= []
li2 = []
for filename in files1:
    df1 = pd.read_csv(filename, index_col=None, header=0)
    li1.append(df1)
frame1 = pd.concat(li1, axis=0, ignore_index=True)
for filename in files2:
    df2 = pd.read_csv(filename, index_col=None, header=0)
    li2.append(df2)
frame2 = pd.concat(li2, axis=0, ignore_index=True)
dictionaryObject3 = frame1.to_dict()
dictionaryObject4 = frame2.to_dict()
def nested_dict_pairs_iterator(dict_obj):
    '''Accept a nested dictionary as argument
    and iterate over all values of the nested dictionaries.
    '''
    # Iterate over all key-value pairs of the dict argument
    for key, value in dict_obj.items():
        if isinstance(value, dict):
            # If the value is a dict, recurse into it
            for pair in nested_dict_pairs_iterator(value):
                yield (key, *pair)
        else:
            # Otherwise yield the (key, value) pair
            yield (key, value)
for pair1 in nested_dict_pairs_iterator(dictionaryObject3):
    for pair2 in nested_dict_pairs_iterator(dictionaryObject4):
        if pair1[0] == pair2[0]:
            if pair1[1] == pair2[1]:
                # check that the third value is not NaN, otherwise only NaN fields are displayed
                if not pd.isna(pair1[2]):
                    if pair1[2] != pair2[2]:
                        print(pair1)
                        print(pair2)
This code displays results such as ('Legal', 1066, 'MC1005-3'), where Legal is the column name, 1066 is the row index, and 'MC1005-3' is the value at that row and column, followed by the matching pair from the other frame, ('Legal', 1066, 'MC1005-2'). But it is slow and I am not even sure it is correct. It takes an awfully long time to display all the differing values. Is there a different approach? I feel like I am doing something wrong, and I can't even get past the Legal column: there are so many fields in the dictionaries that printing every difference for Legal takes so long that nothing is shown for the other columns.
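For what it's worth, one vectorized alternative I have seen suggested (a sketch only, with made-up column names rather than the real Timecard schema, and assuming both frames share the same columns) is an outer merge with indicator=True, which flags rows present on only one side instead of comparing cell by cell:

```python
import pandas as pd

# Toy frames standing in for frame1 (NewVersion) and frame2 (OlderVersion);
# the column names below are illustrative, not the real Timecard schema.
frame1 = pd.DataFrame({'Legal': ['MC1005-3', 'MC1010-19'], 'Hours': [40.0, 58.55]})
frame2 = pd.DataFrame({'Legal': ['MC1005-2', 'MC1010-19'], 'Hours': [40.0, 58.55]})

# An outer merge on all shared columns adds a _merge column that says which
# side each row came from: 'left_only', 'right_only', or 'both'.
diff = frame1.merge(frame2, how='outer', indicator=True)
left_only = diff[diff['_merge'] == 'left_only']    # rows only in frame1
right_only = diff[diff['_merge'] == 'right_only']  # rows only in frame2
```

If each row were also tagged with its source filename before concatenating (e.g. an extra column set from os.path.basename(filename) inside the read loop), the left_only/right_only rows would carry the timestamped filename needed for the report.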