There are a number of files that need to be compared for differences in their rows; difference not as in subtraction, but as in which values differ for each row of a column. Each file is named with its timestamp and client information, e.g. Timecard-MC1010-19-20220507140122-Reported. All the files share a group of similar columns and fall into two groups, labeled OlderVersion and NewVersion, each containing a large number of files. Because the files are named by their timestamp, when the output shows that a record differs I need to display the timestamp of the file along with the rows that differ. For example, say I have two folders, Folder A and Folder B, with two files in A and one in B. After running the program it should display the following:

Folder A had a record in file "FolderA_Timecard-MC1010-19-20220507140122-Reported" with row number 2 that was not present in Folder B. Record is: MC1010-19 21044174 58.55 12341
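One way to pull the timestamp out of a filename of this shape is a small regex helper (a sketch only; the function name is mine, and it assumes the timestamp is always the 14-digit run between hyphens):

```python
import re

def extract_timestamp(filename):
    """Return the 14-digit timestamp embedded in names like
    'FolderA_Timecard-MC1010-19-20220507140122-Reported', or None."""
    match = re.search(r'-(\d{14})-', filename)
    return match.group(1) if match else None
```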
I have managed to show the differences between the files in a concatenated dataframe using the code below, but I don't know how to extract the highlighted differences, other than the NaN fields and the timestamp fields, from the dataframe.
import pandas as pd
import os
path1 = "C:\\Users\\Bilal\\Python\\Task1\\NewVersionFiles\\"
path2 = "C:\\Users\\Bilal\\Python\\Task1\\OlderVersionFiles\\"
files1 = [os.path.join(path1, x) for x in os.listdir(path1) if x.endswith('.csv')]
files2 = [os.path.join(path2, x) for x in os.listdir(path2) if x.endswith('.csv')]
li1= []
li2 = []
for filename in files1:
    df1 = pd.read_csv(filename, index_col=None, header=0)
    li1.append(df1)
frame1 = pd.concat(li1, axis=0, ignore_index=True)
for filename in files2:
    df2 = pd.read_csv(filename, index_col=None, header=0)
    li2.append(df2)
frame2 = pd.concat(li2, axis=0, ignore_index=True)
dictionaryObject3 = frame1.to_dict()
dictionaryObject4 = frame2.to_dict()
def nested_dict_pairs_iterator(dict_obj):
    '''Accept a nested dictionary as argument
    and iterate over all values of the nested dictionaries.
    '''
    # Iterate over all key-value pairs of the dict argument
    for key, value in dict_obj.items():
        if isinstance(value, dict):
            # If the value is a dict, recurse into it
            for pair in nested_dict_pairs_iterator(value):
                yield (key, *pair)
        else:
            # Otherwise yield the (key, value) pair
            yield (key, value)
for pair1 in nested_dict_pairs_iterator(dictionaryObject3):
    for pair2 in nested_dict_pairs_iterator(dictionaryObject4):
        if pair1[0] == pair2[0]:
            if pair1[1] == pair2[1]:
                # check that the third value is not NaN, otherwise only NaN fields are displayed
                if not pd.isna(pair1[2]):
                    if pair1[2] != pair2[2]:
                        print(pair1)
                        print(pair2)
This code displays results such as ('Legal', 1066, 'MC1005-3'), where Legal is the column name, 1066 is the row index, and 'MC1005-3' is the value at that row and column, followed by the matching pair from the other frame, ('Legal', 1066, 'MC1005-2'). But it is slow and I am not even sure it is correct. It takes an awfully long time to display all the differing values. Is there a different approach? I feel like I am doing something wrong, and I can't even get past the Legal column: there are so many fields in the dictionaries that printing every difference for Legal takes so long that nothing is shown for the other columns.
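For what it's worth, one vectorized alternative I have seen suggested (a sketch only, with made-up column names rather than the real Timecard schema, and assuming both frames share the same columns) is an outer merge with indicator=True, which flags rows present on only one side instead of comparing cell by cell:

```python
import pandas as pd

# Toy frames standing in for frame1 (NewVersion) and frame2 (OlderVersion);
# the column names below are illustrative, not the real Timecard schema.
frame1 = pd.DataFrame({'Legal': ['MC1005-3', 'MC1010-19'], 'Hours': [40.0, 58.55]})
frame2 = pd.DataFrame({'Legal': ['MC1005-2', 'MC1010-19'], 'Hours': [40.0, 58.55]})

# An outer merge on all shared columns adds a _merge column that says which
# side each row came from: 'left_only', 'right_only', or 'both'.
diff = frame1.merge(frame2, how='outer', indicator=True)
left_only = diff[diff['_merge'] == 'left_only']    # rows only in frame1
right_only = diff[diff['_merge'] == 'right_only']  # rows only in frame2
```

If each row were also tagged with its source filename before concatenating (e.g. an extra column set from os.path.basename(filename) inside the read loop), the left_only/right_only rows would carry the timestamped filename needed for the report.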