I'm really new to Python, so please bear with me!

I have a folder on my desktop that contains a few CSV files named "File 1.csv", "File 2.csv", and so on. Each file contains a table that looks like this:

    Animal   Level
    Cat      1
    Dog      2
    Bird     3
    Snake    4

But each one of the files has a few differences in the "Animal" column. I wrote the following code that compares only two files at a time and returns the animals that match:

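import pandas as pd

def matchlist(file1, file2):
    new_df = pd.DataFrame()
    file_one = pd.read_csv(file1)
    file_two = pd.read_csv(file2)
    for i in file_one["Animal"]:
        # rows of file_two whose animal equals the current animal of file_one
        df_temp = file_two[file_two["Animal"] == i]
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        new_df = pd.concat([new_df, df_temp])
    return new_df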

But that only compares two files at a time. Is there a way to iterate through all the files in that single folder and return only the animals that also match new_df above?

For example, new_df compares file 1 and file 2. Then, I am looking for code that compares new_df to file 3, file 4, file 5, and so on.

Thanks!

2 Answers

I'm not sure if this is really what you want, and I can't comment on your question yet, so:

This function returns a dataframe with the animals that can be found in all CSV files (the result can be very small). It uses the animal name as the key, so the Level value is not considered.

import pandas as pd
import os, sys

def matchlist_iter(folder_path):

    # get the list of filenames in the folder and discard everything that is not a csv
    files = [file_path for file_path in os.listdir(folder_path) if file_path.endswith('.csv')]

    # init return df with first csv
    df = pd.read_csv(os.path.join(folder_path, files[0]))

    for file_path in files[1:]:
        print('compare: {}'.format(file_path))
        df_other = pd.read_csv(os.path.join(folder_path, file_path))

        # only keep the animals that are in both frames
        df = df[df['Animal'].isin(df_other['Animal'])]

    return df

if __name__ == '__main__':
    matched = matchlist_iter(sys.argv[1])
    print(matched)

I've found a similar question with more answers about matching here: Compare Python Pandas DataFrames for matching rows
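
Since the function above matches on the animal name only, here is a minimal sketch of a row-matching variant, assuming you also want Level to agree. It chains an inner `merge` over the frames with `functools.reduce`. The helper name `matching_rows` and the two small frames are made up for illustration, and duplicated rows in the inputs would be multiplied by the merge.

```python
from functools import reduce

import pandas as pd


def matching_rows(frames):
    """Return the rows (Animal and Level together) that occur in every frame."""
    return reduce(
        lambda left, right: left.merge(right, how="inner", on=["Animal", "Level"]),
        frames,
    )


# Hypothetical data: Bird has a different Level in the second frame.
a = pd.DataFrame({"Animal": ["Cat", "Dog", "Bird"], "Level": [1, 2, 3]})
b = pd.DataFrame({"Animal": ["Cat", "Bird", "Dog"], "Level": [1, 9, 2]})

result = matching_rows([a, b])
print(result)
```

Bird is dropped here because its Level differs between the two frames, whereas the name-only filter would have kept it.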

EDIT: added csv and sample output

csv

Animal,  Level
Cat,      1
Dog,      2
Bird,     3
Snake,    4

csv

Animal, Level
Cat,      1
Parrot,   2
Bird,     3
Horse,    4

output

compare: csv2.csv
  Animal   Level
0    Cat       1
2   Bird       3
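
The same filter can be checked without touching the disk. This is a rough sketch that feeds the two sample tables through `io.StringIO`; `common_animals` and the two string constants are hypothetical names, and the spaces after the commas are left out so the column names parse cleanly.

```python
import io

import pandas as pd

# The two sample tables from above, held in memory instead of in files.
CSV_ONE = "Animal,Level\nCat,1\nDog,2\nBird,3\nSnake,4\n"
CSV_TWO = "Animal,Level\nCat,1\nParrot,2\nBird,3\nHorse,4\n"


def common_animals(csv_texts):
    """Keep the rows of the first table whose Animal appears in every other table."""
    frames = [pd.read_csv(io.StringIO(text)) for text in csv_texts]
    df = frames[0]
    for other in frames[1:]:
        # boolean mask built from df itself, so it lines up with df's index
        df = df[df["Animal"].isin(other["Animal"])]
    return df


matched = common_animals([CSV_ONE, CSV_TWO])
print(matched)
```

The rows keep their index labels from the first table, which is why the result shows 0 and 2 rather than 0 and 1.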
mumbala

I constructed a set of six files in which only the first columns of File 1.csv and File 6.csv are identical.

You need only the first column of each CSV for comparison purposes, so I extract just that column from each file.

>>> import pandas as pd
>>> from pathlib import Path

>>> column_1 = pd.read_csv('File 1.csv', sep=r'\s+')['Animal'].tolist()
>>> column_1
['Cat', 'Dog', 'Bird', 'Snake']
>>> for filename in Path('.').glob('*.csv'):
...     if filename.name == 'File 1.csv':
...         continue
...     next_column = pd.read_csv(filename.name, sep=r'\s+')['Animal'].tolist()
...     if column_1 == next_column:
...         print(filename.name)
... 
File 6.csv

As expected, File 6.csv is the only file found to be identical (in the first column) to File 1.csv.
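
If you would rather call this from a script than type it into the interpreter, the session above could be wrapped roughly like this. It is only a sketch: the function name `files_with_same_animals` and its `reference_name` parameter are my own, and it assumes whitespace-separated files with an "Animal" header, as in the question.

```python
from pathlib import Path

import pandas as pd


def files_with_same_animals(folder, reference_name="File 1.csv"):
    """List the CSV files whose Animal column equals that of the reference file."""
    folder = Path(folder)
    reference = pd.read_csv(folder / reference_name, sep=r"\s+")["Animal"].tolist()

    matches = []
    for path in sorted(folder.glob("*.csv")):
        if path.name == reference_name:
            continue  # do not compare the reference file with itself
        column = pd.read_csv(path, sep=r"\s+")["Animal"].tolist()
        # list equality is order-sensitive: same animals in the same order
        if column == reference:
            matches.append(path.name)
    return matches
```

Calling `files_with_same_animals('.')` in the folder holding the files returns the matching names as a list instead of printing them.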

Bill Bell