import pandas as pd
import glob

# Read a single file from the directory
dataset = pd.read_csv('masterfeedproduction-EURNA_2016-06-27.csv', sep=',')
datasets_cols = ['transactionID', 'gvkey', 'companyName']

df = dataset.transactionID
df.shape
df.loc[df.duplicated()]  # rows whose transactionID appeared earlier in the file

This returns the duplicates in the selected file, displaying the row number and the transactionID, so this part is correct.
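As a small illustration (the transactionID values here are invented), duplicated() flags every occurrence after the first, keeping the original row index:

import pandas as pd

s = pd.Series([101, 102, 101, 103, 102], name='transactionID')
print(s.loc[s.duplicated()])
# 2    101
# 4    102
# Name: transactionID, dtype: int64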

target_directory = r'C:\Users\nikol\Downloads\fullDailyDeltas\fullDailyDeltas'
file_list = glob.glob(target_directory + "/*.csv")

df_result = df.loc[df.duplicated()]

for file in file_list:
    return(df_result)  # SyntaxError: 'return' is only valid inside a function

Here I am stuck.

target_directory = r'C:\Users\nikol\Downloads\fullDailyDeltas\fullDailyDeltas'
file_list = glob.glob(target_directory + "/*.csv")


for file in file_list:
    dataset = pd.read_csv(file)
    df = dataset.transactionID
    duplicated = df.loc[df.duplicated()]  # duplicates within this file only
    if not duplicated.empty:
        print(file)
        print(duplicated)
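If the goal is to keep the results rather than just print them, here is a minimal sketch of one way to collect the per-file duplicates into a single DataFrame (the results list and the source_file column are my own additions, not part of the original code):

results = []
for file in file_list:
    dataset = pd.read_csv(file)
    dup = dataset.loc[dataset.transactionID.duplicated(), ['transactionID']].copy()
    if not dup.empty:
        dup['source_file'] = file  # record which file each duplicate came from
        results.append(dup)

# original row numbers are preserved in the index
all_duplicates = pd.concat(results) if results else pd.DataFrame()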

1 Answer


Have a look at the glob module.

import pandas as pd
import glob

def your_function(file):   
    # put your df processing logic here
    return df_result
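For the duplicate-finding case in the question, the body might look roughly like this (a sketch; find_duplicates is just an illustrative name):

def find_duplicates(file):
    dataset = pd.read_csv(file)
    df = dataset.transactionID
    # duplicates within this one file, as a DataFrame so it can be written with to_csv
    return df.loc[df.duplicated()].to_frame()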

Step 1 - Create list of files in directory

target_directory = r'Path/to/your/dir'
file_list = glob.glob(target_directory + "/*.csv") 
# Include slash or it will search in the wrong directory!!
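As an aside (my addition, not part of the original answer), building the pattern with os.path.join sidesteps the slash concern entirely:

import os
file_list = glob.glob(os.path.join(target_directory, '*.csv'))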

Step 2 - Loop through files in list

for file in file_list:                # Loop files
    df_result = your_function(file)   # Put your logic into a separate function
    new_filename = file.replace('.csv', '_processed.csv')
    df_result.to_csv(new_filename, index=False)
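One caveat worth noting (my observation, not part of the original answer): the output files also end in .csv and land in the same directory, so a second run of the glob would pick them up too. A small sketch of one way around that, writing into a separate output folder:

import os

output_directory = os.path.join(target_directory, 'processed')  # hypothetical output folder
os.makedirs(output_directory, exist_ok=True)

for file in file_list:
    df_result = your_function(file)
    new_name = os.path.basename(file).replace('.csv', '_processed.csv')
    df_result.to_csv(os.path.join(output_directory, new_name), index=False)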


Had you included the code showing your own attempts in the question, it would have been answered within seconds.

– sudonym
  • Thanks for your answer, appreciated. I'm not experienced in Python, so let me try to be more clear. I have defined dataset as a single file, with the first column only: dataset = pd.read_csv('masterfeedproduction-EURNA_2016-06-27.csv', sep=',') . Then: df = dataset.transactionID ; df.loc[df.duplicated()] # this returns the duplicates in this particular file, returning the row number and the transactionID. So what is left for me is to loop through all 700 files separately and return the transactionIDs which are duplicated within their respective file only – Kaloyan Kolev Jun 12 '18 at 11:28
  • import pandas as pd import glob dataset = pd.read_csv('masterfeedproduction-EURNA_2016-06-27.csv',sep = ',',delimiter = None) datasets_cols = ['transactionID','gvkey','companyName'] df= dataset.transactionID df.shape df.loc[df.duplicated()] target_directory = r'C:\Users\nikol\Downloads\fullDailyDeltas\fullDailyDeltas' file_list = glob.glob(target_directory + "/*.csv") df_result = df.loc[df.duplicated()] for file in file_list: return(df_result) – Kaloyan Kolev Jun 12 '18 at 11:33
  • You need to define the function. Look at the def part. Put all your df logic in the lines below def. Read up on functions. – sudonym Jun 12 '18 at 12:23
  • target_directory = r'C:\Users\nikol\Downloads\fullDailyDeltas\fullDailyDeltas' file_list = glob.glob(target_directory + "/*.csv") for file in file_list: dataset = pd.read_csv(file) df = dataset.transactionID duplicated = df.loc[df.duplicated()] if duplicated.empty == False: print(file) print(duplicated) – Kaloyan Kolev Jun 12 '18 at 14:53
  • Don't put code in the comments. Edit your question and format it properly. – sudonym Jun 13 '18 at 00:27