Issue with python script to remove duplicate rows from csv files in a folder

Question

I am a beginner at Python. I am writing a script to:

Read all csv files in a folder
Drop duplicate rows within a .csv file by reading one csv file at a time
Write to *_new.csv file

The code :

import csv
import os
import pandas as pd

path = "/Users/<mylocaldir>/Documents/Data/"
file_list = os.listdir(path)
for file in file_list:
    fullpath = os.path.join(path, file)
    data = pd.read_csv(fullpath)
    newdata = data.drop_duplicates()
    newfile = fullpath.replace(".csv","_new.csv")
    newdata.to_csv ("newfile", index=True, header=True)

When I run the script, no error is displayed, but the *_new.csv file is not created.

Can anyone help me resolve this issue?

https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas might help you. You might have to do more than drop_duplicates() it seems. — zedfoxus, Jun 06 '20 at 20:48
@aguest0606 You are welcome to bring closure to your question by marking one of the answers as accepted. You can do that by clicking the tick mark next to the answer of your choice. — zedfoxus, Jun 08 '20 at 17:40

score 0 · Answer 1 · answered Jun 06 '20 at 21:06

I don't know pandas but you don't need it. You could try something like this:

import os

file_list = os.listdir()

# loop through the list
for filename in file_list:

    # don't process any non csv file
    if not filename.endswith('.csv'):
        continue

    # lines will be a temporary holding spot to check 
    # for duplicates
    lines = []
    new_file = filename.replace('.csv', '_new.csv')

    # open 2 files - csv file and new csv file to write
    with open(filename, 'r') as fr, open(new_file, 'w') as fw:

        # read line from csv
        for line in fr:

            # if that line is not in temporary list called lines,
            #   add it there and write to file
            # if that line is found in temporary list called lines,
            #   don't do anything
            if line not in lines:
                lines.append(line)
                fw.write(line)

print('Done')

Result

Original file

cat name.csv
id,name
1,john
1,john
2,matt
1,john

New file

cat name_new.csv 
id,name
1,john
2,matt

Another original file

cat pay.csv
id,pay
1,100
2,300
1,100
4,400
4,400
2,300
4,400

Its new file

id,pay
1,100
2,300
4,400
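
One thing to watch with the loop above: `line not in lines` scans the whole list for every line read, which can get slow on large files. Below is a minimal sketch of the same idea using a set for the lines already seen. The function name `dedupe_lines` is hypothetical, and stripping the line ending before comparing is an added assumption so that a last line without a trailing newline still matches an earlier copy.

```python
def dedupe_lines(src_path, dest_path):
    """Copy src_path to dest_path, keeping only the first occurrence of each line."""
    seen = set()  # line contents already written; set lookups are O(1) on average
    # newline='' keeps the original line endings untouched on read and write
    with open(src_path, 'r', newline='') as fr, open(dest_path, 'w', newline='') as fw:
        for line in fr:
            # compare without the line ending (assumption, see lead-in)
            key = line.rstrip('\r\n')
            if key not in seen:
                seen.add(key)
                fw.write(line)
```

Calling `dedupe_lines('name.csv', 'name_new.csv')` would replace the body of the `with` block in the loop above. Like the original, it compares raw text, so rows that differ only in whitespace or quoting still count as different.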

score 0 · Accepted Answer · answered Jun 06 '20 at 21:34

Update

The following script works. It passes the newfile variable to to_csv instead of the literal string "newfile", and with a slight modification it reads from a Src folder and writes to a Dest folder:

import csv
import os
import pandas as pd

path = "/Users/<localdir>/Documents/Data/Src"
newPath = "/Users/<localdir>/Documents/Data/Dest"
file_list = os.listdir(path)

for file in file_list:
    fullpath = os.path.join(path, file)
    data = pd.read_csv(fullpath)
    newdata = data.drop_duplicates()
    newfile = file.replace(".csv","_new.csv")
    if not os.path.isfile(os.path.join(newPath, newfile)):
        newdata.to_csv(os.path.join(newPath, newfile), index=False, header=True)

I also added a check to see if a file already exists in the Dest folder.

I would be keen to know whether there is a better way to write this script.
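
One possible tidy-up, offered as a sketch rather than the definitive way: use `pathlib` to pick up only `*.csv` files, so that anything else in the folder is never passed to `pd.read_csv`, and wrap the logic in a function. The name `dedupe_folder` and the `overwrite` parameter are hypothetical. Creating the destination folder when it is missing is an added assumption, not something the script above does.

```python
from pathlib import Path

import pandas as pd


def dedupe_folder(src_dir, dest_dir, overwrite=False):
    """Write a de-duplicated copy of every .csv file in src_dir to dest_dir.

    Returns the list of files that were written.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    # assumption: create the destination folder if it does not exist yet
    dest_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # glob("*.csv") skips non-CSV files; sorted() gives a stable order
    for src in sorted(src_dir.glob("*.csv")):
        dest = dest_dir / f"{src.stem}_new.csv"
        # same check as above: leave an existing output file alone
        if dest.exists() and not overwrite:
            continue
        pd.read_csv(src).drop_duplicates().to_csv(dest, index=False)
        written.append(dest)
    return written
```

With the folders from the script above, the call would be `dedupe_folder("/Users/<localdir>/Documents/Data/Src", "/Users/<localdir>/Documents/Data/Dest")`. Keeping Src and Dest separate matters here: if both pointed at the same folder, a second run would treat the `*_new.csv` files as inputs.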
