
I have multiple CSV files with two columns in each of these CSV files:

  1. Links (Column A)
  2. Description (Column B)

I don't know the best way to remove all duplicates of a link-and-description pair when found, leaving only one, so that a single instance of each link and description remains. It would be best if I could import all of the CSV files at once, since there is a possibility that one link appears in multiple CSV files. Where there is a duplicate, the link and description will be EXACTLY the same. Thanks!

shub.codes
  • Does this answer your question? [Python lists, csv, duplication removal](https://stackoverflow.com/questions/25486554/python-lists-csv-duplication-removal) – Daniel Brose May 07 '20 at 05:09
  • If you're not dedicated to using Python, this is very easy to do in Bash: cat *.csv | sort | uniq > deduped.csv – water_ghosts May 07 '20 at 05:38

3 Answers


This can be done with pd.concat followed by drop_duplicates.

import pandas as pd

df1 = pd.read_csv('path/to/file1.csv')
df2 = pd.read_csv('path/to/file2.csv')

df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)

Please refer to the Stack Overflow answer linked here for more detail.
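
Since the question asks to import all of the CSV files at once, here is a minimal sketch generalizing the same idea with glob (the directory path and output filename are placeholders):

import glob
import pandas as pd

# Collect every CSV in the (hypothetical) input directory
csv_files = glob.glob('path/to/csv_dir/*.csv')

# Stack all rows, keep one copy of each exact duplicate, renumber the index
df = pd.concat(map(pd.read_csv, csv_files)).drop_duplicates().reset_index(drop=True)

df.to_csv('deduped.csv', index=False)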

santhisenan

This can be achieved with Pandas as follows:

import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df = df1.merge(df2, how="outer")  # outer join on the shared columns
print(df)

If you have two files, "file1.csv" and "file2.csv", each with Links and Description columns and some overlapping rows (their contents were shown as images in the original answer),

the output of the above code would be:

    Links               Description
0  movie1  Pirates of the Caribbean
1  movie2                 Star Trek
2  movie3                 Star Wars
3  movie4                James Bond
4  movie5                  Iron Man
5  movie6                 Toy Story

You can export this to a new CSV file if you wish via:

df.to_csv("file3.csv", index=False)

You may want to investigate the differences between pd.concat and pd.merge. See this link.
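
One way to see the difference: pd.concat stacks every row and needs an explicit drop_duplicates, while an outer merge on all shared columns deduplicates matching rows as part of the join (though duplicates within a single frame would survive it). A small sketch with made-up data:

import pandas as pd

a = pd.DataFrame({"Links": ["movie1", "movie2"],
                  "Description": ["Pirates of the Caribbean", "Star Trek"]})
b = pd.DataFrame({"Links": ["movie2", "movie3"],
                  "Description": ["Star Trek", "Star Wars"]})

# concat keeps the shared movie2 row twice until it is dropped explicitly
stacked = pd.concat([a, b]).drop_duplicates().reset_index(drop=True)

# outer merge on the shared columns keeps one copy of the matching row
merged = a.merge(b, how="outer")

print(stacked.equals(merged))  # True for these inputs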

Rexovas

You can use a set to keep track of which rows you've already seen, regardless of which file the row was in. As you loop through each row in each file, you can ignore any rows already in the set.

Note that the csv module will return each row as a list, which isn't a hashable datatype. To store the row in a set, you'll have to convert it to a tuple first.

import csv

already_seen_rows = set()

with open('output_file.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file)

    for input_filepath in list_of_input_filepaths:  # os.listdir() might help here
        with open(input_filepath, 'r', newline='') as input_file:
            reader = csv.reader(input_file)

            for row in reader:
                row_tuple = tuple(row)  # Converting to a hashable type so it works with the set

                if row_tuple not in already_seen_rows:
                    writer.writerow(row)
                    already_seen_rows.add(row_tuple)
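
For completeness, one possible way to build list_of_input_filepaths, using glob instead of the os.listdir() hint in the comment and assuming the input CSVs sit in the working directory (the pattern is a placeholder):

import glob

# Gather every CSV in the working directory, skipping the output file itself
list_of_input_filepaths = [path for path in glob.glob('*.csv')
                           if path != 'output_file.csv']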
water_ghosts