
I’m trying to find duplicates in a single CSV file with Python. Through my search I found dedupe.io, a platform that uses Python and machine-learning algorithms to detect duplicate records, but it isn’t a free tool. I also don’t want to use the traditional approach, in which the columns to compare must be specified explicitly. I would like to find a way to detect duplicates with high accuracy. Is there any tool or Python library that can find duplicates in text datasets?

  • Here is an example that clarifies this:

      Title, Authors, Venue, Year
      1- Clustering validity checking methods: part II, Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
      2- Cluster validity methods: part I, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
      3- Book reviews, Karl Aberer, ACM SIGMOD Record, 2003
      4- Book review column, Karl Aberer, ACM SIGMOD Record, 2003
      5- Book reviews, Leonid Libkin, ACM SIGMOD Record, 2003
    

So, we can decide that records 1 and 2 are not duplicates, even though they contain very similar data, because the Title column differs slightly. Records 3 and 4 are duplicates, but record 5 does not refer to the same entity.
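To make the intended behaviour concrete, here is a minimal fuzzy-matching sketch over the sample records above, using only the standard library's difflib. The 0.7 and 0.9 thresholds are arbitrary assumptions for illustration; a real record-linkage tool (such as the recordlinkage library or dedupe) would learn or tune such parameters instead.

```python
from difflib import SequenceMatcher

# (Title, Authors) for the five sample records from the question.
records = [
    ("Clustering validity checking methods: part II",
     "Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis"),
    ("Cluster validity methods: part I",
     "Yannis Batistakis, Michalis Vazirgiannis"),
    ("Book reviews", "Karl Aberer"),
    ("Book review column", "Karl Aberer"),
    ("Book reviews", "Leonid Libkin"),
]

def similarity(a, b):
    # Case-insensitive string similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = []
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        # Require both a fairly similar title AND nearly identical authors;
        # the cutoffs (0.7, 0.9) are assumptions, not tuned values.
        if (similarity(records[i][0], records[j][0]) > 0.7
                and similarity(records[i][1], records[j][1]) > 0.9):
            pairs.append((i + 1, j + 1))

print(pairs)
```

On this data the only pair flagged is (3, 4): records 1 and 2 are rejected because their author lists differ, and records 3 and 5 because "Karl Aberer" and "Leonid Libkin" are dissimilar, matching the intended outcome described above.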

matrix
  • give a short example of what you are trying to achieve – dgumo Sep 30 '20 at 09:00
  • use `pandas.DataFrame.drop_duplicates()` [documentation](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-across-multiple-columns-in-python-pandas) – Poe Dator Sep 30 '20 at 09:05
  • Please give more details about the task you try to achieve, are you trying to find exact duplicates or doing record linkage? – A Co Sep 30 '20 at 09:22
  • Does this answer your question? [Python - Display rows with repeated values in csv files](https://stackoverflow.com/questions/24698217/python-display-rows-with-repeated-values-in-csv-files) ; https://stackoverflow.com/questions/4095523/script-to-find-duplicates-in-a-csv-file ; https://stackoverflow.com/questions/40386356/finding-total-number-of-duplicates-in-csv-file?rq=1 ; Did you try any of those? – Tomerikoo Sep 30 '20 at 09:40
  • @A Co it is record linkage – matrix Sep 30 '20 at 10:39

2 Answers


Pandas provides a very straightforward way to achieve this: pandas.DataFrame.drop_duplicates.

Given the following file (data.csv) stored in the current working directory:

name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
John Doe,25,50000
Louise Jones,25,50000

The following script removes duplicate records, writing the processed data to a CSV file (processed_data.csv) in the current working directory.

import pandas as pd

df = pd.read_csv("data.csv")
df = df.drop_duplicates()
df.to_csv("processed_data.csv", index=False)

The resulting output in this example is:

name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
Louise Jones,25,50000

pandas.DataFrame.drop_duplicates also allows dropping rows that duplicate values in specific columns only (instead of duplicates of entire rows); the column names are specified using the subset argument.

For example:

import pandas as pd

df = pd.read_csv("data.csv")
df = df.drop_duplicates(subset=["age"])
df.to_csv("processed_data.csv", index=False)

This will remove all rows containing a duplicate value in the age column, keeping only the first record for each value that recurs in the age field of later records.

In this example, the output would be:

name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000

JPI93

Thanks @JPI93 for your answer, but some duplicates still exist and weren't removed. I think this method only works for exact duplicates; if that's the case, it's not what I'm looking for. I want to apply record linkage, which identifies records that refer to the same entity so that they can then be removed.

  • If you want to drop duplicates based on a specific column value (instead of just entire row matches) you can pass the column name(s) to be checked by passing the `subset` kwarg to `drop_duplicates()` - [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) – JPI93 Dec 12 '20 at 11:06
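Since drop_duplicates() only removes exact row matches, here is a hedged sketch of the record-linkage idea using pandas plus the standard library's difflib: normalise and fuzzily compare each row against the rows already kept, and drop near-matches. The 0.7 and 0.9 cutoffs are assumptions for illustration, not tuned values, and this quadratic scan is only suitable for small datasets.

```python
from difflib import SequenceMatcher

import pandas as pd

# Small sample mirroring records 3-5 from the question.
df = pd.DataFrame({
    "Title": ["Book reviews", "Book review column", "Book reviews"],
    "Authors": ["Karl Aberer", "Karl Aberer", "Leonid Libkin"],
})

def is_fuzzy_dup(row, kept):
    # A row counts as a duplicate if both Title and Authors closely
    # match some already-kept row (thresholds are assumptions).
    for _, k in kept.iterrows():
        t = SequenceMatcher(None, row["Title"].lower(), k["Title"].lower()).ratio()
        a = SequenceMatcher(None, row["Authors"].lower(), k["Authors"].lower()).ratio()
        if t > 0.7 and a > 0.9:
            return True
    return False

kept = df.iloc[0:0]  # empty frame with the same columns
for _, row in df.iterrows():
    if not is_fuzzy_dup(row, kept):
        kept = pd.concat([kept, row.to_frame().T], ignore_index=True)

print(kept)
```

Here "Book review column" by Karl Aberer is dropped as a near-duplicate of "Book reviews" by the same author, while the Leonid Libkin record survives because the author differs. For anything beyond toy data, a dedicated library such as recordlinkage or dedupe (mentioned in the question) implements this comparison with blocking and learned thresholds.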