I'm trying to find duplicates in a single CSV file with Python. While searching, I found dedupe.io, a platform that uses Python and machine learning algorithms to detect duplicate records, but it isn't a free tool. I also don't want to use the traditional approach, in which the columns to compare must be specified up front. Instead, I'd like to detect duplicates with high accuracy. Is there a tool or Python library for finding duplicates in text datasets?

Here is an example that should clarify what I mean:
Title, Authors, Venue, Year
1- Clustering validity checking methods: part II, Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
2- Cluster validity methods: part I, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
3- Book reviews, Karl Aberer, ACM SIGMOD Record, 2003
4- Book review column, Karl Aberer, ACM SIGMOD Record, 2003
5- Book reviews, Leonid Libkin, ACM SIGMOD Record, 2003
So we can decide that records 1 and 2 are not duplicates, even though they contain almost the same data and differ only slightly in the Title column (part I and part II are different papers). Records 3 and 4 are duplicates, but record 5 does not refer to the same entity because the author is different.
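For context, below is a minimal sketch of the traditional, column-specific fuzzy matching I'm trying to avoid. It assumes the data sits in a file called records.csv with the header shown above, and it uses the rapidfuzz library for string similarity; the threshold of 80 is an arbitrary guess, not a tuned value:

```python
import csv
from itertools import combinations

from rapidfuzz import fuzz  # pip install rapidfuzz

# Read the CSV into a list of dicts keyed by the header row.
with open("records.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Compare every pair of records on the Title column only.
for a, b in combinations(rows, 2):
    # Similarity in 0-100, insensitive to word order.
    score = fuzz.token_sort_ratio(a["Title"], b["Title"])
    if score >= 80:  # arbitrary cut-off
        print(f"{score:5.1f}  {a['Title']!r}  <->  {b['Title']!r}")
```

The problem with this kind of per-column threshold shows up in the sample above: the titles of records 1 and 2 are at least as similar as those of records 3 and 4, so no single cut-off separates the true duplicates from the false ones. That is why I'm looking for a more accurate approach that doesn't require me to pick the columns and thresholds myself.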