I am working on this French monuments dataset here. The tico column contains a number of duplicates in French and English.
Some of these duplicates are difficult to catch because they contain accented letters. For example, you might find maisonduesiècle and maisonduesiecle. These are duplicates: one contains an accented e, the other doesn't. French has other accented characters, such as ç, é, â, ê, î, ô, û, à, è, ì, ò, ù, ë, ï, ü, that may be present in some monument names but absent in others. How can I filter these duplicates?
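For illustration, accent-insensitive comparison can be sketched with the standard library's unicodedata module. This is only a minimal sketch, not a tested solution for the full dataset, and the helper name strip_accents is hypothetical:

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep the base letters and drop the combining marks (accents, cedilla).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The two spellings from the example above compare equal once folded.
print(strip_accents("maisonduesiècle") == strip_accents("maisonduesiecle"))
```

Both names reduce to the same string, so they can be compared or stored under one key.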
Singular and plural forms of the same noun should also be considered duplicates. For example, maison and maisons are duplicates.
Some names may also contain typos, e.g. maiso2n is maison with a typo.
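The plural and typo rules can be sketched as a single comparison key. This assumes that a typo is always a stray non-letter character, as in maiso2n, and that a plural is marked only by one trailing s. The function name comparison_key is hypothetical, and the rule is deliberately naive: it also trims names that naturally end in s, which is harmless only because every name is trimmed the same way.

```python
def comparison_key(name: str) -> str:
    """Build a naive key for comparing names (hypothetical helper)."""
    # Keep letters only, lowercased; this drops digits such as the 2 in "maiso2n".
    letters = "".join(ch.lower() for ch in name if ch.isalpha())
    # Assumption: a single trailing "s" marks a plural, so remove it.
    if len(letters) > 1 and letters.endswith("s"):
        letters = letters[:-1]
    return letters


print(comparison_key("maisons") == comparison_key("maison"))
print(comparison_key("maiso2n") == comparison_key("maison"))
```

Because the key is computed the same way for every name, it does not matter whether the singular or the plural form appears first in the column.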
I have tried to address this with the code below:
    import pandas as pd
    monuments = pd.read_csv("data/liste-des-immeubles-proteges-au-titre-des-monuments-historiques.csv", sep=";")
    name = ''
    sieve=[]
    for i in range(len(monuments['tico'])):
        for k in range(len(monuments['tico'][i])):
            if monuments['tico'][i][k].isalpha():
                name+=monuments['tico'][i][k].lower()
            else:
                pass
        if (name not in sieve) and (name+'s' not in sieve):
            sieve.append(name)
            name = ''
        else:
            name=''
    print(len(sieve))
However, I get 25399 instead of 23874 as the number of unique monuments. Here sieve is the list I created to store the unique monuments. Any help will be appreciated.
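Two properties of the loop above may explain part of the gap. First, str.isalpha() is true for accented letters, so accented and unaccented spellings are stored as different names. Second, the check name+'s' not in sieve only catches a singular whose plural was stored earlier; a plural that arrives after its singular is still appended. A minimal sketch that folds all three rules into one key and counts keys with a set is shown below. The names dedup_key and count_unique are hypothetical, the sample list is made up from the examples in this question, and I cannot confirm that this reproduces 23874 on the real file.

```python
import unicodedata


def dedup_key(name: str) -> str:
    """Hypothetical key: letters only, lowercased, accents folded, one trailing 's' dropped."""
    # NFD separates accents from base letters; the separated marks are not
    # alphabetic, so the isalpha() filter removes them along with digits.
    decomposed = unicodedata.normalize("NFD", name.lower())
    letters = "".join(ch for ch in decomposed if ch.isalpha())
    # Assumption: a single trailing "s" marks a plural.
    if len(letters) > 1 and letters.endswith("s"):
        return letters[:-1]
    return letters


def count_unique(names) -> int:
    """Count distinct keys, skipping non-string entries such as missing values."""
    return len({dedup_key(n) for n in names if isinstance(n, str)})


sample = ["maison", "maisons", "maisonduesiècle", "maisonduesiecle", "maiso2n"]
print(count_unique(sample))  # prints 2
```

On the real data this would be called as count_unique(monuments['tico']). Using a set instead of a list also avoids rescanning every stored name for each membership test.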
Edit: The link to the dataset has been updated.