How can I catch all the accented characters using regex in order to remove duplicates in my column?

Question

I am working on this French monuments dataset here. The tico column contains a number of duplicates in French and English.

Some of these duplicates are difficult to catch because they contain accented letters. For example, you might find maisonduesiècle and maisonduesiecle. These are duplicates: one contains an accented e and the other doesn't. Other accented letters such as ç,é,â,ê,î,ô,û,à,è,ì,ò,ù,ë,ï,ü may be present in some monument names but absent in others. How can I filter these duplicates?
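One common way to make accented and unaccented spellings compare equal is to strip the accents rather than match them with a regex. The snippet below is a minimal sketch using only the standard library; the helper name `strip_accents` is an illustrative choice and is not part of the code in this question.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed."""
    # NFKD decomposition splits "è" into "e" followed by a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the base characters; combining marks have a non-zero class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("maisonduesiècle"))  # maisonduesiecle
```

After this step, maisonduesiècle and maisonduesiecle produce the same string, so an ordinary equality or set lookup treats them as one name.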

Singular and plural forms of the same noun are also to be considered duplicates. For example, maison and maisons are duplicates.

Some names may also contain typos, e.g. maiso2n is maison with a typo.
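The plural and typo rules described above could be folded into a single comparison key, so that the result does not depend on which spelling appears first in the column. This is only a sketch under stated assumptions: `normalize_name` is a hypothetical helper, "typo" is taken to mean stray non-letter characters as in maiso2n, and "plural" is taken to mean a single trailing s.

```python
def normalize_name(raw: str) -> str:
    """Build a comparison key: letters only, lower case, no trailing 's'."""
    # Dropping non-letters removes digits, spaces and punctuation (maiso2n -> maison).
    letters = "".join(ch.lower() for ch in raw if ch.isalpha())
    # Naive plural rule (assumption): remove one trailing "s".
    if len(letters) > 1 and letters.endswith("s"):
        letters = letters[:-1]
    return letters

names = ["maison", "Maisons", "maiso2n"]
unique = {normalize_name(n) for n in names}
print(unique)  # {'maison'}
```

The key is meant for comparison only, not display: a word that naturally ends in s is shortened too, and typos made of wrong letters are not detected by this rule.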

I have tried to address this with the code below:

import pandas as pd
monuments = pd.read_csv("data/liste-des-immeubles-proteges-au-titre-des-monuments-historiques.csv", sep=";")   

name = ''
sieve=[]
for i in range(len(monuments['tico'])):
    for k in range(len(monuments['tico'][i])):
        if monuments['tico'][i][k].isalpha():
            name+=monuments['tico'][i][k].lower()
        else:
            pass
    if (name not in sieve) and (name+'s' not in sieve):
        sieve.append(name)
        name = ''
    else:
        name=''
print(len(sieve))

However, I get 25399 instead of 23874 as the number of unique monuments. sieve is the list I created to store the unique monuments. Any help will be appreciated.

Edit: The link to the dataset has been updated.

There is no `tico` column in the linked file. Please double check your example — mozway, Nov 24 '22 at 09:37
**1st** maybe you need [`strip_accents`](https://stackoverflow.com/a/518232/3439404) (words like `théâtre` vs `théatre` vs `theatre` etc), and **2nd** `if (name not in sieve) and (name+'s' not in sieve)` detects only _overall plurals_ like `maison` and `maisons` and doesn't detect _inner plurals_ like `Maison Renaissance` vs `Maisons Renaissance` (hypothetical examples). — JosefZ, Nov 24 '22 at 14:58
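The two points in the comment above (accent stripping and inner plurals) could be combined by normalizing each word separately. The following is a rough sketch, not a tested solution for this dataset: `strip_accents` and `word_key` are hypothetical helper names, and the plural rule is again the naive "drop one trailing s" assumption.

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def word_key(raw: str) -> str:
    """Comparison key built word by word, so inner plurals are handled."""
    words = []
    for word in strip_accents(raw).lower().split():
        # Keep letters only, which also removes stray digits and punctuation.
        letters = "".join(ch for ch in word if ch.isalpha())
        # Naive plural rule (assumption): remove one trailing "s" per word.
        if len(letters) > 1 and letters.endswith("s"):
            letters = letters[:-1]
        if letters:
            words.append(letters)
    return " ".join(words)

print(word_key("Maisons Renaissance"))  # maison renaissance
print(word_key("théâtre") == word_key("theatre"))  # True
```

With pandas, something like `monuments['tico'].dropna().map(word_key).nunique()` should then count the distinct keys, assuming the column holds strings once missing values are dropped.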
