I have a dataframe with ~400,000 rows and multiple columns. One of these columns contains strings of text. After some initial text cleaning (code below), I end up with the subset of my dataframe shown under "Remaining misspelled words".

Data cleaning

from nltk.corpus import stopwords
from unidecode import unidecode

stopwords_nl = set(stopwords.words("dutch"))


def clean_text(df, stopwords):
    
    regex_rules = {
        # remove linebreaks
        r"\n": " ", 
        # remove return characters
        r"\r": " ",  
        # remove any non-alphanumeric characters
        r"[^a-zA-Z0-9]": " ", 
        # replace multiple spaces by one
        r"\s+": " ",
        # normalize the most commonly used words to a single form
        r"(?i)\bverv\w*(?:\b|\.)": "vervangen",
        r"(?i)\bherst\w*(?:\b|\.)": "herstellen",
        r"(?i)\bcons\w*(?:\b|\.)": "conserveren",
        r"(?i)\bonderh\w*(?:\b|\.)": "onderhouden",
        r"(?i)\brepar\w*(?:\b|\.)": "herstellen",
        r"(?i)\bgara\w*(?:\b|\.)": "garantie",
        r"(?i)\brevi\w*(?:\b|\.)": "reviseren",
    }

    # build one alternation pattern from the stopwords passed in as argument
    stopword_pattern = {"|".join([r"\b{}\b".format(w) for w in stopwords]): ""}
    
    return (df
        # convert to lowercase
        .assign(text_cleaned=lambda df_:
                df_.Maatregel_naam.astype(str).str.lower())
        # remove accents from letters and remove any non-ascii characters
        .assign(text_cleaned=lambda df_: 
                  df_.text_cleaned.apply(unidecode))
        # remove stopwords
        .assign(text_cleaned=lambda df_: 
                  df_.text_cleaned.replace(stopword_pattern, regex=True))
        # use regex rules to replace text that we are not interested in
        .assign(text_cleaned=lambda df_: 
                  df_.text_cleaned.replace(regex_rules, regex=True))
       
        )
                    
        
df = clean_text(DISK_data, stopwords_nl)

Remaining misspelled words

df subset:

text_cleaned
1   reviseren hydraulische aandrijving
2   vervangen aandrijfing bewegingsw. 
3   conserveren aandr bew werk voetgangersbrug
4   reviseren hydraulische aandrijving voetgangersbr

There are still some misspelled words, abbreviations or technical words in the cleaned dataframe such as:

  • "aandrijfing" must be "aandrijving"
  • "bewegingsw." must be "bewegingswerk"
  • "aandr" must be "aandrijving"
  • "bew" must be "bewegingswerk"
  • "voetgangersbr" must be "voetgangersbrug"
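
Abbreviations such as "aandr" or "bew" are probably too far from their full form for a spell checker to recover. One option is to expand a token only when it is the prefix of exactly one known word. This is a minimal sketch of that idea, not a tested solution for the real data: `VOCABULARY`, `expand_abbreviation` and `expand_text` are hypothetical names, and the vocabulary is a toy list standing in for a real domain word list.

```python
# Hypothetical toy vocabulary; a real one would come from a domain word list.
VOCABULARY = ["aandrijving", "bewegingswerk", "voetgangersbrug", "hydraulische"]


def expand_abbreviation(token, vocabulary, min_length=3):
    """Expand `token` to the single vocabulary word it is a prefix of."""
    stem = token.rstrip(".")  # "bewegingsw." -> "bewegingsw"
    if stem in vocabulary:
        return stem
    if len(stem) < min_length:
        return token  # too short to expand safely
    candidates = [word for word in vocabulary if word.startswith(stem)]
    # Only expand when the prefix is unambiguous; otherwise keep the token.
    return candidates[0] if len(candidates) == 1 else token


def expand_text(text, vocabulary):
    """Expand every whitespace-separated token of `text`."""
    return " ".join(expand_abbreviation(t, vocabulary) for t in text.split())


print(expand_text("conserveren aandr bew werk voetgangersbrug", VOCABULARY))
```

A real misspelling such as "aandrijfing" is left untouched by this step, because it is not a prefix of any vocabulary word. It would still need a fuzzy-matching step.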

Reference posts:

  • https://stackoverflow.com/a/67934983/17931594
  • https://stackoverflow.com/questions/24078723/replace-word-w-r-t-word-in-another-column-using-levenshtein-distance
  • https://stackoverflow.com/q/56488402/17931594
  • https://www.appsloveworld.com/pandas/100/32/how-to-replace-misspelled-words-in-a-pandas-dataframe

Tried solutions

The code in some of these posts either took too long to run or only handled a single word as a string, rather than multiple words within each string of a dataframe column. The common solutions based on the pyspellchecker and autocorrect libraries do not work because the text is Dutch instead of English.

I also tried to add a Dutch dictionary from https://github.com/OpenTaal/opentaal-wordlist to the autocorrect library, so that autocorrect could replace the misspelled words in my dataframe. But this didn't work either. See code below:

import os

import pandas as pd

# load the OpenTaal word list, one word per line
os.chdir(r"C:\Users\datalab-c01\Documents\PIHP\data")
lines = open("wordlist.txt", encoding="utf8").read().splitlines()
df_lines = pd.DataFrame(lines, columns=["words"])
dictionary_words_nl = {"words": lines}

from autocorrect import Speller

spell = Speller(nlp_data=dictionary_words_nl)
spell("aandrijfing of aandri is moeilijk om te spellen. het moet aandrijving zijn.")

Part of the difficulty is that the text is Dutch.

How can I replace the misspelled words or technical language in each string of the dataframe with correct words (Dutch)?

Solution suggestion

Split the strings in the "text_cleaned" column into words. Drop duplicate words and use a spell checker to correct the misspelled ones. Then replace the corrected words in the split strings and join everything back into the original strings.
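
The steps above can be sketched with the standard library's `difflib.get_close_matches`, which is language-agnostic because it only compares against the word list it is given. This is a rough sketch under assumptions, not a verified solution: `VOCABULARY`, `build_corrections`, `apply_corrections` and the `cutoff=0.8` threshold are all hypothetical choices, and the toy vocabulary stands in for a real Dutch or domain word list.

```python
from difflib import get_close_matches

# Hypothetical toy vocabulary; replace with a real Dutch/domain word list.
VOCABULARY = ["aandrijving", "bewegingswerk", "hydraulische",
              "reviseren", "vervangen", "voetgangersbrug"]


def build_corrections(texts, vocabulary, cutoff=0.8):
    """Map each distinct unknown word to its closest vocabulary word."""
    known = set(vocabulary)
    # Split every string and keep each word only once.
    unique_words = {word for text in texts for word in text.split()}
    corrections = {}
    for word in unique_words - known:
        match = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if match:  # leave words without a close match unchanged
            corrections[word] = match[0]
    return corrections


def apply_corrections(text, corrections):
    """Replace corrected words and join the string back together."""
    return " ".join(corrections.get(word, word) for word in text.split())


texts = ["vervangen aandrijfing bewegingswerk",
         "reviseren hydraulische aandrijving"]
corrections = build_corrections(texts, VOCABULARY)
print([apply_corrections(text, corrections) for text in texts])
```

With a dataframe, the same two functions could be applied by passing `df["text_cleaned"]` to `build_corrections` and then mapping `apply_corrections` over the column. Each distinct word is matched only once, so the fuzzy-matching cost should scale with the number of unique words rather than the number of rows.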

Any help is appreciated.

  • marginal tangent: you have a duplicate replace rule for `reviseren`. Also, that's pretty Nederengels, you might want "aanpassen" instead. – Mike 'Pomax' Kamermans Jul 18 '23 at 17:53
  • dataframe.column_name.str.replace allows for replacing inside the strings of a dataframe column. – error Jul 18 '23 at 18:15
  • Vectorized way of doing this is to explode strings into words, replace them using a dictionary, glue back together. Or, write a function to process one string (possibly doing the same), and apply to all cells. – Marat Jul 18 '23 at 18:38
  • I was also thinking using Levenshtein distance to match the words of the (exploded) string with a dictionary, but I couldn't find a good solution on stackoverflow. – user22247751 Jul 18 '23 at 18:45
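
The Levenshtein idea from the last comment could look roughly like the following pure-Python sketch. The function names, the `max_distance=2` threshold and the toy `VOCABULARY` are assumptions made for illustration. For a large word list a compiled edit-distance library would likely be needed for acceptable speed.

```python
# Hypothetical toy vocabulary; replace with a real Dutch/domain word list.
VOCABULARY = ["aandrijving", "bewegingswerk", "voetgangersbrug"]


def levenshtein(a, b):
    """Dynamic-programming edit distance between strings `a` and `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_word(word, vocabulary, max_distance=2):
    """Return the nearest vocabulary word within `max_distance`, else `word`."""
    best = min(vocabulary, key=lambda candidate: levenshtein(word, candidate))
    return best if levenshtein(word, best) <= max_distance else word


print(closest_word("aandrijfing", VOCABULARY))
```

The distance cap keeps short abbreviations such as "aandr" from being forced onto an unrelated word. They are returned unchanged and need a separate rule.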