I have a dataframe containing ~400,000 rows and multiple columns. One of these columns contains strings of text. After some initial text cleaning I end up with the following subset of my dataframe:
Data cleaning
from nltk.corpus import stopwords
from unidecode import unidecode
stopwords_nl = set(stopwords.words("dutch"))
def clean_text(df, stopwords):
    regex_rules = {
        # remove line breaks
        r"\n": " ",
        # remove carriage returns
        r"\r": " ",
        # remove any non-alphanumeric characters
        r"[^a-zA-Z0-9]": " ",
        # collapse multiple spaces into one
        r"\s+": " ",
        # normalize the most common abbreviations to their full words
        r"(?i)\bverv\w*(?:\b|\.)": "vervangen",
        r"(?i)\bherst\w*(?:\b|\.)": "herstellen",
        r"(?i)\bcons\w*(?:\b|\.)": "conserveren",
        r"(?i)\bonderh\w*(?:\b|\.)": "onderhouden",
        r"(?i)\brepar\w*(?:\b|\.)": "herstellen",
        r"(?i)\bgara\w*(?:\b|\.)": "garantie",
        r"(?i)\brevi\w*(?:\b|\.)": "reviseren",
    }
    # build one pattern that matches any stopword as a whole word
    stopword_pattern = {"|".join(r"\b{}\b".format(w) for w in stopwords): ""}
    return (df
            # convert to lowercase
            .assign(text_cleaned=lambda df_:
                    df_.Maatregel_naam.astype(str).str.lower())
            # remove accents from letters and remove any non-ASCII characters
            .assign(text_cleaned=lambda df_:
                    df_.text_cleaned.apply(unidecode))
            # remove stopwords
            .assign(text_cleaned=lambda df_:
                    df_.text_cleaned.replace(stopword_pattern, regex=True))
            # apply the regex rules above to replace text we are not interested in
            .assign(text_cleaned=lambda df_:
                    df_.text_cleaned.replace(regex_rules, regex=True))
            )
df = clean_text(DISK_data, stopwords_nl)
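As a quick sanity check outside pandas, the abbreviation rules can be exercised with plain `re.sub`; the two rules below are copied verbatim from `regex_rules`:

```python
import re

# Two of the normalization rules from regex_rules above, applied with plain
# re.sub as a quick sanity check (no dataframe needed).
rules = {
    r"(?i)\bverv\w*(?:\b|\.)": "vervangen",
    r"(?i)\bherst\w*(?:\b|\.)": "herstellen",
}

s = "verv leuning en herstel brugdek"
for pattern, replacement in rules.items():
    s = re.sub(pattern, replacement, s)

print(s)  # vervangen leuning en herstellen brugdek
```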
Remaining misspelled words
df subset:
text_cleaned
1 reviseren hydraulische aandrijving
2 vervangen aandrijfing bewegingsw.
3 conserveren aandr bew werk voetgangersbrug
4 reviseren hydraulische aandrijving voetgangersbr
There are still some misspelled words, abbreviations or technical words in the cleaned dataframe such as:
- "aandrijfing" must be "aandrijving"
- "bewegingsw." must be "bewegingswerk"
- "aandr" must be "aandrijving"
- "bew" must be "bewegingswerk"
- "voetgangersbr" must be "voetgangersbrug"
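Since these abbreviations are domain-specific, one option (a sketch, not a full solution) is a hand-maintained correction map applied as whole-word regex replacements; the map below only covers the examples listed above:

```python
import re

# Hypothetical, hand-maintained correction map for the domain-specific
# misspellings and abbreviations listed above.
corrections = {
    "aandrijfing": "aandrijving",
    "bewegingsw": "bewegingswerk",
    "aandr": "aandrijving",
    "bew": "bewegingswerk",
    "voetgangersbr": "voetgangersbrug",
}

# One alternation of whole words, longest first so that
# "bewegingsw" wins over "bew".
pattern = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, corrections), key=len, reverse=True))
    + r")\b"
)

def fix_abbreviations(text: str) -> str:
    return pattern.sub(lambda m: corrections[m.group(1)], text)

print(fix_abbreviations("conserveren aandr bew werk voetgangersbr"))
# conserveren aandrijving bewegingswerk werk voetgangersbrug
```

This only fixes known abbreviations, but it is fast enough for 400,000 rows (e.g. via `df.text_cleaned.apply(fix_abbreviations)`) and avoids false corrections from a generic spellchecker.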
Reference posts:
- https://stackoverflow.com/a/67934983/17931594
- https://stackoverflow.com/questions/24078723/replace-word-w-r-t-word-in-another-column-using-levenshtein-distance
- https://stackoverflow.com/q/56488402/17931594
- https://www.appsloveworld.com/pandas/100/32/how-to-replace-misspelled-words-in-a-pandas-dataframe
Tried solutions
The code from these posts either took too much running time or handled only a single word passed as a string, instead of multiple words per string in a dataframe column. The common solutions based on the pyspellchecker and autocorrect libraries do not work out of the box because the language is Dutch instead of English.
I also tried adding a Dutch dictionary from https://github.com/OpenTaal/opentaal-wordlist to the autocorrect library, so that autocorrect could replace the misspelled words in my dataframe, but this didn't work either. See the code below:
import os
import pandas as pd
from autocorrect import Speller

os.chdir(r"C:\Users\datalab-c01\Documents\PIHP\data")
lines = open("wordlist.txt", encoding="utf8").read().splitlines()
df_lines = pd.DataFrame(lines, columns=["words"])
dictionary_words_nl = {"words": lines}

spell = Speller(nlp_data=dictionary_words_nl)
spell("aandrijfing of aandri is moeilijk om te spellen. het moet aandrijving zijn.")
An added difficulty is that the language is Dutch. How can I replace the misspelled words and technical abbreviations in each string of the dataframe with the correct Dutch words?
Solution suggestion
Split the strings in the column "text_cleaned" of the dataframe. Drop duplicates and use a spellchecker to correct the misspelled words. Replace the corrected words in the split strings and glue everything back together into the original strings.
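A minimal sketch of this suggestion using only the standard library, with `difflib.get_close_matches` as a stand-in for a real spellchecker and a tiny stand-in wordlist instead of the full OpenTaal list:

```python
from difflib import get_close_matches

# Tiny stand-in for the OpenTaal wordlist and for the cleaned column.
wordlist = ["aandrijving", "bewegingswerk", "voetgangersbrug",
            "reviseren", "conserveren", "hydraulische", "werk"]
texts = ["reviseren hydraulische aandrijfing",
         "conserveren bewegingsw werk voetgangersbr"]

# 1. Collect the unique words once, so each misspelling is corrected only once.
unique_words = {w for t in texts for w in t.split()}

# 2. Map every word to its closest dictionary entry (keep it if nothing is close).
corrected = {
    w: (get_close_matches(w, wordlist, n=1, cutoff=0.7) or [w])[0]
    for w in unique_words
}

# 3. Rebuild each string from the corrected words.
fixed = [" ".join(corrected[w] for w in t.split()) for t in texts]
print(fixed)
# ['reviseren hydraulische aandrijving',
#  'conserveren bewegingswerk werk voetgangersbrug']
```

Because the lookup runs only over the deduplicated vocabulary instead of all ~400,000 rows, the expensive fuzzy matching happens once per unique word; the row-level replacement is then a cheap dictionary lookup.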
Any help is appreciated.