I am having a problem with my regex code:
round1 = re.sub(r'\W+', '\n', stringFilter)
It doesn't remove the stray non-word fragments.
Example output: s , word , does , au
Desired cleaned output: word , does
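The pattern \W+ only matches runs of non-word characters, so fragments such as "s" and "au", which consist of word characters, pass through untouched. Perhaps you can do something fancier with an NLP pipeline such as Stanza (https://stanfordnlp.github.io/stanza/), for example filtering out words that don't get POS tags, lemmatizing, etc. You could also check each word against a word corpus in NLTK and discard it if it is not there.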
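A minimal sketch of the corpus-lookup idea, assuming a plain Python set is good enough as the word list. The names clean_tokens, vocabulary and min_length are made up for illustration, and the two-word vocabulary is only a stand-in. With NLTK installed and its word list downloaded (nltk.download('words')), the set could instead be built from nltk.corpus.words.words().

```python
import re


def clean_tokens(text, vocabulary, min_length=2):
    """Split text on non-word characters and keep only known words.

    vocabulary and min_length are illustrative assumptions, not part of
    the original question.
    """
    # Split on runs of non-word characters; drop empty strings that
    # appear when the text starts or ends with a separator.
    tokens = [t for t in re.split(r'\W+', text) if t]
    # Keep tokens that are long enough and present in the vocabulary
    # (compared case-insensitively).
    return [
        t for t in tokens
        if len(t) >= min_length and t.lower() in vocabulary
    ]


# Tiny stand-in vocabulary; a real one might come from an NLTK corpus.
vocabulary = {"word", "does"}
print(clean_tokens("s , word , does , au", vocabulary))
```

The length check is an extra guard, since a dictionary-style word list may contain single letters that you would still want to drop.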