How to remove non-word characters in NLP

Question

I am having a problem with my regex code:

round1 = re.sub(r'\W+', '\n', stringFilter)

It doesn't remove non-word tokens such as 's' and 'au'.

Current output: s , word , does , au

Desired output: word , does

Does this answer your question? [How to check if a word is an English word with Python?](https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python) – Davide Fiocco, Nov 03 '20 at 10:40
You can use the NLTK library. I don't have it installed now, but it should work. – mulaixi, Nov 03 '20 at 10:42
@DavideFiocco No, it does the same thing as the regex and corpus words. It still recognizes 's' as an English word, even though it is not a word. – Akio Saito, Nov 03 '20 at 10:56
@mulaixi I'm already using NLTK, but which library did you mean? I'm still trying corpus.words, but I still can't remove the 's' character and other characters. – Akio Saito, Nov 03 '20 at 10:57
If NLTK cannot do it, I don't know what could. Maybe you can check spaCy and compare against its vocabulary. @AkioSaito https://stackoverflow.com/questions/54495502/how-to-get-all-words-from-spacy-vocab – mulaixi, Nov 03 '20 at 11:04

score 0 · Answer 1 · answered Nov 03 '20 at 11:16

Perhaps you can do something fancy in an NLP pipeline (https://stanfordnlp.github.io/stanza/) such as filtering out words that don't have POS tags, lemmatizing, etc. You could also check the corpus in NLTK and if the word is not in it, you can discard it.
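The corpus-lookup idea above could be sketched roughly as follows. This is only an illustration, not a tested NLTK recipe: the function name `keep_known_words`, the `min_length` parameter, and the tiny `sample_vocabulary` are all hypothetical. The minimum-length check is an added assumption, included because the comments on the question report that a plain vocabulary lookup still lets 's' through.

```python
import re


def keep_known_words(text, vocabulary, min_length=2):
    """Return the tokens of `text` that are in `vocabulary` and long enough."""
    # Split on runs of non-word characters (same pattern as in the question).
    tokens = re.split(r'\W+', text)
    # Lowercase the vocabulary once so the lookup is case-insensitive.
    vocab = {w.lower() for w in vocabulary}
    # Drop empty strings and short tokens, then keep only known words.
    return [t for t in tokens if len(t) >= min_length and t.lower() in vocab]


# Hypothetical toy vocabulary for illustration. With NLTK installed, one could
# instead try:
#   from nltk.corpus import words   # needs nltk.download('words') once
#   vocabulary = words.words()
sample_vocabulary = {"word", "does", "s"}

print(keep_known_words("s , word , does , au", sample_vocabulary))
```

Here 's' is dropped by the length check even though it is in the vocabulary, and 'au' is dropped because it is not in the vocabulary. Note that a length threshold of 2 would also discard legitimate one-letter words such as "a" and "I", so it may need adjusting for your data.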

answered Nov 03 '20 at 11:16

Adam


I've already done tokenizing, lemmatizing, and stopword removal, and I've been trying corpus.words(), but I'm still getting letters/characters like 's'. My code is: if data in words: print(data). I'm thinking about how I should filter or clean this data. – Akio Saito Nov 03 '20 at 13:01
Yes, I tried lemmatizing and POS tagging, and I don't think it would work. If that's the case, you can try what @mulaixi suggested. This might help too. Maybe you could tell us what the problem with NLTK is? https://stackoverflow.com/questions/41290028/removing-non-english-words-from-text-using-python – Adam Nov 03 '20 at 15:59
