I have a bunch of user queries. Some of them contain junk tokens, e.g. "I work in Google asdasb asnlkasn".
I need to keep only "I work in Google".
import nltk
import spacy
import truecase

words = set(nltk.corpus.words.words())  # requires nltk.download('words')
nlp = spacy.load('en_core_web_lg')

def check_ner(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

sent = "I work in google asdasb asnlkasn"
sent = truecase.get_true_case(sent)
ner_list = check_ner(sent)
final_sent = " ".join(
    w for w in nltk.wordpunct_tokenize(sent)
    if w.lower() in words or not w.isalpha() or w in ner_list
)
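To make the keep-or-drop rule itself easy to check without downloading the NLTK corpus or a spaCy model, here is a minimal sketch of the same filter with hardcoded stand-ins. VOCAB and ENTITY_WHITELIST are assumptions for illustration, not the real corpora:

```python
import re

VOCAB = {"i", "work", "in"}       # stand-in for set(nltk.corpus.words.words())
ENTITY_WHITELIST = {"Google"}     # stand-in for the NER hits

def clean_query(sent):
    # rough equivalent of nltk.wordpunct_tokenize
    tokens = re.findall(r"\w+|[^\w\s]", sent)
    kept = [w for w in tokens
            if w.lower() in VOCAB or not w.isalpha() or w in ENTITY_WHITELIST]
    return " ".join(kept)

print(clean_query("I work in Google asdasb asnlkasn"))  # I work in Google
```

With a clean vocabulary and whitelist the rule does drop the junk tokens; the problem in my pipeline is only that the NER output is unreliable on garbage input.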
I tried this, but it doesn't remove the junk, because NER detects "google asdasb asnlkasn" as a Work_of_Art, or sometimes "asdasb asnlkasn" as a Person.
I had to include NER because words = set(nltk.corpus.words.words()) doesn't contain Google, Microsoft, Apple, or any other named-entity values.
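One workaround I am considering, since NER misfires on the junk, is to skip NER entirely and merge a curated proper-noun list into the dictionary. A sketch, where BASE_WORDS and ORG_NAMES are hypothetical stand-ins (in practice ORG_NAMES could come from a gazetteer or my own user data):

```python
BASE_WORDS = {"i", "work", "in", "at"}        # stand-in for the nltk word list
ORG_NAMES = {"google", "microsoft", "apple"}  # curated company whitelist

vocab = BASE_WORDS | ORG_NAMES

def is_known(token):
    # keep dictionary words, whitelisted names, and punctuation/digits
    return token.lower() in vocab or not token.isalpha()

sent = "I work at Microsoft asdasb"
print(" ".join(w for w in sent.split() if is_known(w)))  # I work at Microsoft
```

This avoids the Work_of_Art/Person misdetections, at the cost of maintaining the whitelist myself.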