I have a bunch of user queries. Some of them contain junk tokens, e.g. "I work in Google asdasb asnlkasn".
I need to keep only "I work in Google".
import nltk
import spacy
import truecase

words = set(nltk.corpus.words.words())  # requires nltk.download('words')
nlp = spacy.load('en_core_web_lg')

def check_ner(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

sent = "I work in google asdasb asnlkasn"
sent = truecase.get_true_case(sent)
ner_list = check_ner(sent)
final_sent = " ".join(
    w for w in nltk.wordpunct_tokenize(sent)
    if w.lower() in words or not w.isalpha() or w in ner_list
)
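To make the keep-or-drop rule itself easy to check without downloading the NLTK corpus or a spaCy model, here is a minimal sketch of the same filter with hardcoded stand-ins. VOCAB and ENTITY_WHITELIST are assumptions for illustration, not the real corpora:

```python
import re

VOCAB = {"i", "work", "in"}       # stand-in for set(nltk.corpus.words.words())
ENTITY_WHITELIST = {"Google"}     # stand-in for the NER hits

def clean_query(sent):
    # rough equivalent of nltk.wordpunct_tokenize
    tokens = re.findall(r"\w+|[^\w\s]", sent)
    kept = [w for w in tokens
            if w.lower() in VOCAB or not w.isalpha() or w in ENTITY_WHITELIST]
    return " ".join(kept)

print(clean_query("I work in Google asdasb asnlkasn"))  # I work in Google
```

With a clean vocabulary and whitelist the rule does drop the junk tokens; the problem in my pipeline is only that the NER output is unreliable on garbage input.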
I tried this, but it doesn't remove the junk, because NER detects "google asdasb asnlkasn" as a Work_of_Art, or sometimes "asdasb asnlkasn" as a Person.
I had to include NER because words = set(nltk.corpus.words.words()) doesn't contain Google, Microsoft, Apple, or any other named-entity values.
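One workaround I am considering, since NER misfires on the junk, is to skip NER entirely and merge a curated proper-noun list into the dictionary. A sketch, where BASE_WORDS and ORG_NAMES are hypothetical stand-ins (in practice ORG_NAMES could come from a gazetteer or my own user data):

```python
BASE_WORDS = {"i", "work", "in", "at"}        # stand-in for the nltk word list
ORG_NAMES = {"google", "microsoft", "apple"}  # curated company whitelist

vocab = BASE_WORDS | ORG_NAMES

def is_known(token):
    # keep dictionary words, whitelisted names, and punctuation/digits
    return token.lower() in vocab or not token.isalpha()

sent = "I work at Microsoft asdasb"
print(" ".join(w for w in sent.split() if is_known(w)))  # I work at Microsoft
```

This avoids the Work_of_Art/Person misdetections, at the cost of maintaining the whitelist myself.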