
I have found some non-English words in my dictionary (from CountVectorizer) that I would like to remove:

verified = {'日本': '19 日本',
 'له': 'إستعداد له',
 'لسنا': 'القادم لسنا',
 'غيتس': 'بيل غيتس',
 'على': 'على إستعداد',
 'بيل': 'بيل غيتس',
 'الوباء': 'الوباء القادم',
 'إستعداد': 'إستعداد له',
 'és': 'koronavírus és',
 'állnak': 'kik állnak',
 'zu': 'könig zu',
 'zero': 'agenda zero'}

My attempt was to use nltk, specifically words:

import nltk
words = set(nltk.corpus.words.words())

not_en_list = [x for x, v in verified.items() if v!='[]' if x not in words]

But when I ran it, nothing changed: the non-English words are still there. Please note that the example above is only a sample of my data: I have thousands of English words and just a few non-English ones, which I would like to delete without copying and pasting a list by hand.
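For context, a comprehension like the one in the attempt only builds a new list; it never modifies `verified`. A minimal sketch of the difference, using a tiny stand-in set in place of `set(nltk.corpus.words.words())`:

```python
# Tiny stand-in for set(nltk.corpus.words.words()) (~236k entries).
words = {"agenda", "zero"}

verified = {'és': 'koronavírus és', 'zero': 'agenda zero'}

# This only *collects* the non-English keys; verified itself is untouched.
not_en_list = [k for k in verified if k not in words]
assert verified == {'és': 'koronavírus és', 'zero': 'agenda zero'}

# To actually delete them, rebuild the dict without those keys:
verified = {k: v for k, v in verified.items() if k not in not_en_list}
assert verified == {'zero': 'agenda zero'}
```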

  • Does this answer your question? [Removing non-English words from text using Python](https://stackoverflow.com/questions/41290028/removing-non-english-words-from-text-using-python) – tbhaxor Oct 11 '20 at 19:15
  • I already applied the answer they proposed there (as you can see in my question and attempt). It has not worked at all in my case –  Oct 11 '20 at 19:16

2 Answers


No changes are applied because you are not modifying any existing data structure: `not_en_list` is created, but `verified` is never touched. Try this instead, and if it doesn't work, please post a minimal working example.

raw = {'日本': '19 日本',
 'له': 'إستعداد له',
 'لسنا': 'القادم لسنا',
 'غيتس': 'بيل غيتس',
 'على': 'على إستعداد',
 'بيل': 'بيل غيتس',
 'الوباء': 'الوباء القادم',
 'إستعداد': 'إستعداد له',
 'és': 'koronavírus és',
 'állnak': 'kik állnak',
 'zu': 'könig zu',
 'zero': 'agenda zero'}

words = set(['zero'])  # toy allow-list; replace with set(nltk.corpus.words.words())
verified = {k: v for k, v in raw.items() if k in words}
assert verified == {'zero': 'agenda zero'}
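To scale this beyond a handful of hand-typed words, the same comprehension can take the full nltk word list as the allow-list. A sketch with a tiny stand-in set (the commented lines show the real corpus to swap in, assuming `nltk.download('words')` has been run once):

```python
# Stand-in allow-list; in practice use:
#   import nltk
#   nltk.download('words')  # one-time download
#   words = {w.lower() for w in nltk.corpus.words.words()}
words = {"agenda", "zero"}

raw = {'és': 'koronavírus és', 'zu': 'könig zu', 'zero': 'agenda zero'}

# Keep only the entries whose key is in the English allow-list.
verified = {k: v for k, v in raw.items() if k.lower() in words}
assert verified == {'zero': 'agenda zero'}
```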
spagh-eddie
  • Thanks spagh-eddie. If I understood correctly, what you did is include `zero` in a list in order to select that item. But what if I had many elements? –  Oct 11 '20 at 19:10
  • `words = set(['zero', 'any', 'other', 'words', 'you', 'would', 'like'])` – spagh-eddie Oct 11 '20 at 19:18
  • I don't think this is doable: if I had thousands of words, I would not be able to list them all. So it would probably be better to use a language-detection approach at this point. Thanks anyway –  Oct 11 '20 at 19:19
  • replace it with `words = set(nltk.corpus.words.words())` – spagh-eddie Oct 12 '20 at 19:16

Maybe this can help you:

import nltk
import ast
# nltk.download('words')  # uncomment if the word list has not been downloaded yet
dict_ = {'日本': '19 日本',
         'له': 'إستعداد له',
         'لسنا': 'القادم لسنا',
         'غيتس': 'بيل غيتس',
         'على': 'على إستعداد',
         'بيل': 'بيل غيتس',
         'الوباء': 'الوباء القادم',
         'إستعداد': 'إستعداد له',
         'és': 'koronavírus és',
         'állnak': 'kik állnak',
         'zu': 'könig zu',
         'zero': 'agenda zero'}

words = set(nltk.corpus.words.words())

# Rebuild the dict as a string, keeping only English or non-alphabetic tokens.
# Joining with ' ' (not '') is essential so the result stays a valid dict literal.
new_string = ' '.join(w for w in nltk.wordpunct_tokenize(str(dict_))
                      if w.lower() in words or not w.isalpha())

new_dic = ast.literal_eval(new_string)
# Non-English keys have been reduced to whitespace; drop them and trim the rest.
new_dic = {k.strip(): v.strip() for k, v in new_dic.items() if k.strip()}
print(new_dic)
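As a simpler variant of the same idea, the dictionary can also be filtered directly, skipping the string round-trip through `str()` and `ast.literal_eval`. A sketch with a stand-in word set (the `is_english` helper is hypothetical, not part of nltk):

```python
words = {"agenda", "zero"}  # stand-in for {w.lower() for w in nltk.corpus.words.words()}

dict_ = {'és': 'koronavírus és', 'állnak': 'kik állnak', 'zero': 'agenda zero'}

def is_english(text, vocab):
    """True if every alphabetic token of text is in vocab (numbers etc. pass through)."""
    return all(t.lower() in vocab or not t.isalpha() for t in text.split())

# Keep an entry only when both its key and its value look English.
new_dic = {k: v for k, v in dict_.items()
           if is_english(k, words) and is_english(v, words)}
assert new_dic == {'zero': 'agenda zero'}
```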
Higs
  • 1
    I have updated it now. Your output should now look like this: {' ': ' 19 ', ' ': ' ', ' ': ' ', ' ': ' ', ' ': ' ', ' ': ' ', ' ': ' ', ' ': ' ', ' ': ' ', ' ': ' ', ' ': ' ', ' zero ': ' agenda zero '}. – Higs Oct 11 '20 at 19:28
  • Hi The_Ark, can you please specify your question? :) – Higs Oct 11 '20 at 19:31
  • All right, now I understand your question. Let me see what I can do. – Higs Oct 11 '20 at 19:48
  • 1
    I have just updated the code. Now you have your desired output.It is important from the language toolkit to download the word list(words) The code is not nicely written now, but you can take care of that :) – Higs Oct 11 '20 at 20:12
  • Thank you so much Higs. –  Oct 12 '20 at 20:58