
I have a list named 'result', shown below:

>>> result
    
[
  [
    ['apple'],
    ['banana'],
    ['green','grapes'],
    nan
  ],
  [
    ['orange'],
    ['hat'],
    ['party','hat','2'],
    nan
  ],
  [
    ['blue'],
    ['navy'],
    ['red','t'],
    ['angry']
  ]
]

I'm using gensim to look up the words I have in a pretrained word2vec model and get the corresponding vectors.

Given that the pretrained_model.key_to_index is structured as below…

>>> pretrained_model.key_to_index
{'</s>': 0,
     'in': 1,
     'for': 2,
     'that': 3,
     'is': 4,
     'on': 5,
     '##': 6,
     'The': 7,
     'with': 8,
     'said': 9,
     'was': 10,
     'the': 11,
     'at': 12,
    ...}

…I used the code below to keep the words within 'result' that are present in the pretrained model named 'pretrained_model' and to filter out the words that are not.

import gensim

pretrained_model = gensim.models.KeyedVectors.load_word2vec_format('Downloads/GoogleNews-vectors-negative300.bin', binary=True) 
    
vocabulary = pretrained_model.key_to_index
        
len(vocabulary)  # =3000000
    
    
documents = []
for x in result:
    document = [i for i in j for j in x if i in pretrained_model.key_to_index]
    documents.append(document)

After this, documents should contain only the words that are present in the pretrained model's vocabulary.

So the desired output documents might look like:

 [[['apple'],['banana'],['green','grapes']],[['orange'],['hat'],['party','hat']],[['blue'],['navy'],['red','t'],['angry']]]

However, the code above raises a NameError, as shown below:

NameError                                 Traceback (most recent call last)
/var/folders/jd/lh_mnln92n17ysb4p01g000gn/T/ipykernel_2855/2806541.py in <module>
      1 documents = []
      2 for x in result:
----> 3     document = [i for i in j for j in x if i in pretrained_model.key_to_index]
      4     documents.append(document)
      5 #now this document have only those words which are present in our model's vocab

NameError: name 'j' is not defined

Can anyone help me with this, please? Any help would be greatly appreciated!

gojomo
Mimieffle
  • `document = [i for j in x for i in j if i in pretrained_model.key_to_index]` – Chris Charley Sep 21 '22 at 18:51
  • Just ran the suggested code and it returns TypeError : 'float' object is not iterable... But thank you for your help though – Mimieffle Sep 21 '22 at 18:57
  • You have a 3 dimensional list - not sure how you would access items. (Also, not sure about the `nan` items) – Chris Charley Sep 21 '22 at 19:09
  • I guess the question was not specific enough, so I've edited the post! Since the nan values are not in 'pretrained_model.key_to_index', I expect the for loops will drop those nan values and store only the ones that match the words within pretrained_model.key_to_index – Mimieffle Sep 21 '22 at 19:24
  • "since the nan values are not in 'pretrained_model.key_to_index', I expect the for loops will drop those nan values" It can't do this, because the `nan` values are at the same "level" of the structure as **lists of** strings, and the code is trying to iterate over those `nan`s as if they were lists, giving the error shown. – Karl Knechtel Sep 22 '22 at 00:20

1 Answer


First, note that your result structure seems a bit weird. (I've manually added 'pretty-print'-like indentation to make its nesting levels clearer.)

It's hard to imagine sensible reasons for so many 1-word lists, and for mixing nan values (tricky pseudo-numbers) into some but not all of the final slots of the 2nd-level lists. It's possible you've got a good reason, but it's odd enough to suggest that the earlier steps that created that result list-of-lists-and-nans may not have been clearly thought out.

So I'd mainly suggest going back to your previous step, and trying to produce a working structure that:

  1. is less nested, because deep nesting & positionally-implied interpretations are always tricky, especially if you're a beginner; and…

  2. doesn't include any harder-to-handle data types like nan mixed in at the same levels as what would otherwise be lists (or strings). (Dealing with nan values should be very rare in beginner code that isn't specifically designed to do math on low-quality inputs.)
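
As a rough illustration of that kind of flatter structure, here is a minimal sketch that assumes each document can simply be a flat list of word strings. The flatten_document helper and the small vocabulary dict are hypothetical stand-ins (the dict plays the role of pretrained_model.key_to_index), and whether losing the inner grouping is acceptable depends on whatever produced result in the first place.

```python
nan = float('nan')  # stand-in for the nan values shown in the question

result = [
    [['apple'], ['banana'], ['green', 'grapes'], nan],
    [['orange'], ['hat'], ['party', 'hat', '2'], nan],
]

# Hypothetical stand-in for pretrained_model.key_to_index (word -> index)
vocabulary = {'apple': 0, 'banana': 1, 'green': 2, 'grapes': 3,
              'orange': 4, 'hat': 5, 'party': 6}


def flatten_document(nested_document):
    """Return one flat list of words, skipping entries that are not lists (e.g. nan)."""
    return [word
            for group in nested_document
            if isinstance(group, list)
            for word in group]


flat_documents = [flatten_document(doc) for doc in result]

# With flat documents, the vocabulary filter is a single simple comprehension
filtered_documents = [[word for word in doc if word in vocabulary]
                      for doc in flat_documents]
```

With one flat list of strings per document there is nothing non-iterable left to trip over, so the membership test against the vocabulary is the only filtering step that remains.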

But if you do need to do what you're explicitly asking (walking through a 3-level-nested structure and dropping some elements), you'd need some fancier looping structures.

And further, you can't blindly treat nan values as if they were the same kind of list objects (and thus iterable) as their sibling-level list elements. You'd need to add a test to only iterate over things that are iterable.
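
One possible shape for such a loop is sketched below, keeping the original 3-level nesting. It assumes the non-list entries are always float nan values (so an isinstance check is enough to skip them) and uses a small hypothetical vocabulary dict in place of pretrained_model.key_to_index; the variable names are illustrative only.

```python
nan = float('nan')  # stand-in for the nan values shown in the question

result = [
    [['apple'], ['banana'], ['green', 'grapes'], nan],
    [['orange'], ['hat'], ['party', 'hat', '2'], nan],
    [['blue'], ['navy'], ['red', 't'], ['angry']],
]

# Hypothetical stand-in for pretrained_model.key_to_index (word -> index)
vocabulary = {word: index for index, word in enumerate(
    ['apple', 'banana', 'green', 'grapes', 'orange', 'hat', 'party',
     'blue', 'navy', 'red', 't', 'angry'])}

documents = []
for document in result:
    kept_groups = []
    for group in document:
        # Only iterate over real lists; nan is a float and is skipped here
        if not isinstance(group, list):
            continue
        kept_words = [word for word in group if word in vocabulary]
        # Design choice (an assumption): drop groups that end up empty
        if kept_words:
            kept_groups.append(kept_words)
    documents.append(kept_groups)
```

The explicit nested loops are used rather than a single comprehension so that the type check and the empty-group decision each sit on their own line, which tends to be easier to adjust if the real data has other kinds of stray values.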

(If after integrating all the comment suggestions & this answer, you're still needing to do the same thing, I'd suggest posting a focused new question on exactly where your latest & best attempt is failing, and with what error.)

gojomo