
I have a dictionary where the keys are words and the values are the vectors for those words. I also have a list of sentences that I want to convert into an array. Currently I get a single array of all the words, but I would like an array of sentences, each made up of its word vectors, so I can feed it into a neural network.

sentences=["For last 8 years life, Galileo house arrest espousing man's theory",
           'No. 2: 1912 Olympian; football star Carlisle Indian School; 6 MLB seasons Reds, Giants & Braves',
           'The city Yuma state record average 4,055 hours sunshine year'.......]    

word_vec={'For': [0.27452874183654785, 0.8040047883987427],
         'last': [-0.6316165924072266, -0.2768899202346802],
         'years': [-0.2496756911277771, 1.243837594985962],
         'life,': [-0.9836481809616089, -0.9561406373977661].....}   

I want to convert the above sentences into vectors of their corresponding words from the dictionary.

user2622016
Thanos
    Could you edit this question to include the proper expected output (you say you want an array of sentences with word vectors, but it's unclear what you mean) and what you have tried so far, in accordance with the guidelines here: [MCVE](https://stackoverflow.com/help/mcve). – Nevus Apr 15 '19 at 08:20

2 Answers


Try this:

def sentence_to_list(sentence, words_dict):
    return [w for w in sentence.split() if w in words_dict]

With only the four dictionary entries shown in the question, the first sentence in your example is converted to:

['For', 'last', 'years', 'life,']  # 'life,' keeps its comma; words not in the dictionary are dropped

Update.

I guess you need to remove punctuation characters. There are several ways to split a string on multiple delimiter characters; see this answer: Split Strings into words with multiple word boundary delimiters
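As a rough sketch of that idea, the function above could split on runs of whitespace and punctuation instead of whitespace only. The delimiter set, the name sentence_to_list_no_punct, and the toy dictionary are assumptions for illustration. Note that the dictionary keys then need to be punctuation-free as well; the sample in the question has a key 'life,' that would no longer match.

```python
import re

# Assumed delimiter set: whitespace plus some common punctuation; adjust as needed.
_DELIMITERS = re.compile(r"[\s,.;:!?&]+")


def sentence_to_list_no_punct(sentence, words_dict):
    """Split on whitespace/punctuation and keep only words found in words_dict."""
    tokens = [t for t in _DELIMITERS.split(sentence) if t]  # drop empty strings
    return [t for t in tokens if t in words_dict]


# Hypothetical dictionary whose keys carry no punctuation (values are rounded).
words_dict = {
    "For": [0.27, 0.80],
    "last": [-0.63, -0.28],
    "years": [-0.25, 1.24],
    "life": [-0.98, -0.96],
}

result = sentence_to_list_no_punct(
    "For last 8 years life, Galileo house arrest espousing man's theory",
    words_dict,
)
print(result)
```

The apostrophe is deliberately left out of the delimiter set so that a token such as man's stays in one piece.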

lesnik

This builds vectors, a list containing one list of word vectors per sentence:

vectors = []
for sentence in sentences:
    sentence_vec = [word_vec[word] for word in sentence.split() if word in word_vec]
    vectors.append(sentence_vec)

If you want to omit punctuation (,.: etc.), use re.findall (requires import re) instead of .split:

import re
words = re.findall(r"[\w']+", sentence)  # runs of word characters and apostrophes
sentence_vec = [word_vec[word] for word in words if word in word_vec]

If you don't want to skip words not available in word_vec, use:

sentence_vec = [ word_vec[word] if word in word_vec else [0,0] for word in words ]

This substitutes [0, 0] for each missing word.
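The pieces above could be combined along these lines. This is only a sketch: the helper name sentences_to_vectors, the toy dictionary, and the padding step are assumptions that go beyond the snippets above. It reads the vector size from the dictionary rather than hard-coding [0, 0], and optionally pads shorter sentences with zero vectors in case a rectangular array is needed.

```python
import re


def sentences_to_vectors(sentences, word_vec, pad=True):
    """Return one list of word vectors per sentence (hypothetical helper)."""
    # Vector size is taken from the first dictionary entry; word_vec must be non-empty.
    dim = len(next(iter(word_vec.values())))
    zero = [0.0] * dim
    vectors = []
    for sentence in sentences:
        words = re.findall(r"[\w']+", sentence)
        # Unknown words get a zero vector; copy so dictionary entries are never shared.
        vectors.append([list(word_vec.get(word, zero)) for word in words])
    if pad and vectors:
        longest = max(len(v) for v in vectors)
        for v in vectors:
            # Append zero vectors until every sentence has the same length.
            v.extend(list(zero) for _ in range(longest - len(v)))
    return vectors


word_vec = {"For": [0.27, 0.80], "last": [-0.63, -0.28]}  # toy values
vectors = sentences_to_vectors(["For last 8 years", "last"], word_vec)
print(vectors)
```

With padding enabled the result is rectangular, so passing it to numpy.array(vectors) should give an array of shape (number of sentences, longest sentence length, vector size).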

user2622016