2

I have this piece of code:

import gensim
import random


file = open('../../../dataset/output/interaction_jobroles_titles_tags.txt')

read_data = file.read()

data = read_data.split('\n')

sentences = [line.split() for line in data]
print(len(sentences))
print(sentences[1])

model = gensim.models.Word2Vec(min_count=1, window=10, size=300, negative=5)
model.build_vocab(sentences)

for epoch in range(5):
    shuffled_sentences = random.shuffle(sentences)
    model.train(shuffled_sentences)
    print(epoch)
    print(model)

model.save("../../../dataset/output/wordvectors_jobroles_titles_300d_10w_wordshuffling" + '.model')

If I print a single sentence, the output looks something like this:

['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

What I need is to shuffle the words before training and then save the model.

I am not sure whether I am coding it the right way. I end up with this exception:

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.5/site-packages/gensim/models/word2vec.py", line 747, in job_producer
    for sent_idx, sentence in enumerate(sentences):
  File "/usr/local/lib/python3.5/site-packages/gensim/utils.py", line 668, in __iter__
    for document in self.corpus:
TypeError: 'NoneType' object is not iterable

I would like to ask how I can shuffle the words.

lmo
ssh26

2 Answers

0

random.shuffle shuffles the list in place and returns None. That is why shuffled_sentences is None after this call, and passing it to model.train raises the TypeError.
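A minimal sketch of that behaviour, using a small made-up list in place of the real data (the tokens here are placeholders, not values from the dataset):

```python
import random

# Placeholder sentences standing in for the real corpus.
sentences = [["JO_1", "TI_1"], ["JO_2", "TA_2"], ["JO_3", "TA_3"]]

# random.shuffle reorders the list in place; its return value is None.
result = random.shuffle(sentences)
print(result)          # None
print(len(sentences))  # the list itself is still there, only reordered

# If a shuffled copy is preferred, random.sample returns a new list
# and leaves the original order untouched.
shuffled_copy = random.sample(sentences, len(sentences))
```

So the loop should call random.shuffle(sentences) on its own line and then train on sentences, rather than on the return value.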

PKuhn
0
model.build_vocab(sentences)
sentences_list = sentences
Idx = range(len(sentences_list))
print(Idx)
for epoch in range(5):
    random.shuffle(sentences)
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)
    print(epoch)
    print(model)
model.save("somefile.model")

This solves my problem.

But how can I shuffle the individual words in a sentence?

Sentence: ['JO_3787672', 'JO_272304', 'JO_2027410', 'TI_2969041', 'TI_2509936', 'TA_954638', 'TA_4321623', 'TA_339347', 'TA_272304', 'TA_3017535', 'TA_494116', 'TA_798840']

My objective is this: if I look up the most similar words for, say, 'JO_3787672', the model always returns words starting with 'JO_', while words starting with 'TA_' and 'TI_' get very low similarity scores. I suspect this is because of the position of the words in the data, though I am not sure. That is why I am trying to shuffle the words within each sentence, although I do not know whether it will help.
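One possible way to shuffle the words inside each sentence is sketched below. The helper name shuffle_words and its seed parameter are made up for illustration, and whether word-level shuffling actually improves the similarity scores is not established here:

```python
import random


def shuffle_words(sentences, seed=None):
    """Return a new list of sentences with the words of each one shuffled.

    `shuffle_words` is a hypothetical helper, not part of gensim.
    The input sentences are left unmodified.
    """
    rng = random.Random(seed)
    shuffled = []
    for sentence in sentences:
        words = list(sentence)  # copy, so the original order is preserved
        rng.shuffle(words)      # in-place shuffle of the copy
        shuffled.append(words)
    return shuffled


# Placeholder data in the same shape as the real sentences.
example = [
    ["JO_1", "JO_2", "TI_1", "TA_1", "TA_2"],
    ["JO_3", "TI_2", "TA_3"],
]
mixed = shuffle_words(example, seed=0)
print(mixed)
```

The result can be passed to model.train in place of the sentence list, calling shuffle_words again in each epoch if a fresh word order per epoch is wanted.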

ssh26
  • Word2Vec is designed to determine similarity between words from word order, or 'context'. What you're looking for is probably a bag-of-words approach. – Swier Jul 28 '17 at 13:31