I am trying to extract the documents vector to feed into a regression model for prediction.
I have fed around 1 400 000 of labelled sentences into doc2vec for training, however I was only able to retrieve only 10 vectors using model.docvecs.
This is a snapshot of the labelled sentences I used to trained the doc2vec model:
In : documents[0]
Out: TaggedDocument(words=['descript', 'yet'], tags='0')
In : documents[-1]
Out: TaggedDocument(words=['new', 'tag', 'red', 'sparkl', 'firm', 'price', 'free', 'ship'], tags='1482534')
These are the code used to train the doc2vec model
model = gensim.models.Doc2Vec(min_count=1, window=5, size=100, sample=1e-4, negative=5, workers=4)
model.build_vocab(documents)
model.train(documents, total_examples =len(documents), epochs=1)
This is the dimension of the documents vectors:
In : model.docvecs.doctag_syn0.shape
Out: (10, 100)
On which part of the code did I mess up?
Update:
Adding on to the comment from sophros, it appear that i have made a mistake when I am creating the TaggedDocument prior to training which resulted in 1.4 mil Documents appearing as 10 Documents.
Courtesy of Irene Li on your tutorial on Doc2vec, I have made some slightly edit to the class she used to generate TaggedDocument
def get_doc(data):
tokenizer = RegexpTokenizer(r'\w+')
en_stop = stopwords.words('english')
p_stemmer = PorterStemmer()
taggeddoc = []
texts = []
for index,i in enumerate(data):
# for tagged doc
wordslist = []
tagslist = []
i = str(i)
# clean and tokenize document string
raw = i.lower()
tokens = tokenizer.tokenize(raw)
# remove stop words from tokens
stopped_tokens = [i for i in tokens if not i in en_stop]
# remove numbers
number_tokens = [re.sub(r'[\d]', ' ', i) for i in stopped_tokens]
number_tokens = ' '.join(number_tokens).split()
# stem tokens
stemmed_tokens = [p_stemmer.stem(i) for i in number_tokens]
# remove empty
length_tokens = [i for i in stemmed_tokens if len(i) > 1]
# add tokens to list
texts.append(length_tokens)
td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),str(index))
taggeddoc.append(td)
return taggeddoc
The mistake was fixed when I made the change from
td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),str(index))
to this
td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),[str(index)])
It appear that the index of the TaggedDocument must be in the form of the list for TaggedDocument to work properly. For more details as to why, please refer to this answer by gojomo.