
I am trying to extract the document vectors to feed into a regression model for prediction.

I fed around 1,400,000 labelled sentences into Doc2Vec for training; however, I was only able to retrieve 10 vectors using model.docvecs.
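For context, this is roughly how I plan to pull the trained vectors into a feature matrix for the regression model afterwards (just a sketch, assuming the gensim 3.x model.docvecs lookup and one string tag per document):

import numpy as np

# one row per document, looked up by the same string tags used at training time
X = np.vstack([model.docvecs[str(i)] for i in range(len(documents))])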

This is a snapshot of the labelled sentences I used to train the doc2vec model:

In : documents[0]

Out: TaggedDocument(words=['descript', 'yet'], tags='0')

In : documents[-1]

Out: TaggedDocument(words=['new', 'tag', 'red', 'sparkl', 'firm', 'price', 'free', 'ship'], tags='1482534')

This is the code used to train the doc2vec model:

model = gensim.models.Doc2Vec(min_count=1, window=5, size=100, sample=1e-4, negative=5, workers=4)
model.build_vocab(documents)
model.train(documents, total_examples=len(documents), epochs=1)

This is the dimension of the documents vectors:

In : model.docvecs.doctag_syn0.shape
Out: (10, 100)

Which part of the code did I mess up?

Update:

Adding on to the comment from sophros, it appears that I made a mistake when creating the TaggedDocument objects prior to training, which resulted in the 1.4 million documents appearing as only 10 documents.

Courtesy of Irene Li and her tutorial on Doc2vec, I made some slight edits to the function she used to generate the TaggedDocument objects:

import re
import gensim
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def get_doc(data):
    tokenizer = RegexpTokenizer(r'\w+')
    en_stop = stopwords.words('english')
    p_stemmer = PorterStemmer()

    taggeddoc = []
    texts = []

    for index, i in enumerate(data):
        i = str(i)
        # clean and tokenize document string
        raw = i.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [t for t in tokens if t not in en_stop]
        # remove numbers
        number_tokens = [re.sub(r'[\d]', ' ', t) for t in stopped_tokens]
        number_tokens = ' '.join(number_tokens).split()
        # stem tokens
        stemmed_tokens = [p_stemmer.stem(t) for t in number_tokens]
        # drop single-character tokens
        length_tokens = [t for t in stemmed_tokens if len(t) > 1]
        # keep the cleaned tokens for reference
        texts.append(length_tokens)

        # NOTE: this is the buggy version -- the tag is passed as a plain string
        td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(), str(index))
        taggeddoc.append(td)

    return taggeddoc
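The function is then called on the raw text iterable, for example (raw_sentences is just a placeholder name here):

documents = get_doc(raw_sentences)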

The mistake was fixed when I made the change from

td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),str(index))

to this

td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(),[str(index)])

It appears that the tags of a TaggedDocument must be provided as a list for TaggedDocument to work properly. For more details as to why, please refer to the answer by gojomo below.
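After regenerating the TaggedDocument objects with list-form tags and retraining, a quick sanity check confirms there is now one vector per document (a sketch, assuming the same gensim 3.x attributes):

# sanity checks after the fix (expected values)
len(model.docvecs)                   # should be 1482535, not 10
model.docvecs.doctag_syn0.shape      # should be (1482535, 100)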

  • How many documents do you have? Are you sure you do not have 1.4m sentences across 10 documents only? – sophros Jan 23 '18 at 13:41

1 Answer


The gist of the error was: the tags for each individual TaggedDocument were being provided as plain strings, like '101' or '456'.

But, tags should be a list-of-separate tags. By providing a simple string, it was treated as a list-of-characters. So '101' would become ['1', '0', '1'], and '456' would become ['4', '5', '6'].

Across any number of TaggedDocument objects, there were thus only 10 unique tags, single digits ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']. Every document just caused some subset of those tags to be trained.

Correcting tags to be a list-of-one-tag, e.g. ['101'], allows '101' to be seen as the actual tag.
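A minimal illustration of the difference (the tag values here are just examples):

from gensim.models.doc2vec import TaggedDocument

# plain string: iterated character-by-character, so the effective tags are '1', '0', '1'
TaggedDocument(words=['some', 'words'], tags='101')

# list with one string: '101' is treated as a single, distinct tag
TaggedDocument(words=['some', 'words'], tags=['101'])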

gojomo