Testing a doc2vec model on the entire test corpus

Question

I am following the doc2vec tutorial, but instead of testing on one random document from the test corpus, I want to test on the entire test corpus.

I just modified the following line:

code:

# Pick a random document from the test corpus and infer a vector from the model

#doc_id = random.randint(0, len(test_corpus) - 1)
doc_id = [index for index, text in enumerate(test_corpus)]

inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Error:

TypeError: list indices must be integers or slices, not list

score 1 · Answer 1 · answered Sep 07 '19 at 14:38

In the original tutorial, test_corpus is a list of lists, the doc_id is a single random integer, and so in the statement

inferred_vector = model.infer_vector(test_corpus[doc_id])

the argument test_corpus[doc_id] is a single document (a list of tokens), selected from the list by the integer index doc_id.

In your modified version,

doc_id = [index for index, text in enumerate(test_corpus)]

will produce a list of integers, not a single integer.

In test_corpus[doc_id], doc_id is therefore a list, and you are attempting to index a list with a list.

That leads to exactly the error you see:

TypeError: list indices must be integers or slices, not list
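The difference can be reproduced without gensim. The following is a minimal sketch that uses a small made-up test_corpus (the tokens are placeholders, not the tutorial's data) to show that an integer index works while a list of indices raises the error above:

```python
# Toy stand-in for the tutorial's test_corpus: a list of token lists.
# The contents are invented for illustration only.
test_corpus = [["human", "interface"], ["graph", "trees"], ["system", "user"]]

# Original tutorial style: a single integer index selects one document.
doc_id = 1
single_doc = test_corpus[doc_id]

# Modified version from the question: this builds a list of all indices.
doc_ids = [index for index, text in enumerate(test_corpus)]

# Indexing a list with a list is not supported, so this raises TypeError.
try:
    test_corpus[doc_ids]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(single_doc)
print(doc_ids)
print(error_message)
```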

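To do what you want, you could convert the list of lists test_corpus to a single long list, as shown in the old answer "How to make a flat list out of list of lists", and pass that to infer_vector. Note that this merges all test documents into one token list, so you get a single inferred vector for the whole corpus. If you want a result for each test document, loop over test_corpus and call infer_vector once per document instead.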
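As an alternative to flattening, if the goal is to score each test document separately, one option is to loop over test_corpus and keep the integer index. The sketch below is only an illustration: most_similar_per_doc, infer and most_similar are hypothetical names, with the two callables standing in for the model calls from the question so the loop can be shown without a trained model.

```python
from typing import Callable, List, Sequence, Tuple


def most_similar_per_doc(
    test_corpus: Sequence[List[str]],
    infer: Callable[[List[str]], object],
    most_similar: Callable[[object], List[Tuple[int, float]]],
) -> List[Tuple[int, List[Tuple[int, float]]]]:
    """Return (doc_id, sims) for every document in test_corpus.

    infer and most_similar are hypothetical stand-ins for the model calls,
    e.g. model.infer_vector and a wrapper around model.docvecs.most_similar.
    """
    results = []
    for doc_id, tokens in enumerate(test_corpus):
        # Each document is handled on its own, so doc_id stays an integer.
        inferred_vector = infer(tokens)
        results.append((doc_id, most_similar(inferred_vector)))
    return results


# Assumed usage with the model from the question (not executed here):
# results = most_similar_per_doc(
#     test_corpus,
#     model.infer_vector,
#     lambda vec: model.docvecs.most_similar([vec], topn=len(model.docvecs)),
# )
```

Each entry of the returned list pairs a test document's index with its similarity list, so the MOST/MEDIAN/LEAST printing from the question can be run inside a loop over the results.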
