I am using gensim to train a Doc2Vec model on documents assigned to particular people. There are 10 million documents and 8,000 people. I don't care about all 8,000 people. I care about a specific group of people (say anywhere from 1 to 500).
The people I'm interested in could change day-to-day, but I will never need to look at the full population. The end goal is to obtain the vectors for the people I am interested in. Currently, I retrain the model each time on only the documents assigned to those specific people.
Should I train the model on all 10 million documents? Or should I train the model on only the documents assigned to the people I'm interested in? If it's important to train it on all 10 million documents, how would I then get the vectors only for the people I'm interested in?