
I am using gensim to train a Doc2Vec model on documents assigned to particular people. There are 10 million documents and 8,000 people. I don't care about all 8,000 people. I care about a specific group of people (say anywhere from 1 to 500).

The people I'm interested in could change day-to-day, but I will never need to look at the full population. The end goal is to have the resulting vectors of the people I am interested in. I am currently training the model each time on the documents assigned to the specific people.

Should I train the model on all 10 million documents? Or should I train the model on only the documents assigned to the people I'm interested in? If it's important to train it on all 10 million documents, how would I then get the vectors only for the people I'm interested in?

OverflowingTheGlass
  • This totally depends on what you want to do with those vectors. Do you want to predict a person given a vector? – vumaasha Feb 23 '18 at 13:42
  • No, I just want to take the vectors of the specified people and feed them into TensorBoard for high-dimensionality visualization, to look at the distances between the vectors (i.e. natural clusters). – OverflowingTheGlass Feb 23 '18 at 13:44
  • Do you need one vector per person, or one vector per document? – vumaasha Feb 23 '18 at 13:45
  • One vector per document. So say I have 10 people I want to look at on a given day, and they have 20,000 documents between them. I need 20,000 vectors, which will then be fed into TensorBoard and filtered on the front end so I am only looking at the vectors for a particular person. – OverflowingTheGlass Feb 23 '18 at 13:48
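The TensorBoard workflow described in the comments can be sketched as follows. This is a minimal, hypothetical example (the `doc_vectors` and `doc_author` dicts are made-up placeholders standing in for real model output) of writing the standalone `vectors.tsv`/`metadata.tsv` pair that the TensorBoard Embedding Projector accepts: one tab-separated vector per line, plus a metadata file whose header names the columns.

```python
# Hypothetical sketch: export per-document vectors for a chosen set of people
# into the TSV format the TensorBoard Embedding Projector loads.
# `doc_vectors` maps document id -> vector; `doc_author` maps document id ->
# person. Both are toy stand-ins for real model output.

doc_vectors = {"doc1": [0.1, 0.2], "doc2": [0.3, 0.4], "doc3": [0.5, 0.6]}
doc_author = {"doc1": "alice", "doc2": "bob", "doc3": "carol"}
wanted_authors = {"alice", "bob"}  # the people of interest on a given day

with open("vectors.tsv", "w") as vec_f, open("metadata.tsv", "w") as meta_f:
    # metadata.tsv gets a header row because it has more than one column
    meta_f.write("doc_id\tauthor\n")
    for doc_id, vec in doc_vectors.items():
        if doc_author[doc_id] in wanted_authors:
            vec_f.write("\t".join(str(x) for x in vec) + "\n")
            meta_f.write(f"{doc_id}\t{doc_author[doc_id]}\n")
```

Filtering at export time, as here, keeps the projector responsive; alternatively you can export everything and filter in the projector's UI.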

1 Answer


It is a good idea to train on all 10 million documents: that will help you capture the general sense of the words across the whole corpus, not just within the context of the authors you are interested in. It also means you won't have to retrain if the set of authors you care about changes tomorrow.

If Doc2Vec takes too long to train, you could instead use FastText to learn word embeddings and build each document vector as a simple average (or a TF-IDF-weighted average) of its word vectors. You can also leverage FastText's hierarchical softmax loss, which can reduce training time dramatically.
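The averaging idea above can be sketched without FastText itself: given any table of pretrained word vectors (here a toy dict standing in for FastText output), the document vector is just the mean of the vectors of its known words.

```python
# Sketch: build a fixed-dimension document vector by averaging word vectors.
# `word_vecs` is a toy stand-in for pretrained FastText embeddings.
import numpy as np

word_vecs = {
    "quick": np.array([1.0, 0.0]),
    "brown": np.array([0.0, 1.0]),
    "fox":   np.array([1.0, 1.0]),
}

def doc_vector(tokens, vecs):
    """Mean of the vectors of all in-vocabulary tokens; zeros if none match."""
    known = [vecs[t] for t in tokens if t in vecs]
    if not known:
        return np.zeros(next(iter(vecs.values())).shape)
    return np.mean(known, axis=0)

v = doc_vector("the quick brown fox".split(), word_vecs)
```

Note that this always yields a vector of the same dimension as the word embeddings, which answers the fixed-dimension concern raised in the comments below; a TF-IDF-weighted variant would replace the plain mean with a weighted one.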

vumaasha
  • How would I then get the vectors of the authors I'm interested in? Infer vectors? Or is there some way to just extract the vectors that are already created? Also, the FastText method wouldn't provide the same type of fixed-dimension vector, correct? – OverflowingTheGlass Feb 23 '18 at 13:56
  • Get the documents corresponding to the author, tokenize them, look up the pretrained word vectors, and average the vectors of all the words in each document. – vumaasha Feb 23 '18 at 13:58
  • If you use Doc2Vec, you get one vector for each document directly, though. Using the document id, you can look up the author. – vumaasha Feb 23 '18 at 13:59
  • Right, so training on 10M documents provides 10M vectors, one per document. I know how to access a vector by its index, `model.docvecs.vectors_docs[i]`, or access all vectors with `model.docvecs.vectors_docs`. But how would I access vectors for a given author, or a set of given authors? – OverflowingTheGlass Feb 23 '18 at 14:01
  • If I only care about a specific group of authors, is there any downside to training just on that group (outside of the annoyance of retraining each time my set of authors changes)? Intuitively, it seems like it would be good to only learn the essence of words in the context of an author set, if I only care about that author set. – OverflowingTheGlass Feb 23 '18 at 14:02
  • You need to maintain a separate index from document to author, like `document_id, author_id`. You prepare this while preparing the training data; use it to filter the documents you need. – vumaasha Feb 23 '18 at 14:02
  • If you only train with a group of documents, you don't capture the general essence of the words outside those authors. Some words may not get a good representation, since their frequency of occurrence will be comparatively low. – vumaasha Feb 23 '18 at 14:05
  • That makes sense. Thank you very much for all your help! Marking your answer as correct. Do you happen to know of any resources on creating a separate index to easily query the trained vectors? – OverflowingTheGlass Feb 23 '18 at 14:08
  • A normal B-tree index in any RDBMS should be good enough to maintain the document-author index. – vumaasha Feb 23 '18 at 14:10
  • Ah, so are the vectors Doc2Vec spits out guaranteed to be in the same order they went in? – OverflowingTheGlass Feb 23 '18 at 14:12
  • I am not sure about your last question; I haven't played around with gensim. Try this link, though: https://stackoverflow.com/questions/31321209/doc2vec-how-to-get-document-vectors – vumaasha Feb 23 '18 at 14:19
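The document-to-author index suggested in the comments can be sketched with `sqlite3` from the standard library (any RDBMS would do; the default index in SQLite, as in most databases, is a B-tree). Table and column names here are illustrative.

```python
# Sketch: a document_id -> author_id lookup table, used to select which
# document vectors to pull for a given set of authors.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_author (doc_id TEXT PRIMARY KEY, author_id TEXT)")
# Secondary index so lookups by author don't scan the whole table
conn.execute("CREATE INDEX idx_author ON doc_author(author_id)")
conn.executemany(
    "INSERT INTO doc_author VALUES (?, ?)",
    [("doc1", "alice"), ("doc2", "bob"), ("doc3", "alice")],
)

# Fetch the document ids for the authors of interest:
rows = conn.execute(
    "SELECT doc_id FROM doc_author WHERE author_id IN (?, ?)",
    ("alice", "bob"),
).fetchall()
doc_ids = [r[0] for r in rows]
```

The returned `doc_ids` would then be used as tags to look up vectors in the trained model, so nothing depends on the order in which documents entered training.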