GridSearch for doc2vec model built using gensim

Question

I am trying to find the best hyperparameters for my gensim doc2vec model, which takes a document as input and creates a document embedding for it. My training data consists of text documents without any labels, i.e. I have only 'X' and no 'y'.

I found some related questions here, but all of the proposed solutions are for supervised models; none are for unsupervised models like mine.

Here is the code where I am training my doc2vec model:

def train_doc2vec(
    self,
    X: List[List[str]],
    epochs: int=10,
    learning_rate: float=0.0002) -> gensim.models.doc2vec:

    tagged_documents = list()

    for idx, w in enumerate(X):
        td = TaggedDocument(to_unicode(str.encode(' '.join(w))).split(), [str(idx)])
        tagged_documents.append(td)

    model = Doc2Vec(**self.params_doc2vec)
    model.build_vocab(tagged_documents)

    for epoch in range(epochs):
        model.train(tagged_documents,
                    total_examples=model.corpus_count,
                    epochs=model.epochs)
        # decrease the learning rate
        model.alpha -= learning_rate
        # fix the learning rate, no decay
        model.min_alpha = model.alpha

    return model

I need suggestions on how to find the best hyperparameters for my model, either with GridSearch or with some other technique. Any help is much appreciated.

Your loop calling `train()` multiple times is very broken, and will only get more broken once you start trying different combinations of `epochs`, `alpha`, and `learning_rate`. Where did you copy this logic from? — gojomo, Oct 19 '18 at 03:59
Got it from my friends github repository. This model gives me 75% train accuracy. What else do you suggest ? How can I make this less broken ? And how can I tune Parameters? — Rajat, Oct 19 '18 at 05:50
@gojomo i tried to remove the for loop and train model without it but I got a very bad accuracy (55%), but with that loop(running 10 time) I am getting 75%. — Rajat, Oct 19 '18 at 11:06
Then your friends' github repo has a serious flaw & shouldn't be used as a model. Can you ask them where they got it? Call `train()` only once, with your desired number of `epochs`. The current code is a mess that (among other things) is actually doing 10*10 training passes and sends the learning-rate all-over-the-place (down and up again) during training. If it's helping, it's pure dumb luck – and something like (possibly) just using 100 `epochs` in non-broken code would do better. — gojomo, Oct 19 '18 at 15:33
@gojomo I removed the loop as you suggested and I am getting 74% accuracy with 40 epochs (65% with 100 epochs). I also tried different combinations of parameters, but 74% is the maximum I have seen so far. Also, I have 230 training documents. Is there any way I can increase the accuracy of my model? — Rajat, Oct 22 '18 at 08:11
Your dataset is tiny - 1/100th to 1/20,000th the size of the smallest datasets used in the original 'Paragraph Vector' papers. So mainly: get more data, or use other algorithms which aren't as data-hungry. It's also unclear what 'accuracy' you're talking about, as you haven't described your end-task. — gojomo, Oct 22 '18 at 12:50
But, if more training (epochs) is hurting, that strongly suggests your model is 'overfitting': the model is too large for your data, and thus is essentially 'memorizing' the idiosyncrasies of your data to meet its training goals, and is thus becoming less useful/general for other tasks. Get more data, or shrink the model, for example by using a smaller vector-size dimensionality and/or a higher word `min_count` to discard more rare words. — gojomo, Oct 22 '18 at 12:51
My goal is to identify the tags present in a given document (or paragraph). A tag represents the kind of data present in the paragraph: if the paragraph contains info related to security, its tag should be "security". Similarly, I have 20 tags such as "payment", "data usage" etc. Once I train the doc2vec model, I give it unseen paragraphs and it generates doc vectors for them. Then I use a nearest-neighbor approach (finding the paragraph in the train set that is closest to the unseen paragraph by cosine similarity) to find the tags associated with the unseen paragraph. — Rajat, Oct 22 '18 at 13:04
Then I measure the accuracy by checking whether the predicted tags of the unseen paragraph are correct. I am using vector_size=10 and min_count=2. — Rajat, Oct 22 '18 at 13:06
OK, you're doing classification into single categories, with 20 categories, by a K-Nearest-Neighbors (KNN) classifier, with k=1. (That is, just assuming the single "nearest" known-class doc is the best indicator of an unknown doc's category.) With just 230 docs, and 20 categories, you've got less than 12 examples per class on average. If you want to do better, you'll likely need a lot more data. You might want to try other classifiers. (Note that with more known-docs, KNN becomes more expensive.) Good luck! — gojomo, Oct 22 '18 at 13:14
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/182269/discussion-between-fateh-and-gojomo). — Rajat, Oct 22 '18 at 13:16
https://stackoverflow.com/questions/50278744/pipelline-and-gridsearch-for-doc2vec?rq=1 — Venkatachalam, Oct 29 '18 at 05:50
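
The fix suggested in the comments above (call `train()` once with the desired number of `epochs`, and leave the learning-rate decay to gensim) might look roughly like the sketch below. It assumes gensim is installed; the standalone function signature and the keyword arguments passed through `**params` are illustrative choices, not the asker's actual class method.

```python
from typing import List


def train_doc2vec(docs: List[List[str]], epochs: int = 40, **params):
    """Train a Doc2Vec model with a single train() call.

    `docs` is a list of tokenized documents; `params` are passed straight to
    the Doc2Vec constructor (e.g. vector_size, min_count).
    """
    # Imported inside the function so this sketch can be loaded
    # even where gensim is not installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # One tag per document, using its index as a string (as in the question).
    tagged = [TaggedDocument(words=words, tags=[str(idx)])
              for idx, words in enumerate(docs)]

    model = Doc2Vec(epochs=epochs, **params)
    model.build_vocab(tagged)

    # A single call: gensim runs `model.epochs` passes and decays the
    # learning rate from alpha to min_alpha on its own.
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

For example, `train_doc2vec(X, epochs=40, vector_size=10, min_count=2)` would correspond to the settings mentioned in the comments.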

score 3 · Answer 1 · edited Feb 09 '21 at 14:43

Regardless of the correctness of the code, I will try to answer your question about how to tune hyper-parameters. Start by defining the sets of hyper-parameter values that make up your grid search. For each set of hyper-parameters

Hset1=(par1Value1,par2Value1,...,parNValue1)

you train your model on the training set and use an independent validation set to measure your accuracy (or whatever metric you wish to use). You store this value (e.g. A_Hset1). When you have done this for all possible sets of hyper-parameters, you will have a set of measures

(A_Hset1,A_Hset2,A_Hset3...A_HsetK).

Each of those measures tells you how good your model is for the corresponding set of hyper-parameters, so your optimal set of hyper-parameters is

H_setOptimal = HsetX | A_HsetX = max(A_Hset1,A_Hset2,A_Hset3...A_HsetK)

For a fair comparison, always train the model on the same data and always use the same validation set.
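
As a rough illustration of "same data, same validation set", the sketch below fixes the split with a seed and scores a set of document vectors with the 1-nearest-neighbor, cosine-similarity tagging described in the question's comments. The function names, the 20% validation fraction, and the plain-list vector representation are all assumptions made for the example.

```python
import math
import random
from typing import List, Sequence, Tuple


def fixed_split(n_items: int, valid_fraction: float = 0.2,
                seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_idx, valid_idx); the same seed always gives the same split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_valid = max(1, int(n_items * valid_fraction))
    return indices[n_valid:], indices[:n_valid]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor_accuracy(train_vecs: Sequence[Sequence[float]],
                              train_labels: Sequence[str],
                              valid_vecs: Sequence[Sequence[float]],
                              valid_labels: Sequence[str]) -> float:
    """Fraction of validation docs whose most similar training doc has the same tag."""
    correct = 0
    for vec, label in zip(valid_vecs, valid_labels):
        # Index of the training vector with the highest cosine similarity.
        best = max(range(len(train_vecs)),
                   key=lambda i: cosine(vec, train_vecs[i]))
        if train_labels[best] == label:
            correct += 1
    return correct / len(valid_labels)
```

Computing the split once and reusing the same indices for every hyper-parameter set keeps the scores comparable with one another.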

I'm not an advanced Python user, so you can probably find better suggestions elsewhere, but what I would do is create a list of dictionaries, where each dictionary contains one set of hyper-parameters that you want to test:

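# Placeholder names and values; add further parameters to each dictionary as needed.
grid_search = [{"par1": "val1", "par2": "val1", "par3": "val1", "res": ""},
               {"par1": "val2", "par2": "val1", "par3": "val1", "res": ""},
               {"par1": "val3", "par2": "val1", "par3": "val1", "res": ""},
               # ... one dictionary for each remaining combination ...
               {"par1": "valn", "par2": "valn", "par3": "valn", "res": ""}]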

That way you can store each result in the "res" field of the corresponding dictionary and track the performance of each set of parameters.

for params in grid_search:
    # Insert here your training and accuracy evaluation using the
    # parameters in params, then store the resulting score
    # (the_accuracy_for_params is a placeholder for that score).
    params["res"] = the_accuracy_for_params
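
Writing every combination out by hand gets tedious, so one possible variation is to build the grid with `itertools.product` and pass in the scoring step as a function. This is only a sketch: `grid_search` and `score_fn` are hypothetical names, and `score_fn` stands in for whatever training-plus-validation routine you use.

```python
from itertools import product
from typing import Any, Callable, Dict, List


def grid_search(param_grid: Dict[str, List[Any]],
                score_fn: Callable[[Dict[str, Any]], float]) -> List[Dict[str, Any]]:
    """Score every combination in param_grid; return results sorted best first.

    score_fn receives one dictionary of hyper-parameters, trains a model on the
    fixed training set, and returns its score on the fixed validation set.
    """
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append({"params": params, "res": score_fn(params)})
    # Highest score first, so results[0] holds the best combination found.
    return sorted(results, key=lambda r: r["res"], reverse=True)
```

A grid such as `{"vector_size": [10, 50], "min_count": [1, 2], "epochs": [20, 40]}` would yield eight combinations; the values here are only examples.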

I hope it helps.
