
I am using a pre-trained BERT sentence transformer model, as described here https://www.sbert.net/docs/training/overview.html , to get embeddings for sentences.

I want to fine-tune these pre-trained embeddings, and I am following the instructions in the tutorial I have linked above. According to the tutorial, you fine-tune the pre-trained model by feeding it sentence pairs along with a label that indicates the similarity score between the two sentences in a pair. I understand this fine-tuning happens using the architecture shown in the image below:

[Figure from the tutorial: the SBERT training architecture — two BERT + pooling branches produce one embedding per sentence, and the cosine similarity of the two embeddings is compared against the label score]

Each sentence in a pair is first encoded using the BERT model, and then the "pooling" layer aggregates (usually by averaging) the word embeddings produced by the BERT layer to produce a single embedding for each sentence. In the final step, the cosine similarity of the two sentence embeddings is computed and compared against the label score.
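
For reference, the fine-tuning code from the tutorial looks roughly like this (a minimal sketch following the tutorial's pattern, assuming the classic `model.fit()` training API; the model name and the example pairs/labels are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Load a pre-trained SBERT model (placeholder model name).
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Sentence pairs with a gold similarity score in [0, 1] (made-up examples).
train_examples = [
    InputExample(texts=['The cat sits outside', 'The cat plays in the garden'], label=0.8),
    InputExample(texts=['A man is playing guitar', 'The new movie is awesome'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss corresponds to the architecture in the figure:
# encode both sentences, pool, take the cosine similarity, compare to the label.
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=100)
```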

My question here is: which parameters are being optimized when fine-tuning the model with this architecture? Is it fine-tuning only the parameters of the last layer of the BERT model? This is not clear to me from the code example shown in the tutorial for fine-tuning the model.
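
One way to inspect this, since a `SentenceTransformer` is an ordinary `torch.nn` module, is to list its parameters and their `requires_grad` flags (a sketch; the model name is a placeholder):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder model name

# Every parameter that reports requires_grad=True will receive gradient updates
# during fine-tuning; by default that is all of the BERT weights.
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)
```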

Fiori
  • All layers that your input has passed through will be fine-tuned unless you have frozen them. – cronoik Oct 16 '21 at 16:51
  • Thank you @cronoik. Do you know how I can specify in the train function which layers to freeze? For instance, I want to freeze all layers except the last one. – Fiori Oct 21 '21 at 01:20
    Somewhere in your code you are initializing the optimizer like `AdamW(model.parameters(), lr=2e-5)`. Just use a variable instead of `model.parameters()` that doesn't contain the last layer. – cronoik Oct 22 '21 at 18:48
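
For reference, a minimal sketch of the freezing approach discussed in the comments, assuming the internals of a recent sentence-transformers version (the `_first_module()` helper and the `auto_model` attribute are assumptions about that version, and the model name is a placeholder):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder model name

# The underlying HuggingFace BERT model sits inside the first module
# (the Transformer word-embedding module) of the SentenceTransformer.
bert = model._first_module().auto_model

# Freeze every BERT parameter, then unfreeze only the last encoder layer.
for param in bert.parameters():
    param.requires_grad = False
for param in bert.encoder.layer[-1].parameters():
    param.requires_grad = True

# Equivalently, build the optimizer only from the parameters that should be
# trained, in the spirit of the AdamW suggestion above.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=2e-5)
```

With `model.fit()`, setting `requires_grad = False` before training is usually enough on its own, since frozen parameters never receive gradients.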

0 Answers