
I'm trying to get the output embeddings of a RoBERTa model so that I can train a random forest classifier on them for text classification (sentiment analysis). The dataset this is based on is 500 news articles that each have a left/center/right bias rating. 80% of this dataset is training data, the other 20% is test data.
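For context, the split is made up front so the labels stay paired with the articles; roughly like this sketch (texts and labels are just placeholder names for my list of article strings and bias ratings, not shown above):

from sklearn.model_selection import train_test_split

# texts: list of article strings, labels: list of 'left'/'center'/'right' ratings (placeholder names)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)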

I run the following code for my training set:

import numpy as np
import torch
import torch.nn.functional as F

# tokenizer and model are a Hugging Face RoBERTa tokenizer/model loaded earlier (not shown)

# Tokenize the sentences of the training set
encoded_input = tokenizer(X_train, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling 
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=3, dim=1)

# Collect the embeddings as a NumPy array (the start flag is only needed when stacking batches)
start = True
numpy_emb = []

if not start:
    np_emb = sentence_embeddings.cpu().detach().numpy()
    numpy_emb = np.vstack([numpy_emb, np_emb])
else:
    start = False
    numpy_emb = np_emb = sentence_embeddings.cpu().detach().numpy()
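For completeness, mean_pooling here is the usual attention-mask-weighted mean over the token embeddings, roughly this sketch (not the exact code I use, but adapted from the standard sentence-transformers example):

def mean_pooling(model_output, attention_mask):
    # First element of model_output holds the per-token embeddings (last hidden state)
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the real (non-padding) token embeddings and divide by the number of real tokens
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)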

This gives me numpy_emb, which I think contains the embeddings that the RoBERTa model outputs. When I print the embeddings, I get:

tensor([[ 0.5329, -0.1224],
        [ 0.5409, -0.0730],
        [ 0.4594, -0.1282],
        [ 0.5116, -0.0769],
        [ 0.4861, -0.0212],
        [ 0.5246, -0.0560],
        [ 0.5555, -0.0962],
        [ 0.4779, -0.0551],
        [ 0.5428, -0.0904],
        [ 0.5939, -0.0504],
        [ 0.5219, -0.1342],
        [ 0.4672, -0.0936],
        [ 0.5051, -0.0518],
        [ 0.5536, -0.1016],
        [ 0.4761, -0.0736],
        [ 0.4754, -0.0991],
        [ 0.5613, -0.0541],
        [ 0.5155,  0.0303],
        [ 0.6053,  0.0214],
        [ 0.4766, -0.1019],
        [ 0.4262, -0.0869],
        [ 0.3871, -0.0756],
        [ 0.5048, -0.0067],
        [ 0.5425, -0.1303],
        [ 0.5020, -0.0715],
    ...
        [ 0.5462, -0.0686],
        [ 0.5476, -0.1465],
        [ 0.4968, -0.0354],
        [ 0.5586, -0.1234],
        [ 0.5725, -0.0685]])

I then repeat this process for my test set as well, giving me another set of embeddings.
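To repeat it cleanly, the steps above could be wrapped in a small helper so both sets go through exactly the same pipeline; a sketch (embed is just an illustrative name, not in my code above):

def embed(texts):
    # Same steps as above: tokenize, run the model, mean-pool, normalize
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded)
    pooled = mean_pooling(output, encoded['attention_mask'])
    pooled = F.normalize(pooled, p=3, dim=1)
    return pooled.cpu().numpy()

numpy_emb = embed(X_train)
numpy_emb_test = embed(X_test)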

Then I train a random forest classifier on the embeddings from the training set. But when I predict on the embeddings from my test set, the results look almost random: accuracy goes as low as 24% and as high as 58% across runs. Is this because of the small amount of data that I have, or is there something else I'm doing wrong?

I also suspect that I'm not properly linking the output embeddings to their respective labels, which would also explain the random results I get.
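As a sanity check on that: the tokenizer keeps the input order, so row i of the embedding matrix should already correspond to y_train[i]. Something like this would at least confirm the shapes line up (sketch):

# One embedding row per text, in the same order as the labels
assert numpy_emb.shape[0] == len(X_train) == len(y_train)
assert numpy_emb_test.shape[0] == len(X_test) == len(y_test)

# Each row should be a full hidden-state vector (768 values for roberta-base), not just 2
print(numpy_emb.shape)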

Code for the random forest classifier that I used:

from sklearn.ensemble import RandomForestClassifier
text_classifier = RandomForestClassifier(n_estimators=100, random_state=0)
text_classifier.fit(numpy_emb, y_train)

predictions = text_classifier.predict(numpy_emb_test)

# Confusion matrix, classification report and accuracy
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
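To see how much of the spread comes from the small dataset rather than from the features, cross-validation on the training embeddings could be run as well (sketch):

from sklearn.model_selection import cross_val_score

# 5-fold CV on the training embeddings to see how much accuracy fluctuates between folds
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                            numpy_emb, y_train, cv=5)
print(cv_scores.mean(), cv_scores.std())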
– pyroshark
1 Answer


This shape doesn't look like a proper embedding. For classification purposes, a usual approach with encoder-only models is to supply the last hidden state of the <s> (CLS) token as the embedding for the classifier, for example:

# <s> (CLS) token of the last hidden state, shape (n_samples, hidden_size)
features = model_output[0][:, 0, :].numpy()
text_classifier.fit(features, y_train)
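The same extraction then needs to be applied to the test set before predicting, roughly like this (sketch, assuming model_output_test was produced from the test texts the same way as model_output):

# Extract the test-set features with the identical slicing, then predict
test_features = model_output_test[0][:, 0, :].numpy()
predictions = text_classifier.predict(test_features)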
– dx2-66
  • Thank you for helping me. But even after using the embeddings you propose, I still get very low and volatile accuracy if I run the whole program (embedding and classifying) a few times. On a scatterplot, I can see that the labels are hard to differentiate based on the embeddings. Could this be because of the low amount of data? Or is it simply that RoBERTa can't differentiate between them? (It is a very hard task I'm training it on) – pyroshark Aug 22 '22 at 08:45
  • Are you sure you're loading a proper pretrained model? Could it be your texts are too different from what it was trained upon? Is the accuracy any better when using sklearn vectorizers instead? You can also try to finetune `RobertaForSequenceClassification()` (which passes the embeddings to a linear layer) to see whether it's suitable. Also, using a simple model like `LogisticRegression()` often yields better results for this task than complex ones like random forest. – dx2-66 Aug 22 '22 at 10:01