I have imbalanced data that I need to classify, and I want to use SMOTE to balance it. However, I don't understand how to apply it, since BERT takes multiple inputs. Should I apply SMOTE to input_ids, to attention_masks, or to both? A piece of code would also be really useful :)

atlas
  • Welcome to Stack Overflow! Could you explain the task a bit more? Is this sentence classification or token classification? Is the imbalance in the labels or in some other feature? Is there a public dataset similar to your task? – Mehdi Jul 02 '22 at 08:20
  • @Mehdi It's sentence classification for sentiment analysis with label imbalance (most of the sentences are neutral). I don't know of any similar public data :( – atlas Jul 03 '22 at 14:38
  • this answer sums up the solution: https://stackoverflow.com/a/63379055/2991872 – Mehdi Jul 05 '22 at 23:07
  • @Mehdi I've seen that answer, but I still don't get it. Does it mean that I should use only the [CLS] token for classification? – atlas Jul 11 '22 at 11:25
  • Yes, the general solution for the sentence classification tasks is to use the hidden vector representing [CLS] as sentence representation. You can use SMOTE to sample from [CLS] vector space, but that means you won't be able to fine-tune the transformer body of BERT, because there won't be any specific input for synthetic vectors. – Mehdi Jul 12 '22 at 14:02
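
To make the last comment concrete, here is a minimal, dependency-free sketch of the SMOTE interpolation step applied to sentence vectors. It is only an illustration under stated assumptions: the random vectors stand in for [CLS] embeddings that would have been extracted from a frozen BERT beforehand, and `smote_oversample` is a hypothetical helper written for this example, not a library function. In practice one would more likely call `SMOTE().fit_resample(X, y)` from imbalanced-learn on the real embeddings. The point it shows is that SMOTE interpolates between real-valued vectors, which is why it fits the [CLS] space and not integer `input_ids` or `attention_mask` arrays.

```python
import math
import random


def smote_oversample(vectors, n_new, k=3, seed=0):
    """Hypothetical helper: build n_new synthetic vectors by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    if len(vectors) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(vectors) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(vectors)
        # k nearest minority-class neighbours of `base`, excluding itself
        neighbours = sorted(
            (v for v in vectors if v is not base),
            key=lambda v: math.dist(base, v),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Assumption: stand-ins for [CLS] embeddings (bert-base would give 768 dims).
rng = random.Random(42)
dim = 8
neutral = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(20)]
negative = [[rng.gauss(3.0, 1.0) for _ in range(dim)] for _ in range(5)]

# Oversample the minority class until both classes have the same size.
extra = smote_oversample(negative, n_new=len(neutral) - len(negative))
X = neutral + negative + extra
y = [0] * len(neutral) + [1] * (len(negative) + len(extra))
print(len(X), sum(y), len(y) - sum(y))  # 40 20 20
```

The balanced `X` and `y` can then train a separate classifier head (for example logistic regression) on top of the fixed embeddings. As the comment says, the synthetic vectors have no corresponding token inputs, so the BERT body itself cannot be fine-tuned on them.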

0 Answers