How to create a dataset object with multiple text inputs for the SetFit model?

Question

The SetFit library accepts two inputs, "text" and "label": https://huggingface.co/blog/setfit

My goal is to train SetFit on pairs of texts with a binary label (similar or not similar), i.e. rows of the form ("text1", "text2", "similar/not").

An example of such a dataset is setfit/mnli:

>>> dataset = load_dataset('setfit/mnli')
>>> dataset

DatasetDict({
    train: Dataset({
        features: ['text1', 'text2', 'label', 'idx', 'label_text'],
        num_rows: 392702
    })
    test: Dataset({
        features: ['text1', 'text2', 'label', 'idx', 'label_text'],
        num_rows: 9796
    })
    validation: Dataset({
        features: ['text1', 'text2', 'label', 'idx', 'label_text'],
        num_rows: 9815
    })
})

I tried:

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    column_mapping={"text1": "text", "text2": "text", "label": "label"}
)

But fitting the raw Dataset with text1 and text2 doesn't work. Is there any way I could train with this kind of two-text input?

alvas · Answer 1 · 2023-03-24T20:21:00.093

The example in https://huggingface.co/blog/setfit uses the "SetFit/SentEval-CR" dataset, which is laid out much like the mnli dataset you're looking at.

If we loop over the dataset, it looks like:

from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer

dataset = load_dataset("SetFit/SentEval-CR")

for row in dataset['train']:
  print(row)
  break

[out]:

{'text': "many of our disney movies do n 't play on this dvd player .", 
'label': 0, 
'label_text': 'negative'}

In this case each data point has the following keys (the trainer itself reads only text and label):

text
label
label_text

Since the mnli dataset has two texts, you can combine them with </s> to form a single text key. First, confirm that this is the separator token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

print(tokenizer.sep_token)  # Output: </s>

Then combine the two columns:

from datasets import load_dataset


# Load a dataset from the Hugging Face Hub
dataset = load_dataset('setfit/mnli')

# Join the two texts with the separator token confirmed above
dataset = dataset.map(lambda row: {"text": row['text1'] + " </s> " + row['text2']})

dataset

[out]:

DatasetDict({
    train: Dataset({
        features: ['text1', 'text2', 'label', 'idx', 'label_text', 'text'],
        num_rows: 392702
    })
    test: Dataset({
        features: ['text1', 'text2', 'label', 'idx', 'label_text', 'text'],
        num_rows: 9796
    })
    validation: Dataset({
        features: ['text1', 'text2', 'label', 'idx', 'label_text', 'text'],
        num_rows: 9815
    })
})
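
The joining step can also be tried on plain dictionaries before touching the full dataset. This is only a minimal sketch: `join_pair` is a hypothetical helper, the two rows are made-up examples with the same keys as the dataset above, and the `</s>` default assumes the separator printed by `tokenizer.sep_token` for your model.

```python
# Assumed separator; check tokenizer.sep_token for the model you use
SEP = "</s>"


def join_pair(row, sep=SEP):
    """Return a new 'text' field built from row['text1'] and row['text2']."""
    return {"text": f"{row['text1']} {sep} {row['text2']}"}


# Made-up rows with the same keys as the dataset above
rows = [
    {"text1": "A man is eating.", "text2": "Someone is having a meal.", "label": 0},
    {"text1": "A man is eating.", "text2": "Nobody is eating.", "label": 2},
]

# Keep the original keys and add the joined 'text' key
joined = [{**row, **join_pair(row)} for row in rows]
print(joined[0]["text"])
```

A function of this shape could be passed to `dataset.map(join_pair)` in place of the lambda, which keeps the separator in one place if you switch to a model with a different one.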

To train the model, following the example from https://huggingface.co/blog/setfit:

from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer, sample_dataset

# Load a dataset from the Hugging Face Hub
dataset = load_dataset('setfit/mnli')
dataset = dataset.map(lambda row: {"text": row['text1'] + " </s> " + row['text2']})

# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_dataset = dataset["validation"]

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20, # The number of text pairs to generate for contrastive learning
    num_epochs=1, # The number of epochs to use for contrastive learning
    # No column_mapping needed: after the map above, the columns are already named "text" and "label"
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()
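
If a binary similar / not similar label is needed, as in the question, one option is to collapse the dataset's classes into two before training. The following is a minimal sketch rather than a recommendation: `to_binary` is a hypothetical helper, the `label_text` strings are assumed (check the dataset itself), and treating only "entailment" as "similar" is my assumption, not something the dataset defines.

```python
# Assumption: which classes count as "similar" is a modelling choice
SIMILAR_LABELS = {"entailment"}


def to_binary(row, similar=SIMILAR_LABELS):
    """Return label 1 if the row's label_text counts as similar, else 0."""
    return {"label": int(row["label_text"] in similar)}


# Made-up rows; the label_text values are assumed, not read from the dataset
rows = [
    {"label": 0, "label_text": "entailment"},
    {"label": 1, "label_text": "neutral"},
    {"label": 2, "label_text": "contradiction"},
]

binary = [to_binary(row)["label"] for row in rows]
print(binary)  # [1, 0, 0]
```

A function like this could be applied with `dataset.map(to_binary)` before sampling the few-shot training set, so that `sample_dataset` draws 8 examples for each of the two classes.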

Hi! Thank you for answering my question. Sorry, I need to clarify: what I want to determine is whether the two joined sentences are similar or not. In your example, I think a pair of joined sentences is labeled 1, and a sample labeled 0 is one that is not similar to the label-1 samples. Is that correct? Can my goal not be achieved with SetFit? — wenz, Mar 30 '23 at 07:50
Look at the code carefully. I think it's doing what you want =) Hint: Look at `CosineSimilarityLoss` and what's inside `label` in the dataset. — alvas, Mar 30 '23 at 09:24
Hi, thank you for your help! My understanding is that the `label` in `"SetFit/SentEval-CR"` marks negative samples (label 0) as the "opposite" of positive samples (label 1) (reference: https://www.youtube.com/live/8h27lV8v8BU?feature=share&t=1316). If we add `</s>` to form a single `text`, does `column_mapping` then work differently? In the `mnli` dataset, each row consists of a pair of texts that are similar or not. In `"SetFit/SentEval-CR"`, on the other hand, all label-1 rows are similar to each other and all label-0 rows are the opposite of label 1. Please help me understand the concept. Thank you. — wenz, Mar 31 '23 at 08:14
It depends on how the labels are defined in the docs of the mnli dataset. Read through the dataset and labels too, you'll figure out what the 1/0s mean. You've got this. Believe in your intuition after reading the dataset one data point at a time. Hint: 1/0 isn't just positive/negative, it's merely a binary label set to any definition the dataset wants it to be. — alvas, Mar 31 '23 at 10:13
