
I am using DistilBERT to do sentiment analysis on my dataset. Each row of the dataset contains a text and a label that identifies whether the text is a positive or negative movie review (e.g., 1 = positive and 0 = negative). Here is the code from the Hugging Face documentation (https://huggingface.co/transformers/custom_datasets.html?highlight=imdb):

#This dataset can be explored in the Hugging Face model hub (IMDb), and can alternatively be downloaded with the Datasets library with load_dataset("imdb").


wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xf aclImdb_v1.tar.gz


#This data is organized into pos and neg folders with one text file per example. Let’s write a function that can read this in.

from pathlib import Path

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text())
            # Compare strings with ==; `is` tests object identity, not equality
            labels.append(0 if label_dir == "neg" else 1)

    return texts, labels

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

#Now that our datasets are ready, we can fine-tune a model either with the Trainer/TFTrainer or with native PyTorch/TensorFlow. See training.

#Fine-tuning with Trainer

#The steps above prepared the datasets in the way that the trainer expects. Now all we need to do is create a model to fine-tune, define the TrainingArguments/TFTrainingArguments and instantiate a Trainer/TFTrainer.

from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()


#We can also train with native PyTorch/TensorFlow

from torch.utils.data import DataLoader
# transformers.AdamW is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW
from transformers import DistilBertForSequenceClassification

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optim = AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()

I now want to test this model on new data. I have a dataframe that contains a piece of text (a review) in each row, and I want to predict the label for each one. Does anyone know how I would go about doing that? I apologize, I am very new to this and would greatly appreciate any help! I tried taking in text, cleaning it, and then doing

prediction = model.predict(text)

and I got an error saying DistilBERT has no attribute .predict.

desertnaut
brownie_coder

3 Answers


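If you just want to use a model for inference, you can use the corresponding pipeline. Called without a model argument, as below, it loads a default pretrained sentiment model; to use your own fine-tuned model, pass it through the model and tokenizer arguments of pipeline: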

from transformers import pipeline
classifier = pipeline('sentiment-analysis')

Then you can use it:

classifier("I hate this book")
NicolasPeruchot

The code that you've shared from the documentation essentially covers the training and evaluation loop. Note that it contains two ways of fine-tuning: one with the Trainer, which also includes evaluation, and one with native PyTorch/TF, which contains just the training portion and not the evaluation portion.

Here is how the native method can be tweaked to generate predictions on the test set:

from torch.utils.data import DataLoader

# Put model in evaluation mode
model.eval()

# Batch the test set; iterating over the Dataset directly would yield
# single, unbatched examples that the model cannot consume
test_loader = DataLoader(test_dataset, batch_size=64)

# Tracking variables for storing ground truth and predictions
predictions, true_labels = [], []

# Prediction loop
for batch in test_loader:
    # Unpack the inputs from our dataloader and move them to the GPU/accelerator
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels']

    # Tell the model not to compute or store gradients, saving memory and
    # speeding up prediction
    with torch.no_grad():
        # Forward pass without labels, so the first output is the logits
        # (when labels are passed, outputs[0] is the loss instead)
        outputs = model(input_ids, attention_mask=attention_mask)

    logits = outputs[0]

    # Move logits to the CPU and convert logits and labels to NumPy arrays
    logits = logits.cpu().numpy()
    label_ids = labels.numpy()

    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)

After the execution of this loop, predictions will contain the logits, i.e., the raw, unnormalized scores the model assigns to each class (applying a softmax would turn them into probabilities). You can use the following to pick the label with the maximum score from the logits and produce a classification report:

import numpy as np
from sklearn.metrics import classification_report, accuracy_score

# Combine the results across all batches.
flat_predictions = np.concatenate(predictions, axis=0)

# For each sample, pick the label (0 or 1) with the higher score.
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

# Combine the correct labels for each batch into a single array.
flat_true_labels = np.concatenate(true_labels, axis=0)

# Accuracy
print(accuracy_score(flat_true_labels, flat_predictions))

# Classification report
report = classification_report(flat_true_labels, flat_predictions)
print(report)
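
If you also want a confidence score per prediction rather than just the label, the logits can be passed through a softmax. The following is a minimal sketch with NumPy; logits_to_probs and example_logits are illustrative names and made-up values, not part of the code above, and the (num_samples, num_classes) shape is assumed to match what np.concatenate(predictions, axis=0) produces.

```python
import numpy as np

def logits_to_probs(logits):
    """Row-wise softmax: turn raw class scores into probabilities."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum for numerical stability; the result is unchanged
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits for two reviews and two classes (0 = negative, 1 = positive)
example_logits = np.array([[2.0, -1.0], [-0.5, 1.5]])
probs = logits_to_probs(example_logits)

# The predicted label is the class with the highest probability,
# which is the same class that has the highest logit
pred_labels = probs.argmax(axis=1)
print(probs)
print(pred_labels)
```

Because softmax preserves the ordering of the scores, taking the argmax of the probabilities gives the same labels as taking the argmax of the logits; the probabilities are only needed if you want to report or threshold the confidence.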

For a more elegant way of performing predictions, you can create a BERTModel class that contains the methods and variables for handling the tokenization, creating the dataloader, running the predictions, etc.

A.T.B
  • Thanks! So I wrote code that tested the model on the test dataset, which was actually part of my original dataset, and it returned the classification report. I did prediction = trainer.predict(test), followed along with the softmax function, and output the report. But how would I test on a df that has data but no labels yet? The test dataset already had labels, but the new df I want to test on, which is not part of the original set, has no labels. Do I follow the same process? – brownie_coder Oct 24 '21 at 20:49
  • You'll have to follow the same process with the new dataset - read the file, tokenize the text, create a dataloader without the ground truth labels. Following that, you'll pass the dataset to the predict method to run and return the predictions. – A.T.B Oct 25 '21 at 09:48
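
The process described in the comment above (tokenize the new text, batch it without labels, run the model, take the argmax) could be sketched roughly as follows. This is an assumption-laden illustration rather than the method from the documentation: predict_labels is a hypothetical helper, it batches the raw texts directly instead of building a Dataset, and it assumes a Hugging Face tokenizer and a sequence classification model whose output exposes a .logits attribute.

```python
import torch

def predict_labels(texts, tokenizer, model, device="cpu", batch_size=32):
    """Return one predicted class index per input text (no labels needed).

    Hypothetical helper: `tokenizer` and `model` are assumed to be the
    fine-tuned tokenizer/model pair from the training code.
    """
    model.to(device)
    model.eval()  # disable dropout for inference
    predicted = []
    for start in range(0, len(texts), batch_size):
        batch_texts = texts[start:start + batch_size]
        # Tokenize one batch at a time and return PyTorch tensors
        enc = tokenizer(batch_texts, truncation=True, padding=True,
                        return_tensors="pt")
        enc = {key: val.to(device) for key, val in enc.items()}
        with torch.no_grad():  # no gradients are needed for prediction
            logits = model(**enc).logits
        # The predicted label is the index of the largest logit
        predicted.extend(logits.argmax(dim=-1).tolist())
    return predicted
```

For a dataframe df whose reviews sit in a column named, say, "review" (a placeholder name), this could be used as df["predicted_label"] = predict_labels(df["review"].tolist(), tokenizer, model, device). Tokenizing per batch keeps memory use bounded for large dataframes, at the cost of padding each batch to a different length.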

You can try code like this example: Link-BERT

You'll need to arrange your dataset in the format the BERT model expects. See section D in that link; you only need to change the model name and the dataset.

Anil Guven