
I'm trying to do zero-shot classification over a dataset with 5,000 records. Right now I'm using a normal Python loop, but it is going painfully slow. Is there a way to speed up the process using Transformers or Datasets structures? This is how my code looks right now:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model='cross-encoder/nli-roberta-base')

# Create prediction list
candidate_labels = ["Self-direction: action", "Achievement", "Security: personal", "Security: societal", "Benevolence: caring", "Universalism: concern"]
predictions = []

for index, row in reduced_dataset.iterrows():
    res = classifier(row["text"], candidate_labels)
    # Binarize each label's score with a 0.5 threshold
    partial_prediction = []
    for score in res["scores"]:
        if score >= 0.5:
            partial_prediction.append(1)
        else:
            partial_prediction.append(0)
    
    if index % 100 == 0:
        print(index)
    predictions.append(partial_prediction)

ignacioct
1 Answer
It is always more efficient to process sentences in batches that can be parallelized. According to the documentation, you can provide a list (or, more precisely, an Iterable) of sentences instead of a single input sentence. The pipeline will automatically take care of all the hassle connected with batching (padding sentences to the same length, estimating a batch size that fits in memory, etc.) and will return an Iterable of predictions.
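For example, with the classifier and candidate_labels defined as in your question, something along these lines should already be much faster (a minimal sketch; batch_size=16 is an arbitrary value you may need to tune to your memory):

# Passing the whole column at once lets the pipeline batch the forward passes
# instead of running one sentence per call.
texts = reduced_dataset["text"].tolist()
results = classifier(texts, candidate_labels=candidate_labels, batch_size=16)
# results is a list of dicts, one per text, each with "labels" and "scores" keys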

The documentation even recommends using Dataset objects as inputs to pipelines.
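A sketch of that variant, assuming reduced_dataset is a pandas DataFrame with a "text" column as in your question (KeyDataset simply yields that column to the pipeline; the threshold logic mirrors your original loop, except that the pipeline returns labels sorted by descending score, so the scores are mapped back to the candidate_labels order before thresholding):

from datasets import Dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

classifier = pipeline(
    "zero-shot-classification",
    model="cross-encoder/nli-roberta-base",
    # device=0,  # uncomment to run on a GPU, which speeds this up considerably
)

candidate_labels = ["Self-direction: action", "Achievement", "Security: personal",
                    "Security: societal", "Benevolence: caring", "Universalism: concern"]

dataset = Dataset.from_pandas(reduced_dataset)

predictions = []
for res in classifier(KeyDataset(dataset, "text"),
                      candidate_labels=candidate_labels, batch_size=16):
    # The pipeline sorts labels by descending score, so map the scores back
    # to the original candidate_labels order before applying the threshold.
    score_by_label = dict(zip(res["labels"], res["scores"]))
    predictions.append([1 if score_by_label[label] >= 0.5 else 0
                        for label in candidate_labels])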

Jindřich