
I am using zero-shot classification to label large amounts of data. I have written a simple function to assist me with this and am wondering if there is a better way for it to run. My current logic is to take the highest-scoring label and append it, along with its score, to a dataframe.

from tqdm import tqdm

# classifier is a zero-shot classification pipeline defined elsewhere
def labeler(input_df, output_df):
    labels = ['Fruit', 'Vegetable', 'Meat', 'Other']

    for i in tqdm(range(len(input_df))):
        # Classify one description at a time and keep the top label and score
        temp = classifier(input_df['description'][i], labels)
        output = {'work_order_num': input_df['order_num'][i],
                  'work_order_desc': input_df['description'][i],
                  'label': temp['labels'][0],
                  'score': temp['scores'][0]}
        output_df.append(output)

In terms of speed and resources, would it be better to rewrite this function with a lambda?

1 Answer


Your problem boils down to iteration over the pandas dataframe input_df. Doing that with a for loop is not the most efficient way (see: How to iterate over rows in a DataFrame in Pandas).

I suggest doing something like this:

# These columns can be copied wholesale, without any per-row work
output_df[['work_order_num', 'work_order_desc']] = input_df[['order_num', 'description']].to_numpy()

def classification(df_desc):
    # Classify a single description and keep only the top label and its score
    temp = classifier(df_desc, labels)
    return temp['labels'][0], temp['scores'][0]

output_df['label'], output_df['score'] = zip(*input_df['description'].apply(classification))

The classification function returns a tuple of values that needs to be unpacked, so I used the zip trick from this question.
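
For illustration, here is a minimal, self-contained sketch of that unpacking step; the (label, score) pairs below are made up:

import pandas as pd

# A Series of (label, score) tuples, like the one returned by the apply() call above
pairs = pd.Series([('Fruit', 0.91), ('Meat', 0.77), ('Other', 0.55)])

# zip(*...) transposes the sequence of pairs into two parallel tuples,
# which can then be assigned to two columns in one statement
labels_col, scores_col = zip(*pairs)
print(labels_col)   # ('Fruit', 'Meat', 'Other')
print(scores_col)   # (0.91, 0.77, 0.55)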

Building a dataframe by repeatedly appending rows is also very slow. So with the solution above you avoid two potentially prohibitively slow operations: the explicit for loop and appending rows to a dataframe one at a time.
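
Putting the pieces together, a minimal sketch of the whole refactor could look like the following. It assumes classifier is a Hugging Face zero-shot-classification pipeline, which the question implies but does not show, and it returns a new dataframe instead of mutating one passed in, purely to keep the sketch self-contained:

import pandas as pd
from transformers import pipeline

# Assumption: a zero-shot classification pipeline, as implied by the question
classifier = pipeline('zero-shot-classification')
labels = ['Fruit', 'Vegetable', 'Meat', 'Other']

def classification(description):
    # Classify one description and keep only the top label and its score
    temp = classifier(description, labels)
    return temp['labels'][0], temp['scores'][0]

def labeler(input_df):
    # Copy and rename the columns that need no per-row work in one step
    output_df = input_df[['order_num', 'description']].rename(
        columns={'order_num': 'work_order_num',
                 'description': 'work_order_desc'})
    # Classify each description and unpack the (label, score) tuples into two columns
    output_df['label'], output_df['score'] = zip(
        *input_df['description'].apply(classification))
    return output_df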
