
I have the following code to calculate the average score over the rows of a DataFrame loaded from an XLSX file. calculate_score() returns a float score, e.g. 5.12.

import pandas as pd

testset = pd.read_excel(xlsx_filename_here)
total_score = 0
num_records = 0
for index, row in testset.iterrows():
    if pd.isna(row['Data1']) or pd.isna(row['Data2']) or pd.isna(row['Data3']):
        continue
    else:
        score = calculate_score([row['Data1'], row['Data2']], row['Data3'])
        total_score += score
        num_records += 1

print("Average score:", round(total_score/num_records, 2))

According to this answer, df.iterrows() is slow and an anti-pattern. How can I change the above code to use either vectorization or a list comprehension?


UPDATE

I oversimplified calculate_score() in the example above; it actually calculates the BLEU score of some sentences using the SacreBLEU library:

import evaluate
sacrebleu = evaluate.load("sacrebleu")

def calculate_score(ref, translation):
    # compute() returns a dict; the BLEU value is stored under "score"
    result = sacrebleu.compute(predictions=[translation], references=[ref])
    return result["score"]

Note that the original code has been updated slightly as well. How can I modify calculate_score() to use a list comprehension? Thanks.

Raptor
  • You could just use testset['Data1'] and testset['Data2'] as vectors that your calculate_score function can use (it's not clear at the moment how it calculates a score), remove all scores where Data1 or Data2 is NaN, and then calculate the mean of the score vector. – Alex V. Jun 20 '23 at 13:33
  • Without the code of the `calculate_score` function, it's not possible to vectorize your code because the function takes 2 scalar values as parameters... – Corralien Jun 20 '23 at 13:35

2 Answers

Here's how you can modify your code using vectorization, provided calculate_score() accepts two Series instead of two scalars:

import pandas as pd
import numpy as np

testset = pd.read_excel(xlsx_filename_here)

valid_rows = testset['Data1'].notna() & testset['Data2'].notna()

scores = calculate_score(testset.loc[valid_rows, 'Data1'], testset.loc[valid_rows, 'Data2'])

average_score = np.mean(scores)

print("Average score:", round(average_score, 2))
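
If calculate_score() cannot be rewritten to accept whole columns, a list comprehension over zip() is a common middle ground. The following is only a minimal sketch: the calculate_score() body here is a hypothetical stand-in (the fraction of references that exactly match the translation), and average_score() is a helper name introduced for illustration. The column names Data1, Data2 and Data3 follow the updated question.

```python
import pandas as pd

def calculate_score(refs, translation):
    # Hypothetical stand-in scorer: share of references identical to the translation
    return sum(r == translation for r in refs) / len(refs)

def average_score(df, cols=("Data1", "Data2", "Data3")):
    # Drop rows with a missing value in any of the scored columns
    valid = df.dropna(subset=list(cols))
    # zip() walks the columns in parallel without building a Series per row
    scores = [
        calculate_score([d1, d2], d3)
        for d1, d2, d3 in zip(valid[cols[0]], valid[cols[1]], valid[cols[2]])
    ]
    # Guard against an empty result to avoid dividing by zero
    return sum(scores) / len(scores) if scores else float("nan")

testset = pd.DataFrame({
    "Data1": ["a cat", "a dog", None],
    "Data2": ["the cat", "a dog", "a bird"],
    "Data3": ["a cat", "a dog", "a bird"],
})
print("Average score:", round(average_score(testset), 2))
```

The scoring function is still called once per row, so this mainly removes the per-row overhead of iterrows() rather than the cost of the scoring itself.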
Phoenix
  • Almost. Originally, `calculate_score()` only accepts `row['Data1']` and `row['Data2']` (integers) as input. Now `testset.loc[valid_rows, 'Data1']` returns a Series of integers. – Raptor Jun 20 '23 at 13:57
  • I have revised the code by adding `calculate_score()`. – Raptor Jun 21 '23 at 06:12

You have to modify the implementation of calculate_score to take two Series as parameters (or one DataFrame with two columns) instead of two scalar values:

def calculate_score(sr1, sr2):
    out = sr1 / sr2
    return out  # out is a Series

# Boolean mask: True for rows where neither column is NaN
cols = ['Data1', 'Data2']
m = testset[cols].notna().all(axis=1)

# Compute score with vectorized function
score = calculate_score(testset.loc[m, cols[0]], testset.loc[m, cols[1]])

# Stats
total_score, average_score = score.agg(['sum', 'mean'])

Output:

>>> score
0    0.333333
1    0.142857
3    2.000000
5    0.500000
6    1.000000
7    0.000000
9    0.375000
dtype: float64

>>> total_score
4.351190476190476

>>> average_score
0.6215986394557823

Input:

>>> testset
   Data1  Data2
0    2.0    6.0
1    1.0    7.0
2    NaN    4.0
3    4.0    2.0
4    4.0    NaN
5    4.0    8.0
6    1.0    1.0
7    0.0    5.0
8    NaN    5.0
9    3.0    8.0
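
For the SacreBLEU case added in the question's update, one possible approach is to pass all rows to the metric in a single call rather than scoring row by row. This is a sketch under assumptions: corpus_bleu() is a hypothetical helper name, Data1 and Data2 are taken to hold the references and Data3 the translation (as in the question), and the metric object is assumed to expose compute(predictions=..., references=...) returning a dict with a "score" key. Note that a single batched call yields a corpus-level BLEU, which is generally not the same number as the mean of per-sentence scores.

```python
import pandas as pd

def corpus_bleu(df, metric, ref_cols=("Data1", "Data2"), pred_col="Data3"):
    # Keep only rows where every reference and the translation are present
    valid = df.dropna(subset=[*ref_cols, pred_col])
    # One prediction string per row
    predictions = valid[pred_col].tolist()
    # One list of reference strings per row, in the same order as predictions
    references = valid[list(ref_cols)].values.tolist()
    # Single call for the whole batch; result assumed to be a dict with a "score" key
    return metric.compute(predictions=predictions, references=references)["score"]

# Usage with the metric loaded in the question (requires the evaluate package):
# import evaluate
# sacrebleu = evaluate.load("sacrebleu")
# print("Corpus BLEU:", round(corpus_bleu(testset, sacrebleu), 2))
```

If the average of per-sentence scores is what is needed, the per-row calculate_score() has to stay, and only the iteration around it can be changed.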
Corralien
  • I have revised the code by adding `calculate_score()`. Can you advise me on how to modify the code? Thanks. – Raptor Jun 21 '23 at 06:12