I am iterating over a pandas DataFrame (df) and adding scores to a dictionary of Python lists (scores):

for index, row in df.iterrows():
    scores[row["key"]][row["pos"]] = scores[row["key"]][row["pos"]] + row["score"]

The scores dictionary is not empty to begin with. The DataFrame is very large, so this loop takes a long time. Is there a way to do this without a loop, or to speed it up in some other way?

Jumee

1 Answer

A for loop seems hard to avoid, but we can speed things up with NumPy's fancy indexing and pandas' groupby, so that the loop runs once per key instead of once per row:

# group the scores over `key` and gather them in a list
grouped_scores = df.groupby("key").agg(list)

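# for each key, value in the dictionary...
for key, val in scores.items():

    # skip keys that have no rows in df (`.loc` would raise a KeyError)
    if key not in grouped_scores.index:
        continue

    # first look up the positions to update and the corresponding scores
    pos, score = grouped_scores.loc[key, ["pos", "score"]]

    # then fancy indexing with `pos`: reaching all positions at once
    # (needs scores[key] to be a NumPy array; a position repeated within one
    # key is added only once, np.add.at(scores[key], pos, score) sums repeats)
    scores[key][pos] += score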
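
For anyone who wants to try the idea end to end, here is a minimal, self-contained sketch on made-up data. It is a variant rather than the exact code above: it assumes the lists in `scores` can be converted to NumPy arrays, ignores rows whose key is not in `scores`, and swaps the `+=` for `np.add.at` so that a position appearing several times under one key is accumulated each time, as in the original row-by-row loop. The sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real df and scores come from the question.
df = pd.DataFrame({
    "key":   ["a", "a", "b", "a"],
    "pos":   [0, 2, 1, 0],          # position 0 of "a" appears twice
    "score": [1.0, 2.0, 3.0, 4.0],
})
scores = {"a": [10.0, 10.0, 10.0], "b": [0.0, 0.0], "c": [5.0]}

# Convert the lists to arrays once so they accept array indices.
scores = {key: np.asarray(val, dtype=float) for key, val in scores.items()}

# One pass over the groups instead of one pass over the rows.
for key, group in df.groupby("key"):
    if key not in scores:
        continue  # assumption: rows with unknown keys are ignored
    # Unbuffered add: repeated positions are summed, unlike `arr[pos] += x`.
    np.add.at(scores[key], group["pos"].to_numpy(), group["score"].to_numpy())

# Back to plain lists, if the rest of the program expects them.
scores = {key: val.tolist() for key, val in scores.items()}
print(scores)
```

Converting once up front and back once at the end keeps the per-key work inside NumPy, which is where the speed-up over `iterrows` comes from.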
Mustafa Aydın
  • Thanks! My code execution time went from 300 sec to 6 sec. – Jumee May 31 '21 at 06:22
  • @MustafaAydın Nice answer. Please also explain what's "NumPy's fancy indexing" here that sped this. Couldn't understand that :) – Ank May 31 '21 at 07:11
  • @Ank Oh, thanks. I thought that term is official from NumPy docs, but it's not, sorry. You can refer [here](https://jakevdp.github.io/PythonDataScienceHandbook/02.07-fancy-indexing.html). It boils down to being able to `the_list[[2, -1, 3, 4]]` at once instead of doing it separately. It applies to any N-dimensional array. I think the link explains better than me, and maybe they coined the term "fancy indexing", not sure `:)`. – Mustafa Aydın May 31 '21 at 07:18
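
To make the indexing described in the last comment concrete, here is a small sketch with made-up values. The names `values`, `arr`, and `picked` are illustrative only. Note that the list-of-indices form works on a NumPy array, not on a plain Python list.

```python
import numpy as np

values = [10, 20, 30, 40, 50]
arr = np.array(values)

# An array accepts a list of indices and returns all of them at once.
picked = arr[[2, -1, 3]]
print(picked.tolist())  # [30, 50, 40]

# Assigning through the same kind of index updates every position in one step.
arr[[0, 1]] += 100
print(arr.tolist())  # [110, 120, 30, 40, 50]

# A plain Python list does not support this and raises TypeError.
try:
    values[[2, -1, 3]]
except TypeError as exc:
    print("list indexing failed:", exc)
```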