Replace a loop over pandas DataFrame

Question

I am iterating through a pandas DataFrame (df) and adding scores to a dictionary of Python lists (scores):

for index, row in df.iterrows():
    scores[row["key"]][row["pos"]] = scores[row["key"]][row["pos"]] + row["score"]

The scores dictionary is not empty initially. The DataFrame is very large and this loop takes a long time. Is there a way to do this without a loop, or to speed it up in some other way?

Hi, welcome to Stack Overflow. Can you provide a reproducible sample of your data? See this guide on how to provide a code sample: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – SunilG, May 31 '21 at 05:53
Is NumPy an option instead of the Python lists in the `scores` dict? – Mustafa Aydın, May 31 '21 at 05:53
@mustafa-aydın Yes, I can use NumPy arrays for the scores dict. – Jumee, May 31 '21 at 06:00

Mustafa Aydın · Accepted Answer · 2021-05-31T07:19:54.873

A for loop seems somewhat inevitable, but we can speed things up with NumPy's fancy indexing and Pandas' groupby:

# group the rows by `key` and gather `pos` and `score` into lists
grouped_scores = df.groupby("key").agg(list)

# for each key, value in the dictionary
# (the values of `scores` are assumed to be NumPy arrays, not lists)
for key, val in scores.items():

    # first look up the positions to update and the corresponding scores
    pos, score = grouped_scores.loc[key, ["pos", "score"]]

    # then fancy indexing with `pos`: reaching all positions at once
    scores[key][pos] += score
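
A self-contained sketch of the same idea follows, mostly to show it running end to end. The sample `df` and `scores` are made up for illustration (the question does not show any data). It also swaps `+=` for `np.add.at`, which is my substitution and not part of the original answer: with fancy indexing, `arr[pos] += score` applies only one update per position when `pos` contains repeats, whereas the original `iterrows` loop accumulates all of them. If each key never hits the same position twice, the plain `+=` above should give the same result. The sketch also skips keys that are in `scores` but not in `df`, since `.loc` would otherwise raise a `KeyError`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; not from the original question.
df = pd.DataFrame({
    "key":   ["a", "a", "b", "a"],
    "pos":   [0, 2, 1, 0],           # key "a" hits position 0 twice
    "score": [1.0, 2.0, 3.0, 4.0],
})

# `scores` maps each key to a NumPy array and is not empty initially.
scores = {
    "a": np.array([10.0, 10.0, 10.0]),
    "b": np.array([0.0, 0.0]),
    "c": np.array([5.0]),            # key that never appears in df
}

# One row per key; the `pos` and `score` cells each hold a list.
grouped_scores = df.groupby("key").agg(list)

for key, arr in scores.items():
    # Keys missing from the DataFrame have nothing to add.
    if key not in grouped_scores.index:
        continue

    pos, score = grouped_scores.loc[key, ["pos", "score"]]

    # Unbuffered in-place add: repeated positions are all accumulated.
    np.add.at(arr, pos, score)

print(scores["a"])  # position 0 received both 1.0 and 4.0
```

There is still a Python loop here, but it runs once per key rather than once per row, which is presumably where most of the speedup comes from.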

edited May 31 '21 at 07:19

answered May 31 '21 at 06:04

Mustafa Aydın


Thanks! My code execution time went from 300 sec to 6 sec. – Jumee May 31 '21 at 06:22
@MustafaAydın Nice answer. Please also explain what "NumPy's fancy indexing" is here and how it sped this up. I couldn't understand that :) – Ank May 31 '21 at 07:11

@Ank Oh, thanks. I thought that term is official from NumPy docs, but it's not, sorry. You can refer [here](https://jakevdp.github.io/PythonDataScienceHandbook/02.07-fancy-indexing.html). It boils down to being able to do `the_array[[2, -1, 3, 4]]` in one step instead of indexing each position separately. It applies to any N-dimensional array. I think the link explains it better than I can, and maybe they coined the term "fancy indexing", not sure `:)`. – Mustafa Aydın May 31 '21 at 07:18
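
To make the comment above concrete, here is a small sketch of indexing a NumPy array with a list of positions. The array `arr` and its values are made up for illustration. Note that this works on NumPy arrays only; a plain Python list raises a `TypeError` when indexed with a list.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60])

# Fancy indexing: pass a list of positions and read them all in one call.
picked = arr[[2, -1, 3, 4]]   # positions 2, last, 3 and 4
print(picked)                 # [30 60 40 50]

# The same syntax updates several positions in place at once.
arr[[0, 1]] += 5
print(arr)                    # [15 25 30 40 50 60]
```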
