I have been running this code for about 45 minutes now and it is still going. Can someone suggest how I can make it faster?

df4 is a pandas DataFrame. A sample with the same structure can be built like this:

import numpy as np
import pandas as pd

df4 = pd.DataFrame({
    'track_id': np.random.randn(3000000),
    'sentiment_score': np.random.choice([0, 1], 3000000),
    'user_id': np.random.choice(['11', '12', '13'], 3000000),
})

What I am aiming for is a new column called rating.

len(df4.index) is 3,037,321.

ratings = []
for index in df4.index:
    rowUserID = df4['user_id'][index]
    rowTrackID = df4['track_id'][index]
    rowSentimentScore = df4['sentiment_score'][index]

    # all rows sharing this row's user and sentiment (scans the whole frame)
    condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
    allRows = df4[condition]
    totalSongListendForContext = len(allRows.index)

    # of those, the rows that also share this row's track (another full scan)
    rows = df4[(condition & (df4['track_id'] == rowTrackID))]
    songListendForContext = len(rows.index)

    rating = songListendForContext / totalSongListendForContext
    ratings.append(rating)

1 Answer

Overall, you'll need groupby. You can either:

use two groupby with transform to get the size of what you called condition and the size of condition & (df4['track_id'] == rowTrackID), then divide the second by the first (a small worked example follows the code):

df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score','track_id'])['track_id'].transform('size')
                   / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
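
For instance, on a tiny hand-made frame (the values below are invented purely to illustrate), this gives:

import pandas as pd

toy = pd.DataFrame({
    'user_id': ['11', '11', '11', '12'],
    'sentiment_score': [1, 1, 1, 0],
    'track_id': ['a', 'a', 'b', 'a'],
})

toy['ratings'] = (toy.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
                  / toy.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))

# user '11' with sentiment 1 has 3 rows, 2 of them track 'a':
# the two 'a' rows get 2/3, the 'b' row gets 1/3, and user '12' gets 1.0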

Or use groupby with value_counts, passing normalize=True, and merge the result back into df4:

df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                   .value_counts(normalize=True)
                   .rename('ratings').reset_index(),
                how='left')

In both cases, you will get the same result as your list ratings (which I assume you wanted as a column). I would say the second option is faster, but it depends on the number of groups in your real data; a rough timing sketch follows.
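
If you want to check on data shaped like yours, a harness like the one below can compare the two options. The synthetic frame here is an assumption, not your real data, so the timings will differ:

import time

import numpy as np
import pandas as pd

# synthetic stand-in for df4; real timings depend on group counts and dtypes
df4 = pd.DataFrame({
    'track_id': np.random.choice(['t1', 't2', 't3', 't4'], 3_000_000),
    'sentiment_score': np.random.choice([0, 1], 3_000_000),
    'user_id': np.random.choice(['11', '12', '13'], 3_000_000),
})

start = time.perf_counter()
opt1 = (df4.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
        / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
print(f"two transforms:       {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
opt2 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                    .value_counts(normalize=True)
                    .rename('ratings').reset_index(),
                 how='left')
print(f"value_counts + merge: {time.perf_counter() - start:.2f}s")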

Ben.T