I have been running this code for about 45 minutes now and it is still going. Can someone suggest how I can make it faster?

df4 is a pandas DataFrame. A sample with the same structure can be built like this:

import numpy as np
import pandas as pd

df4 = pd.DataFrame({
    'track_id': np.random.randn(3000000),
    'sentiment_score': np.random.choice([0, 1], 3000000),
    'user_id': np.random.choice(['11', '12', '13'], 3000000),
})

What I am aiming for is a new column called rating.

len(df4.index) is 3,037,321.

ratings = []
for index in df4.index:
    rowUserID = df4['user_id'][index]
    rowTrackID = df4['track_id'][index]
    rowSentimentScore = df4['sentiment_score'][index]

    # all rows sharing this row's user and sentiment (scans the whole frame)
    condition = ((df4['user_id'] == rowUserID) & (df4['sentiment_score'] == rowSentimentScore))
    allRows = df4[condition]
    totalSongListendForContext = len(allRows.index)

    # of those, the rows that also share this row's track (another full scan)
    rows = df4[(condition & (df4['track_id'] == rowTrackID))]
    songListendForContext = len(rows.index)

    rating = songListendForContext / totalSongListendForContext
    ratings.append(rating)

1 Answer

Overall, you'll need groupby. You can either:

use two groupby with transform to get the size of what you called condition and the size of condition & (df4['track_id'] == rowTrackID), then divide the second by the first (a small worked example follows the code):

df4['ratings'] = (df4.groupby(['user_id', 'sentiment_score','track_id'])['track_id'].transform('size')
                   / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
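
For instance, on a tiny hand-made frame (the values below are invented purely to illustrate), this gives:

import pandas as pd

toy = pd.DataFrame({
    'user_id': ['11', '11', '11', '12'],
    'sentiment_score': [1, 1, 1, 0],
    'track_id': ['a', 'a', 'b', 'a'],
})

toy['ratings'] = (toy.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
                  / toy.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))

# user '11' with sentiment 1 has 3 rows, 2 of them track 'a':
# the two 'a' rows get 2/3, the 'b' row gets 1/3, and user '12' gets 1.0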

Or use groupby with value_counts, passing normalize=True, and merge the result back into df4:

df4 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                   .value_counts(normalize=True)
                   .rename('ratings').reset_index(),
                how='left')

In both cases, you will get the same result as your list ratings (which I assume you wanted as a column). I would say the second option is faster, but it depends on the number of groups in your real data; a rough timing sketch follows.
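
If you want to check on data shaped like yours, a harness like the one below can compare the two options. The synthetic frame here is an assumption, not your real data, so the timings will differ:

import time

import numpy as np
import pandas as pd

# synthetic stand-in for df4; real timings depend on group counts and dtypes
df4 = pd.DataFrame({
    'track_id': np.random.choice(['t1', 't2', 't3', 't4'], 3_000_000),
    'sentiment_score': np.random.choice([0, 1], 3_000_000),
    'user_id': np.random.choice(['11', '12', '13'], 3_000_000),
})

start = time.perf_counter()
opt1 = (df4.groupby(['user_id', 'sentiment_score', 'track_id'])['track_id'].transform('size')
        / df4.groupby(['user_id', 'sentiment_score'])['track_id'].transform('size'))
print(f"two transforms:       {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
opt2 = df4.merge(df4.groupby(['user_id', 'sentiment_score'])['track_id']
                    .value_counts(normalize=True)
                    .rename('ratings').reset_index(),
                 how='left')
print(f"value_counts + merge: {time.perf_counter() - start:.2f}s")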

Ben.T