Sum values of a column (Revenue) based on: the values of another column (Date) AND the value of another column (UserId) in pandas

Question

I have two tables/dataframes: users and activity.

In users I have the following columns: UserId, Country, DOB, Gender, RegDate, WeekAfterRegDate

where:

UserId: the Id of each user (only appears once in this table), there only one row for each UserId in this dataframe/table --> It is also the key column that links both tables/dataframes

DOB: date of birth

RegDate: Registration date of the user

WeekAfterRegDate: The date after 7 days since registration

In activity I have the following columns: UserId, Date, Revenue

where:

UserId: the same column as in the users, but it can appear in more than one row here as there dare different revenues

I need to calculate the average revenue generated per user in the first week

And I have been given these clues, which might be useful:

Merge the 2 datasets
Calculate the days since registration for each user and date in the activity table
Consider ALL REVENUE (not just the one generated by each user) generated in the first 7 days after registration for each user

In summary what I need to do is make a loop that sums Renevue between two Dates for each UserId. The period between the two dates is RegDate and WeekAfterRegDate.

I have been trying different methods, like groupby, etc, but I am a bit lost.

score 1 · Answer 1 · answered Mar 10 '21 at 11:49

Make sure your date column is actually in datetime, since you won't be able to compare strings in order to filter out only those instances within the first week. See here for converting strings into datetime.

Merge both tables:

df_merged = pd.merge(activity,users,on='UserID')

You get the activity table including the respective dates in each row.

Filter the merged list:

df_merged = df_merged.loc[df_merged['Date'] >= df_merged['RegDate']] # lower bound
df_merged = df_merged.loc[df_merged['Date'] < df_merged['WeekAfterRegDate']] # upper bound

The table now contains only the relevant rows.

Now group by user and sum the revenue:

df_revenue = df_merged.groupby('UserID')['Revenue'].sum()

Okay, thank you. That was very helpful :) – IniStz Mar 11 '21 at 16:12 — IniStz, Mar 11 '21 at 16:12

score 0 · Answer 2 · answered Mar 10 '21 at 11:19

here's what i'd do: first, make a list of the users from the first dataframe

user_list = first_df.UserId.unique().tolist()

then iterate over this list and over the second database something like this:

revenue_total = 0
for i in range(len(user_list)):
    for x in range(len(second_df):
        if second_df['userid'][x] == user_list[i] and second_df['Date'][x] <= first_df['WeekAfterRegDate'][i]:
            revenue_total = revenue_total + second_df['Revenue'][x]

then just simply divide the total revenue with the total users

 total_revenue /len(user_list)

Sum values of a column (Revenue) based on: the values of another column (Date) AND the value of another column (UserId) in pandas

2 Answers2