
I have a data set in which the first column is the Date, the second column is the Collaborator and the third column is the price paid.

[image: sample of the data set]

I want to get the mean price paid by every Collaborator in the previous month. I want to return a table that looks like this:

[image: desired output table]

I tried some solutions such as rolling, but I could only get the mean over the past X days, not over the previous month.

jvfach
Can you post your expected output along with reproducible code, so that others can try a solution quickly? – SunilG Jun 01 '21 at 12:12

3 Answers


Pandas has a built-in method, .rolling():

x = 3  # window size: the number of consecutive rows to average over
df.rolling(x).mean()  # mean of each window of x rows

Hence:

df['LastMonthMean'] = df['Price'].rolling(x).mean()

I'm not sure how you want to calculate your mean, but I hope this helps.
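Note that an integer window counts rows, not calendar time, which may be why rolling only gave "the past X days" in the question. A minimal sketch of the difference, using a small made-up frame (the dates and prices below are illustrative assumptions, not the asker's data):

```python
import pandas as pd

# Hypothetical sample data: four purchases spread over two months.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-03", "2021-02-17"]),
    "Price": [10.0, 20.0, 30.0, 40.0],
})

# Integer window: mean of the current row and the row before it,
# whatever their dates are.
by_rows = df["Price"].rolling(2).mean()

# Offset window: mean of all rows in the 30 days ending at each row's date.
# This needs a DatetimeIndex (or the on= argument).
by_days = df.set_index("Date")["Price"].rolling("30D").mean()

print(by_rows)
print(by_days)
```

Neither window follows calendar months, so a "previous month" mean needs grouping by month rather than rolling.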

Jamie T

I would first add a month column and then use groupby on the collaborator and the month to compute the mean:

import pandas as pd
df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2, 2],
    'collaborator': [1, 2, 3, 1, 2, 3],
    'price': [100, 200, 300, 400, 500, 600]
})

df.groupby(['collaborator', 'month']).mean()
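To go from this monthly mean to the previous month's mean on every row, one option is to relabel each mean with the following month and merge it back. This is only a sketch under assumptions: the column names Date and LastMonthMean and the sample values are made up for illustration, and the month is stored as a pandas Period so that year boundaries are handled.

```python
import pandas as pd

# Hypothetical sample data with a real Date column.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-10", "2021-01-20", "2021-02-05",
                            "2021-02-15", "2021-01-12", "2021-02-08"]),
    "Collaborator": [1, 1, 1, 1, 2, 2],
    "Price": [100.0, 200.0, 300.0, 500.0, 50.0, 70.0],
})

# Calendar month of each row as a Period (e.g. 2021-01).
df["Month"] = df["Date"].dt.to_period("M")

# Mean price per collaborator and month.
monthly = (
    df.groupby(["Collaborator", "Month"])["Price"]
      .mean()
      .reset_index(name="LastMonthMean")
)

# Relabel each mean with the following month, so that it lines up
# with the rows it should be attached to.
monthly["Month"] = monthly["Month"] + 1

# A left merge keeps every original row; rows without a previous month get NaN.
result = df.merge(monthly, on=["Collaborator", "Month"], how="left")
print(result)
```

In this sample, the February rows of collaborator 1 receive the January mean of 150.0, and all January rows are left as NaN.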
Amin Ba

The rolling() method would have to be applied to the DataFrame grouped by Collaborator to obtain each collaborator's mean sale price in the previous month. Because the data would be grouped and summarised, the result would have fewer rows than the original dataset, so it could not simply be appended to it.

If you use a DatetimeIndex in your DataFrame, pandas treats it as a time series, and you can then resample() the data more easily.
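As a small illustration of what a DatetimeIndex makes possible, the sketch below groups by collaborator and calendar month in one step. It uses pd.Grouper(freq="MS") rather than resample(), and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data indexed by purchase date.
index = pd.to_datetime(["2021-01-10", "2021-01-20", "2021-02-05", "2021-02-15"])
data = pd.DataFrame(
    {"Collaborator": [1, 1, 1, 2], "Price": [10.0, 20.0, 30.0, 40.0]},
    index=index,
)

# Group by collaborator and by calendar month of the index.
# "MS" labels each month by its first day.
monthly_mean = data.groupby(["Collaborator", pd.Grouper(freq="MS")])["Price"].mean()
print(monthly_mean)
```

The result is a Series with a (Collaborator, month start) MultiIndex, which can then be shifted and merged back onto the original rows.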

I have produced a reproducible solution below, based on your question, in which I resample the data and append the previous month's mean to each row. Thanks to @akilat90 for the function that generates random dates within a range.

import pandas as pd
import numpy as np

def random_dates(start, end, n=10):
    # Function copied from @akilat90
    # Available on https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
    
    start_u = pd.to_datetime(start).value//10**9
    end_u = pd.to_datetime(end).value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

size = 1000

index = random_dates(start='2021-01-01', end='2021-06-30', n=size).sort_values()

collaborators = np.random.randint(low=1, high=4, size=size)

prices = np.random.uniform(low=5., high=25., size=size)

data = pd.DataFrame({'Collaborator': collaborators,
                     'Price': prices}, index=index)

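# Mean price per collaborator for each calendar month.
# In pandas 2.2 and later, the 'M' alias is deprecated in favour of 'ME'.
monthly_mean = data.groupby('Collaborator').resample('M')['Price'].mean()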

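# Attach each month's mean to the rows of the following month: the month key
# on the right-hand side is shifted forward by one.
# Matching on the month number alone assumes the data stays within one calendar year.
data_final = pd.merge(data, monthly_mean, how='left', left_on=['Collaborator', data.index.month],
         right_on=[monthly_mean.index.get_level_values('Collaborator'), monthly_mean.index.get_level_values(1).month + 1])

# Restore the original DatetimeIndex, drop the helper merge key and rename the columns.
data_final.index = data.index
data_final = data_final.drop('key_1', axis=1)
data_final.columns = ['Collaborator', 'Price', 'LastMonthMean']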

This is the output:

                     Collaborator      Price  LastMonthMean
2021-01-31 04:26:16             2  21.838910            NaN
2021-01-31 05:33:04             2  19.164086            NaN
2021-01-31 12:32:44             2  24.949444            NaN
2021-01-31 12:58:02             2   8.907224            NaN
2021-01-31 14:43:07             1   7.446839            NaN
2021-01-31 18:38:11             3   6.565208            NaN
2021-02-01 00:08:25             2  24.520149      15.230642
2021-02-01 09:25:54             2  20.614261      15.230642
2021-02-01 09:59:48             2  10.879633      15.230642
2021-02-02 10:12:51             1  22.134549      14.180087
2021-02-02 17:22:18             2  24.469944      15.230642

As you can see, the records from January 2021, the first month in this time series, have no valid LastMonthMean, unlike the records from February.