0

Code to generate random database for question (minimum reproducible issue):

df_random = pd.DataFrame(np.random.random((2000,3)))
df_random['order_date'] = pd.date_range(start='1/1/2015', 
periods=len(df_random), freq='D')
df_random['customer_id'] = np.random.randint(1, 20, df_random.shape[0])
df_random

Output df_random

        0          1               2    order_date  customer_id
0   0.018473    0.970257    0.605428    2015-01-01    12
    ... ... ... ... ... ...
    1999    0.800139    0.746605    0.551530    2020-06-22  11

Code to extract mean unique transactions month and year wise

for y in (2015,2019):
for x in (1,13):
    df2 = df_random[(df_random['order_date'].dt.month == x)&(df_random['order_date'].dt.year== y)]
    df2.sort_values(['customer_id','order_date'],inplace=True)
    df2["days"] = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D"))
    df_mean=round(df2['days'].mean(),2)
    data2 = data.append(pd.DataFrame({'Mean': df_mean , 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
    print(data2)

Expected output

  Mean         Month  Year
0   5.00          1  2015
    .......................
11  6.62         12  2015

..............Mean values of days after which one transaction occurs in order_date for years 2016 and 2017 Jan to Dec

36  6.03          1  2018
..........................
47  6.76         12  2018
48  8.40          1  2019
.......................
48  8.40         12  2019

Basically I want single dataframe starting from 2015 Jan month to 2019 December

Instead of the expected output I am getting dataframe from Jan 2015 to Dec 2018 , then again Jan 2015 data and then the entire dataset repeats again from 2015 to 2018 many more times.

Please help

noob
  • 3,601
  • 6
  • 27
  • 73
  • with your current code, i am just getting 4 lines of output – anky Jan 02 '20 at 14:35
  • Hi I have edited code since i was not getting responses to reduce the problem as much as possible. pls see now – noob Jan 02 '20 at 14:37

2 Answers2

1

Try this:

data2 = pd.DataFrame([])
for y in range(2015,2020):
    for x in range(1,13):
        df2 = df_random[(df_random['order_date'].dt.month == x)&(df_random['order_date'].dt.year== y)]
        df_mean=df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")).mean().round(2)
        data2 = data2.append(pd.DataFrame({'Mean': df_mean , 'Month': x, 'Year': y}, index=[0]), ignore_index=True)

print(data2)

Try this :

df_random.order_date = pd.to_datetime(df_random.order_date)
df_random = df_random.set_index(pd.DatetimeIndex(df_random['order_date']))
output = df_random.groupby(pd.Grouper(freq="M"))[[0,1,2]].agg(np.mean).reset_index()
output['month'] = output.order_date.dt.month
output['year'] = output.order_date.dt.year
output = output.drop('order_date', axis=1)
output

Output

0   1   2   month   year
0   0.494818    0.476514    0.496059    1   2015
1   0.451611    0.437638    0.536607    2   2015
2   0.476262    0.567519    0.528129    3   2015
3   0.519229    0.475887    0.612433    4   2015
4   0.464781    0.430593    0.445455    5   2015
... ... ... ... ... ...
61  0.416540    0.564928    0.444234    2   2020
62  0.553787    0.423576    0.422580    3   2020
63  0.524872    0.470346    0.560194    4   2020
64  0.530440    0.469957    0.566077    5   2020
65  0.584474    0.487195    0.557567    6   2020

abhilb
  • 5,639
  • 2
  • 20
  • 26
  • Pls correct the loop itself. I have not shared the exact issue as it was too complex but i need to use the loop – noob Jan 02 '20 at 14:31
  • 3
    There are very rare situations where you actually need `for loop` in pandas. And this is coming from someone who is using it every day. There are many methods we can chain to achieve our desired results. Your question is a typical [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). – Erfan Jan 02 '20 at 14:39
  • Error: float object has no parameter round – noob Jan 02 '20 at 15:37
  • [Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Jan 03 '20 at 23:06
0

Avoid any looping and simply include year and month in groupby calculation:

np.random.seed(1022020)
...
# ASSIGN MONTH AND YEAR COLUMNS, THEN SORT COLUMNS
df_random = (df_random.assign(month = lambda x: x['order_date'].dt.month,
                              year = lambda x: x['order_date'].dt.year)
                      .sort_values(['customer_id', 'order_date']))

# GROUP BY CALCULATION
df_random["days"] = (df_random.groupby(["customer_id", "year", "month"])["order_date"]
                              .apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")))

# FINAL MEAN AGGREGATION BY YEAR AND MONTH
final_df = (df_random.groupby(["year", "month"], as_index=False)["days"].mean().round(2)
                     .rename(columns={"days":"mean"}))

print(final_df.head())

#    year  month   mean
# 0  2015      1   8.43
# 1  2015      2   5.87
# 2  2015      3   4.88
# 3  2015      4  10.43
# 4  2015      5   8.12

print(final_df.tail())

#     year  month  mean
# 61  2020      2  8.27
# 62  2020      3  8.41
# 63  2020      4  8.81
# 64  2020      5  9.12
# 65  2020      6  7.00

For multiple aggregates, replace the single groupby.mean() to groupby.agg():

final_df = (df_random.groupby(["year", "month"], as_index=False)["days"]
                     .agg(['count', 'min', 'mean', 'median', 'max'])
                     .rename(columns={"days":"mean"}))

print(final_df.head())
#             count  min   mean  median   max
# year month
# 2015 1         14  1.0   8.43     5.0  25.0
#      2         15  1.0   5.87     5.0  17.0
#      3         16  1.0   4.88     5.0   9.0
#      4         14  1.0  10.43     7.5  23.0
#      5         17  2.0   8.12     8.0  17.0

print(final_df.tail())
#             count  min  mean  median   max
# year month
# 2020 2         15  1.0  8.27     6.0  21.0
#      3         17  1.0  8.41     7.0  16.0
#      4         16  1.0  8.81     7.0  20.0
#      5         16  1.0  9.12     7.0  22.0
#      6          7  2.0  7.00     7.0  17.0
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • What if I wish to include median too instead of just mean? – noob Jan 02 '20 at 15:33
  • 1
    For multiple aggregates, use `groupby.agg()` instead of just `groupby.mean()`. – Parfait Jan 02 '20 at 16:01
  • That's interesting as days diff calculation should be within each customer in each month of each year. Can you please set up a reproducible example showing what you claim? Be sure to set the seed if using random data like I do here. – Parfait Jan 03 '20 at 23:09