2

I am stuck at a time consumption related problem in pandas:

The code looks like this:

df = pd.DataFrame({"IDs": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "Month": ["01", "02", "01", "01", "02", "01", "01", "02", "01"],
                   "column1": [0.9, 0.5, 0.3, 0.8, 0.5, 0.1, 0.6, 0.2, 0.8]})

df_list = []
for id in df.IDs.unique():
    temp = df[df.IDs == id]
    temp = temp.groupby("Month").mean()
    temp2 = temp['column1'].ewm(span=3, adjust=True).sum()
    df_list.append(temp2)

Note that unique IDs contains around 500k elements, the original dataframe df contains around 6 mil records.

Now I used tqdm to check the estimated time and it's taking 14-15 hrs to complete. If I have even loop only for temp = df[df.IDs == id] line and the estimated time is same (basically those are pandas function, so it should not make any performance issue). So the problem lies in that line.

Is there any alternative way to do this? Thanks for any suggestion.

BeRT2me
  • 12,699
  • 2
  • 13
  • 31
Stat_prob_001
  • 181
  • 1
  • 7

1 Answers1

1
out = (df.groupby(['IDs', 'Month'])
         .mean()['column1']
         .ewm(span=3, adjust=True)
         .sum())
print(out.reset_index())

Output:

   IDs Month  column1
0    1    01  0.60000
1    1    02  0.80000
2    2    01  0.85000
3    2    02  0.92500
4    3    01  1.16250
5    3    02  0.78125

out = (df.groupby('IDs')
         .apply(lambda x: x.groupby('Month')
                           .mean()['column1']
                           .ewm(span=3, adjust=True)
                           .sum())
         .stack())
print(out.reset_index(name='column1'))

Output:

   IDs Month  column1
0    1    01    0.600
1    1    02    0.800
2    2    01    0.450
3    2    02    0.725
4    3    01    0.700
5    3    02    0.550
BeRT2me
  • 12,699
  • 2
  • 13
  • 31
  • I do not know why these give different results, but that probably comes from not understanding what `ewm` does. – BeRT2me Jul 23 '22 at 18:44
  • "ewm" is exponential weighted moving average. The result I got is fine, it's same when I was using loop. – Stat_prob_001 Jul 23 '22 at 18:49