3

I have a question regarding resampling of DataFrames.

import pandas as pd
df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20], 
                   ['2005-01-27', 40], ['2005-01-28', 50]],
                   columns=['date', 'num'])

# Convert the column to datetime 
df['date'] = pd.to_datetime(df['date'])

# Resample and aggregate results by week
df = df.resample('W', on='date')['num'].sum().reset_index()

print(df.head())
# OUTPUT: 
#         date  num
# 0 2005-01-23   30
# 1 2005-01-30   90 

Everything works as expected, but I would like to better understand what exactly resample(),['num'] and sum() do here.

QUESTION #1

Why the following happens:

The result of df.resample('W', on='date') is DatetimeIndexResampler.

The result of df.resample('W', on='date')['num'] is pandas.core.groupby.SeriesGroupBy.

The result of df.resample('W', on='date')['num'].sum() is

date
2005-01-23    30
2005-01-30    90
Freq: W-SUN, Name: num, dtype: int64

QUESTION #2

Is there a way to produce the same results without resampling? For example, using groupby.

Konstantin
  • 2,937
  • 10
  • 41
  • 58

1 Answers1

3

Answer1

  1. As the docs says, .resample returns a Resampler Object. Hence you get DatetimeIndexResampler because date is a datetime object.

  2. Now, you get <pandas.core.groupby.SeriesGroupBy because you are looking for Series from the dataframe based of off the Resampler object.

    Oh by the way,

    df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'] Would return <pandas.core.groupby.SeriesGroupBy as well.

  3. Now when you do .sum(), you are getting the sum over the requested axis of the dataframe. You get a Series because you are doing sum over the pandas.core.series.Series.

Answer2

You can achieve results using groupby with the help from Grouper as follow:

df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'].sum()

Output:

date
2005-01-23    30
2005-01-30    90
Name: num, dtype: int64
Community
  • 1
  • 1
harvpan
  • 8,571
  • 2
  • 18
  • 36
  • Thank you for your reply! Do we retrieve columns at step 2 that we previously grouped by `date`? If we retrieve multiple columns, not only `num`, can we use different aggregation methods like `sum()` and `mean()` for different columns? – Konstantin May 24 '18 at 09:59
  • 1
    @Konstantin, yes, you can as many as columns as you need. Apply `.sum()` or `.mean()`. For example, if you had some another column like `val`. You get the some using `df.resample('W', on='date')[['num', 'val']].sum()` for both the columns. – harvpan May 24 '18 at 13:53
  • But is it possible to use `.sum()` for one of them and `.mean()` for another? – Konstantin May 24 '18 at 14:52
  • 1
    @Konstantin, absolutely. https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns – harvpan May 24 '18 at 15:28
  • @Konstantin Glad to help. Happy coding. – harvpan May 24 '18 at 15:55