39

I have a dataframe, df, which is as follows:

| date      | Revenue |
|-----------|---------|
| 6/2/2017  | 100     |
| 5/23/2017 | 200     |
| 5/20/2017 | 300     |
| 6/22/2017 | 400     |
| 6/21/2017 | 500     |

I need to group the above data by month to get output as:

| date | SUM(Revenue) |
|------|--------------|
| May  | 500          |
| June | 1000         |

I tried this code, but it did not work:

df.groupby(month('date')).agg({'Revenue': 'sum'})

I want to use only Pandas or NumPy, with no additional libraries.

Peter Mortensen
Symphony
  • `df.groupby(pd.Grouper(key='Date', freq='M')).agg({'Revenue': 'sum'})`, this assumes the data type of the date column is datetime – gold_cy Jul 04 '17 at 14:25

6 Answers

78

Try this:

In [6]: df['date'] = pd.to_datetime(df['date'])

In [7]: df
Out[7]:
        date  Revenue
0 2017-06-02      100
1 2017-05-23      200
2 2017-05-20      300
3 2017-06-22      400
4 2017-06-21      500
In [59]: df.groupby(df['date'].dt.strftime('%B'))['Revenue'].sum().sort_values()
Out[59]:
date
May      500
June    1000
shivsn
  • Up vote because it's the only answer which formats the `date` column properly – gold_cy Jul 04 '17 at 14:42
  • FYI this gives you a string column for the date, which is not as performant nor as useful as real resampling / time grouping – Jeff Jul 04 '17 at 15:30
  • @shivsn: can this be sorted by date, ascending (May 500 and then June 1000)? – Symphony Jul 04 '17 at 19:02
  • What do you mean by df? How do you import df? – Ragulan Sep 05 '18 at 04:42
  • @Ragulan28 df is the name of the DataFrame, e.g. df = pd.read_csv('file.csv') – Paras jain Dec 15 '18 at 08:43
  • I want to add one more groupby column in your solution, like lead_visits.groupby(['date', 'status'], as_index=False), but with the month of each date instead of 'date' – Akshit jain Nov 20 '19 at 10:21
  • Method 3 by @yongtw123 [here](https://stackoverflow.com/a/50962588/10916136) is much faster. – Meet Jun 18 '21 at 10:39
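Regarding Symphony's ordering question above: month names sort alphabetically, so one way to keep calendar order without extra libraries is an ordered Categorical. A sketch, assuming the question's sample data (the `months` variable is introduced here for illustration):

```python
import calendar

import pandas as pd

df = pd.DataFrame({'date': ['6/2/2017', '5/23/2017', '5/20/2017',
                            '6/22/2017', '6/21/2017'],
                   'Revenue': [100, 200, 300, 400, 500]})
df['date'] = pd.to_datetime(df['date'])

# Make the month names an ordered categorical (January..December) so that
# groupby sorts them in calendar order instead of alphabetically
months = pd.Categorical(df['date'].dt.strftime('%B'),
                        categories=list(calendar.month_name)[1:],
                        ordered=True)
out = df.groupby(months, observed=True)['Revenue'].sum()
print(out)
```

With `observed=True`, months that never occur in the data are dropped from the result.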
34

Try a groupby using a pandas Grouper:

df = pd.DataFrame({'date': ['6/2/2017', '5/23/2017', '5/20/2017', '6/22/2017', '6/21/2017'],
                   'Revenue': [100, 200, 300, 400, 500]})
df.date = pd.to_datetime(df.date)
dg = df.groupby(pd.Grouper(key='date', freq='1M')).sum()  # group by each 1 month
dg.index = dg.index.strftime('%B')

Output:

      Revenue
May       500
June     1000
qbzenker
  • This seems to be the best combination of performance and general usability – MichaelA Dec 23 '19 at 10:17
  • This is the most usable since the months are aggregated in order. This should be the top answer. – JJSanDiego Feb 02 '21 at 03:43
  • This doesn't work if your dates span multiple years. Then instead of just one row per month (say May and June above), it will have multiple May's and June's, one for each year. – Meet Jun 18 '21 at 10:36
  • This answer is wrong for that use case: it groups by year-month, not by month. – baklarz2048 Oct 27 '22 at 12:30
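As the comments note, `pd.Grouper(freq='M')` bins by year *and* month. A sketch with hypothetical multi-year data, showing the difference and one way to collapse across years:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2016-05-01', '2017-05-15', '2017-06-20']),
                   'Revenue': [50, 100, 200]})

# One bin per year-month: 2016-05 and 2017-05 stay separate,
# and every intermediate month appears as an (empty) bin
per_year_month = df.groupby(pd.Grouper(key='date', freq='M'))['Revenue'].sum()

# Collapse across years onto calendar months instead
per_month = df.groupby(df['date'].dt.month)['Revenue'].sum()
print(per_month)
```

Here `per_year_month` has 14 rows (2016-05 through 2017-06), while `per_month` has just two (month 5 and month 6).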
12

For a DataFrame with many rows, using strftime takes considerably more time. If the date column already has dtype datetime64[ns] (use pd.to_datetime() to convert, or specify parse_dates during CSV import, etc.), you can group directly on a datetime property (Method 3). The speedup is substantial.

import numpy as np
import pandas as pd

T = pd.date_range(pd.Timestamp(0), pd.Timestamp.now()).to_frame(index=False)
T = pd.concat([T for i in range(1,10)])
T['revenue'] = pd.Series(np.random.randint(1000, size=T.shape[0]))
T = T.rename(columns={0: 'date'})

print(T.shape) #(159336, 2)
print(T.dtypes) #date: datetime64[ns], revenue: int32

Method 1: strftime

%timeit -n 10 -r 7 T.groupby(T['date'].dt.strftime('%B'))['revenue'].sum()

1.47 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Method 2: Grouper

%timeit -n 10 -r 7 T.groupby(pd.Grouper(key='date', freq='1M')).sum()
#NOTE Manually map months as integer {01..12} to strings

56.9 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Method 3: datetime properties

%timeit -n 10 -r 7 T.groupby(T['date'].dt.month)['revenue'].sum()
#NOTE Manually map months as integer {01..12} to strings

34 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

yongtw123
    Note that if you have data from more than 1 year, methods 1 and 3 aggregate over them whereas method 2 does not. Also, the result from method 1 is sorted alphabetically. – HenriV Sep 12 '18 at 14:08
  • How to arrange the outputs in ascending order, since the result is a Series, not a DataFrame? – alex3465 May 28 '21 at 12:17
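The "manually map months as integer {01..12} to strings" note in Methods 2 and 3 can be sketched as follows, using the question's sample data; grouping on the integer month also answers the ordering question above, since keys 1..12 sort in calendar order:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2017-06-02', '2017-05-23', '2017-05-20',
                                           '2017-06-22', '2017-06-21']),
                   'revenue': [100, 200, 300, 400, 500]})

# Method 3: group on the integer month (fast, calendar-ordered keys)
out = df.groupby(df['date'].dt.month)['revenue'].sum()

# Map the integer index {1..12} to month names; only 12 lookups,
# not one strftime per row (the year 2000 below is a dummy value)
out.index = out.index.map(lambda m: pd.Timestamp(year=2000, month=m, day=1).month_name())
print(out)
```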
3

This will work better. Try this:

# Explicitly convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Set the date column as the index
df.set_index('Date', inplace=True)

# Use 'M' for monthly; change the frequency string if you need another period
df['Revenue'].resample('M').sum()

This code gives the same result as shivsn's answer in the first post, but with the date set as the index we can perform many more operations on the data.

For example, sorted by value:

>>> df['Date'] = pd.to_datetime(df['Date'])
>>> df.set_index('Date',inplace=True)
>>> df['withdrawal'].resample('M').sum().sort_values()
Date
2019-10-31     28710.00
2019-04-30     31437.00
2019-07-31     39728.00
2019-11-30     40121.00
2019-05-31     46495.00
2020-02-29     57751.10
2019-12-31     72469.13
2020-01-31     76115.78
2019-06-30     76947.00
2019-09-30     79847.04
2020-03-31     97920.18
2019-08-31    205279.45
Name: withdrawal, dtype: float64

shivsn's code gives the same totals, keyed by month name:

>>> df.groupby(df['Date'].dt.strftime('%B'))['withdrawal'].sum().sort_values()
Date
October       28710.00
April         31437.00
July          39728.00
November      40121.00
May           46495.00
February      57751.10
December      72469.13
January       76115.78
June          76947.00
September     79847.04
March         97920.18
August       205279.45
Name: withdrawal, dtype: float64
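To illustrate the "lot more operations" point: with a DatetimeIndex, resample can apply several aggregations in one pass. A sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2017-05-20', '2017-06-02', '2017-06-22']),
                   'Revenue': [300, 100, 400]}).set_index('Date')

# One pass, several aggregations per month
monthly = df['Revenue'].resample('M').agg(['sum', 'mean', 'count'])
print(monthly)
```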
    Why would you have to set the date as the `DataFrame` index when `groupby` works just fine? What happens if you have same date repeated? I would also consider the date formatting as first step, the highest voted answer, because it is not clear from the question if that column will be properly formatted. – user7440787 Apr 06 '20 at 17:50
  • Whether it is a duplicate or unique, it is going to sum up all those values; it acts the same as groupby. And apart from the groupby options, by setting the date column as the index you can perform a lot more basic operations. Thanks. – Jeywanth Kannan Apr 07 '20 at 20:14
0

Try this:

  1. Change the Date column into datetime format:

     df['Date'] = pd.to_datetime(df['Date'])

  2. Insert a new column in the data frame holding the month name, like ['May', 'June']:

     df['months'] = df['Date'].apply(lambda x: x.strftime('%B'))

     Here x is a date taken from the Date column of the data frame.

  3. Now group the data by the months column and sum the revenue:

     response_data_frame = df.groupby('months')['Revenue'].sum()
     print(response_data_frame)

Output (groupby sorts the month names alphabetically):

months
June    1000
May      500
Name: Revenue, dtype: int64
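The steps above can be put together as one runnable sketch, using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['6/2/2017', '5/23/2017', '5/20/2017',
                            '6/22/2017', '6/21/2017'],
                   'Revenue': [100, 200, 300, 400, 500]})

df['Date'] = pd.to_datetime(df['Date'])                      # step 1
df['months'] = df['Date'].apply(lambda x: x.strftime('%B'))  # step 2
response_data_frame = df.groupby('months')['Revenue'].sum()  # step 3
print(response_data_frame)
```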
Shubham gupta
0
df['Month'] = pd.DatetimeIndex(df['date']).month_name()

Using this you should get:

| date      | Revenue | Month |
|-----------|---------|-------|
| 6/2/2017  | 100     | June  |
| 5/23/2017 | 200     | May   |
| 5/20/2017 | 300     | May   |
| 6/22/2017 | 400     | June  |
| 6/21/2017 | 500     | June  |
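To finish the aggregation from that Month column, a short sketch using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'date': ['6/2/2017', '5/23/2017', '5/20/2017',
                            '6/22/2017', '6/21/2017'],
                   'Revenue': [100, 200, 300, 400, 500]})
df['Month'] = pd.DatetimeIndex(df['date']).month_name()

# Sum revenue per month name
result = df.groupby('Month')['Revenue'].sum()
print(result)
```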
Kenrich