1

I have this dataframe:

dff = pd.DataFrame({"country":["cn","cn","cn", "cn","bn","bn","sn"], 'date':['2019-6-6','2019-6-5','2019-7-6','2019-7-9','2017-4-3','2017-3-2','2019-2-9'],'Revenue':[100,200,100,90,300,400,500], "class":["A","C", "B","B", "C","A","B"]})            
    dff

    country date    Revenue class
0   cn  2019-6-6    100       A
1   cn  2019-6-5    200       C
2   cn  2019-7-6    100       B
3   cn  2019-7-9    90        B
4   bn  2017-4-3    300       C
5   bn  2017-3-2    400       A
6   sn  2019-2-9    500       B

I want to achieve three features:

  1. The date 2019-6-6 are modified to month Jun-2019
  2. The revennue are aggregated according to month
  3. The revenue are classified according to the class

The result table should look like this:

country date         A   B   C
0   cn  Jun-2019    100  0   200   
1   cn  Jul-2019    0    190  0
2   bn  Apl-2017    0    0   300
3   bn  Mar-2017    400  0     0
4   sn  Feb-2019    0    500   0

To achieve feature 1,

dff.date = pd.to_datetime(dff.date )
dff.date = dff.date.apply(lambda x : x.strftime("%b-%Y"))
dff
country date    Revenue class
0   cn  Jun-2019    100 A
1   cn  Jun-2019    200 C
2   cn  Jul-2019    100 B
3   cn  Jul-2019    90  B
4   bn  Apr-2017    300 C
5   bn  Mar-2017    400 A
6   sn  Feb-2019    500 B

To achieve feature 2,

dff1= dff.groupby(["date"]).sum()
dff1
    Revenue
date    
Apr-2017    300
Feb-2019    500
Jul-2019    190
Jun-2019    300
Mar-2017    400

To achieve feature 3,

dff.pivot( columns='class')['Revenue']
class   A   B   C
0   100.0   NaN NaN
1   NaN NaN 200.0
2   NaN 100.0   NaN
3   NaN 90.0    NaN
4   NaN NaN 300.0
5   400.0   NaN NaN
6   NaN 500.0   NaN

I feel I am almost there, but i can't figure out how to combine this simultaneously to achieve the final result. Any idea is appreciated!

statistics_learning
  • 427
  • 1
  • 4
  • 14

2 Answers2

1

IIUC

df=dff.groupby([dff.country,dff.date.dt.strftime('%b-%Y'),dff['class']]).sum()['Revenue'].\
    unstack(fill_value=0)
df# df=df.reset_index()
Out[13]: 
class               A    B    C
country date                   
bn      Apr-2017    0    0  300
        Mar-2017  400    0    0
cn      Jul-2019    0  190    0
        Jun-2019  100    0  200
sn      Feb-2019    0  500    0

Or

pd.crosstab(index=[dff.country,dff.date.dt.strftime('%b-%Y')],columns=dff['class'], values=dff['Revenue'], aggfunc='sum').fillna(0)
Out[20]: 
class                 A      B      C
country date                         
bn      Apr-2017    0.0    0.0  300.0
        Mar-2017  400.0    0.0    0.0
cn      Jul-2019    0.0  190.0    0.0
        Jun-2019  100.0    0.0  200.0
sn      Feb-2019    0.0  500.0    0.0
BENY
  • 317,841
  • 20
  • 164
  • 234
0

You can do pivot_table after converting the date:

df.pivot_table(index=['country','date'],
               columns='class', 
               values='Revenue', 
               aggfunc='sum',
               fill_value=0)

Output:

class               A    B    C
country date                   
bn      Apr-2017    0    0  300
        Mar-2017  400    0    0
cn      Jul-2019    0  190    0
        Jun-2019  100    0  200
sn      Feb-2019    0  500    0
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Thank you very much! It is closing to my ideal output. But I want the `country` and `date` are the same row as `A`, `B` and `C`, and the values in `country` column are not unique, it can be duplicate but same row as other column. You can review my idea result table. Do you know how to get there? – statistics_learning Feb 10 '20 at 16:09
  • chain `reset_index()` to the above? – Quang Hoang Feb 10 '20 at 16:10