0

I have a pandas multilevel dataframe df to contain the quarterly financial report data for about 2000+ stocks from year 2006 to 2012 . And I am trying to figure out a way to quickly calculate the 'average' values for each data point.

demo_data() is the function to generate the demo data (df = demo_data(stk_qty=2000, col_num=200) can be used to simulate the financial report data):

def demo_data(stk_qty, col_num):
    ''' generate demo data, return multilevel dataframe ''' 
    import random
    import pandas as pd
    rpt_date_template = [(yr+qt) for yr in map(str, range(2006, 2013)) for qt in ['0331','0630','0930','1231']]
    stk_id_list = ['STK'+str(x).zfill(3) for x in range(0, stk_qty)]

    stk_id_column, rpt_date_column = [], []
    for i in range(stk_qty):
        stk_rpt_date_list = rpt_date_template[random.randint(0,8):] # rpt dates with random start
        stk_id_column.extend([stk_id_list[i]] * len(stk_rpt_date_list))
        rpt_date_column.extend(stk_rpt_date_list)

    index_name = ['STK_ID', 'RPT_Date']
    col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]

    first_level_dt  = stk_id_column
    second_level_dt = rpt_date_column

    dt = pd.DataFrame(np.random.randn(len(stk_id_column), col_num), columns=col_name)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt

    multilevel_df = dt.set_index(index_name, drop=True, inplace=False)
    return multilevel_df 

Here is a sample data. (note: sw() is a method to display the four corners data of a big dataframe, source code is at: How to preview a part of a large pandas DataFrame? )

>>> df = demo_data(5,3)
>>> df.sw()
                 COL000  COL001  COL002
STK_ID RPT_Date                        
STK000 20060630  1.8196  0.9519 -1.0526
       20060930 -0.4074 -0.9025  1.3562
       20061231 -1.1750  0.4190 -1.2976
       20070331 -0.5609  1.5190  0.4893
       20070630  0.4580 -0.3804  0.3705
       20070930 -0.4711 -1.1953 -0.0609
       20071231  0.3363  1.1949  1.2802
       20080331  1.6359  0.8355 -0.2763
       20080630  0.2697 -0.8236 -1.7095
       20080930  0.6178 -0.3742 -1.1646
.......................................
STK004 20111231 -0.3198  1.6972 -1.3281
       20120331 -1.1905 -0.4597  0.3695
       20120630 -0.8253 -0.0502 -0.2862
       20120930  0.0059 -1.8535 -1.2107
       20121231  0.5762 -0.2872  0.0993

Index : ['STK_ID', 'RPT_Date']
Column: COL000,COL001,COL002
row: 117    col: 3

The customized average function I want is named as my_avg() and defined as below rules:

1. Q1's average value is (Q4_of_previous_yr + Q1)/2
2. Q2's average value is (Q4_of_previous_yr + Q1 + Q2)/3
3. Q3's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3)/4
4. Q4's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3 + Q4)/5
5. if some of the data points are not provided, just calculate the normal average of available data points

so the my_avg(df) will have below output for each STK_ID:

STK_ID  RPT_Date       COL000            COL001         COL002
STK000       20060630   1.819619705 0.951918984 -1.052639309
             20060930   0.706112476 0.024688028 0.151757352
             20061231   0.079077767 0.156125083 -0.331359614
             20070331   -0.867930112    0.969000466 -0.404129827
             20070630   -0.425943376    0.519205768 -0.145929753
             20070930   -0.437234418    0.090579744 -0.124681449
             20071231   -0.282524858    0.3114374   0.156297097
             20080331   0.986121631 1.015202552 0.501971496
 .......................................
STK004        20111231   xxxxx       xxxxxxx    xxxxxxx

How to write the code for my_avg() ?


Reference: I try to write a temp_solution_avg() function. But it has three issues:

1. the average calculation not include 'Q4_of_previous_yr' data point, so the result is not what I want.
2. data's 'RPT_Date' must start with Q1 ('xxxx0331'), otherwise first yr's data is wrong
3. the calculation speed is very very slow. 
     In [3]: df = demo_data(500,100)
     In [4]: timeit temp_solution_avg(df)
     1 loops, best of 3: 66.3 s per loop



def temp_solution_avg(df):
    ''' return the average , Q1: not change, Q2 : (df.Q1 + df.Q2)/2 ,
        Q3: (df.Q1 + df.Q2 + df.Q3)/3, Q4 : (df.Q1 + df.Q2 + df.Q3 + df.Q4)/4
        data's 'RPT_Date' must start with Q1 ('xxxx0331'), otherwise first yr's
        data is wrong .
    '''
    dt = df.reset_index()
    dt['yr'] = dt['RPT_Date'].str[0:4]
    dt['temp_stk_id'] = dt['STK_ID']
    dt = dt.set_index(['STK_ID','RPT_Date'], drop=True, inplace=False)
    rst = dt.groupby(['temp_stk_id','yr']).transform(pd.expanding_mean)
    return rst
Community
  • 1
  • 1
bigbug
  • 55,954
  • 42
  • 77
  • 96
  • Another approach could involve creating another column - 'Q' with values ['Q1'..'Q4']. You could then pivot on this column to have columns Q1..Q4. It would then be easy to add a previous year column 'PQ4' by Q4.shift. Lastly run a custom avg for each row – user1827356 Apr 07 '13 at 13:47
  • I am not sure to catch your idea. i think finally i can figure out the way to add Q4_of_previous_yr to temp_solution_avg(). but it will be even slower and not a workable solution. the key issue of temp_solution_avg() is dt.groupby(['temp_stk_id','yr']).transform(pd.expanding_mean) is really slow. – bigbug Apr 07 '13 at 14:09
  • I experimented with your dataset for a few minutes, and I blame the slowness on treating 'yr' as a string. Once I get the years as integers, any subsequent group/transform operations run plenty fast; at least, I do not think you can do substiantially better without digging into Cython. One way to get years as integers: ``year = Series(df.index.get_level_values(1), index=df.index).map(lambda x: pd.Timestamp(x).year)`` – Dan Allan Apr 08 '13 at 03:03
  • "Once I get the years as integers, any subsequent group/transform operations run plenty fast", i test it, yes, it is. great tip. thanks. – bigbug Apr 08 '13 at 04:57

0 Answers0