1

This question is referenced from the this SO Question.

I want to perform some data analysis on pandas Dataframe. I have one dataframe like below:

                   derived_symbol  sport_name person_name      city  \
0       football.RAM.mumbai.ram_count    football         RAM    mumbai   
1       football.RAM.mumbai.mum_count    football         RAM    mumbai   
2        football.RAM.delhi.mum_count    football         RAM     delhi   
3        football.RAM.delhi.ram_count    football         RAM     delhi   
4       football.RAM.mumbai.ram_count    football         RAM    mumbai   
5       football.RAM.mumbai.mum_count    football         RAM    mumbai   
6        football.RAM.delhi.mum_count    football         RAM     delhi   
7        football.RAM.delhi.ram_count    football         RAM     delhi   
8       basketball.MAH.pune.mah_count  basketball         MAH      pune   
9     basketball.MAH.nagpur.mah_count  basketball         MAH    nagpur   
10     basketball.MAH.TOTAL.mah_count  basketball         MAH  No Entry   
11  basketball.MAH.TOTAL.nagpur_count  basketball         MAH    nagpur   
12    basketball.MAH.TOTAL.pune_count  basketball         MAH      pune   
13     football.RAM.TOTAL.delhi_count    football         RAM     delhi   
14     football.RAM.TOTAL.delhi_count    football         RAM     delhi   
15       football.RAM.TOTAL.mum_count    football         RAM  No Entry   
16       football.RAM.TOTAL.mum_count    football         RAM  No Entry   
17    football.RAM.TOTAL.mumbai_count    football         RAM    mumbai   
18    football.RAM.TOTAL.mumbai_count    football         RAM    mumbai   
19       football.RAM.TOTAL.ram_count    football         RAM  No Entry   
20       football.RAM.TOTAL.ram_count    football         RAM  No Entry   

   person_symbol       month sir  person_count  
0            ram  2017-01-23   a            10  
1            mum  2017-01-23   a            14  
2            mum  2017-01-23   a            25  
3            ram  2017-01-23   a            20  
4            ram  2017-02-22   b            34  
5            mum  2017-02-22   b            23  
6            mum  2017-02-22   b            43  
7            ram  2017-02-22   b            34  
8            mah  2017-03-03   c            10  
9            mah  2017-03-03   c            20  
10           mah  2017-03-03   c            30  
11      No Entry  2017-03-03   c            20  
12      No Entry  2017-03-03   c            10  
13      No Entry  2017-01-23   a            45  
14      No Entry  2017-02-22   b            77  
15           mum  2017-01-23   a            39  
16           mum  2017-02-22   b            66  
17      No Entry  2017-01-23   a            24  
18      No Entry  2017-02-22   b            57  
19           ram  2017-01-23   a            30  
20           ram  2017-02-22   b            68

I want to add previous_person_count column to this Dataframe. "month" column of this dataframe contains date in the format of "yyyy-mm-dd". So we need to look at the month i.e. "mm" field to identify which month it is.

By seeing this month we need to put "person_count" value into the next month's "previous_person_count" value.

Exceted Output:

                derived_symbol  sport_name person_name      city  \
0       football.RAM.mumbai.ram_count    football         RAM    mumbai   
1       football.RAM.mumbai.mum_count    football         RAM    mumbai   
2        football.RAM.delhi.mum_count    football         RAM     delhi   
3        football.RAM.delhi.ram_count    football         RAM     delhi   
4       football.RAM.mumbai.ram_count    football         RAM    mumbai   
5       football.RAM.mumbai.mum_count    football         RAM    mumbai   
6        football.RAM.delhi.mum_count    football         RAM     delhi   
7        football.RAM.delhi.ram_count    football         RAM     delhi   
8       basketball.MAH.pune.mah_count  basketball         MAH      pune   
9     basketball.MAH.nagpur.mah_count  basketball         MAH    nagpur   
10     basketball.MAH.TOTAL.mah_count  basketball         MAH  No Entry   
11  basketball.MAH.TOTAL.nagpur_count  basketball         MAH    nagpur   
12    basketball.MAH.TOTAL.pune_count  basketball         MAH      pune   
13     football.RAM.TOTAL.delhi_count    football         RAM     delhi   
14     football.RAM.TOTAL.delhi_count    football         RAM     delhi   
15       football.RAM.TOTAL.mum_count    football         RAM  No Entry   
16       football.RAM.TOTAL.mum_count    football         RAM  No Entry   
17    football.RAM.TOTAL.mumbai_count    football         RAM    mumbai   
18    football.RAM.TOTAL.mumbai_count    football         RAM    mumbai   
19       football.RAM.TOTAL.ram_count    football         RAM  No Entry   
20       football.RAM.TOTAL.ram_count    football         RAM  No Entry   

   person_symbol       month sir  person_count      previous_person_count
0            ram  2017-01-23   a            10      0
1            mum  2017-01-23   a            14      0
2            mum  2017-01-23   a            25      0
3            ram  2017-01-23   a            20      0
4            ram  2017-02-22   b            34      10
5            mum  2017-02-22   b            23      14
6            mum  2017-02-22   b            43      25
7            ram  2017-02-22   b            34      20
8            mah  2017-03-03   c            10      0
9            mah  2017-03-03   c            20      0
10           mah  2017-03-03   c            30      0
11      No Entry  2017-03-03   c            20      0
12      No Entry  2017-03-03   c            10      0
13      No Entry  2017-01-23   a            45      0
14      No Entry  2017-02-22   b            77      45
15           mum  2017-01-23   a            39      0
16           mum  2017-02-22   b            66      39
17      No Entry  2017-01-23   a            24      0
18      No Entry  2017-02-22   b            57      24
19           ram  2017-01-23   a            30      0
20           ram  2017-02-22   b            68      30

Edit Reference Code:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
           'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
               'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
        'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
        'person_count': ['10','14','25','20','34','23','43','34','10','20'],
        'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-22','2017-02-22','2017-02-22','2017-02-22','2017-03-03','2017-03-03'],
        'sir': ['a','a','a','a','b','b','b','b','c','c']})

df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]

df['person_name'] = df['person_name'].apply(symbology)

df['person_count'] = df['person_count'].astype(int)

print df
df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')

df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

df2 = df1.groupby(['derived_symbol','month','sir','sport_name','person_name','level_5','val'])['person_count'].sum().reset_index(name='person_count')

df3 = df2.set_index(['derived_symbol','month','sir','sport_name','person_name','person_count','level_5'])['val'].unstack().fillna('No Entry').rename_axis(None, 1).reset_index()

df['derived_symbol'] = df['sport_name'] + '.' + df['person_name'] + '.' + df['city'] + "."+ df['person_symbol'] + '_count'
df4 = pd.concat([df, df3]).reset_index(None)
print df3
del df4['index']
df4 = df4[['derived_symbol','sport_name','person_name','city','person_symbol','month','sir','person_count']]
print df4

Convenience:

d = {'city': {0: 'mumbai',
  1: 'mumbai',
  2: 'delhi',
  3: 'delhi',
  4: 'mumbai',
  5: 'mumbai',
  6: 'delhi',
  7: 'delhi',
  8: 'pune',
  9: 'nagpur',
  10: 'No Entry',
  11: 'nagpur',
  12: 'pune',
  13: 'delhi',
  14: 'delhi',
  15: 'No Entry',
  16: 'No Entry',
  17: 'mumbai',
  18: 'mumbai',
  19: 'No Entry',
  20: 'No Entry'},
 'derived_symbol': {0: 'football.RAM.mumbai.ram_count',
  1: 'football.RAM.mumbai.mum_count',
  2: 'football.RAM.delhi.mum_count',
  3: 'football.RAM.delhi.ram_count',
  4: 'football.RAM.mumbai.ram_count',
  5: 'football.RAM.mumbai.mum_count',
  6: 'football.RAM.delhi.mum_count',
  7: 'football.RAM.delhi.ram_count',
  8: 'basketball.MAH.pune.mah_count',
  9: 'basketball.MAH.nagpur.mah_count',
  10: 'basketball.MAH.TOTAL.mah_count',
  11: 'basketball.MAH.TOTAL.nagpur_count',
  12: 'basketball.MAH.TOTAL.pune_count',
  13: 'football.RAM.TOTAL.delhi_count',
  14: 'football.RAM.TOTAL.delhi_count',
  15: 'football.RAM.TOTAL.mum_count',
  16: 'football.RAM.TOTAL.mum_count',
  17: 'football.RAM.TOTAL.mumbai_count',
  18: 'football.RAM.TOTAL.mumbai_count',
  19: 'football.RAM.TOTAL.ram_count',
  20: 'football.RAM.TOTAL.ram_count'},
 'month': {0: '2017-01-23',
  1: '2017-01-23',
  2: '2017-01-23',
  3: '2017-01-23',
  4: '2017-02-22',
  5: '2017-02-22',
  6: '2017-02-22',
  7: '2017-02-22',
  8: '2017-03-03',
  9: '2017-03-03',
  10: '2017-03-03',
  11: '2017-03-03',
  12: '2017-03-03',
  13: '2017-01-23',
  14: '2017-02-22',
  15: '2017-01-23',
  16: '2017-02-22',
  17: '2017-01-23',
  18: '2017-02-22',
  19: '2017-01-23',
  20: '2017-02-22'},
 'person_count': {0: 10,
  1: 14,
  2: 25,
  3: 20,
  4: 34,
  5: 23,
  6: 43,
  7: 34,
  8: 10,
  9: 20,
  10: 30,
  11: 20,
  12: 10,
  13: 45,
  14: 77,
  15: 39,
  16: 66,
  17: 24,
  18: 57,
  19: 30,
  20: 68},
 'person_name': {0: 'RAM',
  1: 'RAM',
  2: 'RAM',
  3: 'RAM',
  4: 'RAM',
  5: 'RAM',
  6: 'RAM',
  7: 'RAM',
  8: 'MAH',
  9: 'MAH',
  10: 'MAH',
  11: 'MAH',
  12: 'MAH',
  13: 'RAM',
  14: 'RAM',
  15: 'RAM',
  16: 'RAM',
  17: 'RAM',
  18: 'RAM',
  19: 'RAM',
  20: 'RAM'},
 'person_symbol': {0: 'ram',
  1: 'mum',
  2: 'mum',
  3: 'ram',
  4: 'ram',
  5: 'mum',
  6: 'mum',
  7: 'ram',
  8: 'mah',
  9: 'mah',
  10: 'mah',
  11: 'No Entry',
  12: 'No Entry',
  13: 'No Entry',
  14: 'No Entry',
  15: 'mum',
  16: 'mum',
  17: 'No Entry',
  18: 'No Entry',
  19: 'ram',
  20: 'ram'},
 'sir': {0: 'a',
  1: 'a',
  2: 'a',
  3: 'a',
  4: 'b',
  5: 'b',
  6: 'b',
  7: 'b',
  8: 'c',
  9: 'c',
  10: 'c',
  11: 'c',
  12: 'c',
  13: 'a',
  14: 'b',
  15: 'a',
  16: 'b',
  17: 'a',
  18: 'b',
  19: 'a',
  20: 'b'},
 'sport_name': {0: 'football',
  1: 'football',
  2: 'football',
  3: 'football',
  4: 'football',
  5: 'football',
  6: 'football',
  7: 'football',
  8: 'basketball',
  9: 'basketball',
  10: 'basketball',
  11: 'basketball',
  12: 'basketball',
  13: 'football',
  14: 'football',
  15: 'football',
  16: 'football',
  17: 'football',
  18: 'football',
  19: 'football',
  20: 'football'}}
ketan
  • 2,732
  • 11
  • 34
  • 80

1 Answers1

1

What you can do is merge the dataframe into itself, after computing the month number (from the date), and the previous one as well.

Let's start with computing those 2 values. For convenience purposes, I firstly converted the raw month string value to datetime, which allowed me to use relativedelta to compute the previous month. This ensures behaviour is correct, even after a change of year.

In [7]: df['month'] = pd.to_datetime(df['month'])

In [8]: df['month_num'] = df['month'].apply(lambda x: x.strftime('%Y-%m'))

In [9]: from dateutil.relativedelta import relativedelta

In [10]: df['previous_month_num'] = df['month'].apply(lambda x: (x + relativedelta(months=-1)).strftime('%Y-%m'))

In [11]: df
Out[11]:
     city      month person_count person_name person_symbol sir  sport_name  \
0  mumbai 2017-01-23           10      ramesh           ram   a    football
1  mumbai 2017-01-23           14      ramesh           mum   a    football
2   delhi 2017-01-23           25      ramesh           mum   a    football
3   delhi 2017-01-23           20      ramesh           ram   a    football
4  mumbai 2017-02-22           34      ramesh           ram   b    football
5  mumbai 2017-02-22           23      ramesh           mum   b    football
6   delhi 2017-02-22           43      ramesh           mum   b    football
7   delhi 2017-02-22           34      ramesh           ram   b    football
8    pune 2017-03-03           10      mahesh           mah   c  basketball
9  nagpur 2017-03-03           20      mahesh           mah   c  basketball

  month_num previous_month_num
0   2017-01            2016-12
1   2017-01            2016-12
2   2017-01            2016-12
3   2017-01            2016-12
4   2017-02            2017-01
5   2017-02            2017-01
6   2017-02            2017-01
7   2017-02            2017-01
8   2017-03            2017-02
9   2017-03            2017-02

We can then merge the dataframe into itself, using the computed month values as merging keys:

In [12]: relevant_columns = ['city', 'person_symbol', 'sport_name']

In [13]: pd.merge(df, df, left_on=relevant_columns + ['previous_month_num'], right_on=rele
    ...: vant_columns + ['month_num'], how='left', suffixes=('', '_previous'))[list(df.col
    ...: umns) + ['person_count_previous']].fillna(0).drop(['month_num', 'previous_month_n
    ...: um'], axis=1)
Out[13]:
     city      month person_count person_name person_symbol sir  sport_name  \
0  mumbai 2017-01-23           10      ramesh           ram   a    football
1  mumbai 2017-01-23           14      ramesh           mum   a    football
2   delhi 2017-01-23           25      ramesh           mum   a    football
3   delhi 2017-01-23           20      ramesh           ram   a    football
4  mumbai 2017-02-22           34      ramesh           ram   b    football
5  mumbai 2017-02-22           23      ramesh           mum   b    football
6   delhi 2017-02-22           43      ramesh           mum   b    football
7   delhi 2017-02-22           34      ramesh           ram   b    football
8    pune 2017-03-03           10      mahesh           mah   c  basketball
9  nagpur 2017-03-03           20      mahesh           mah   c  basketball

  person_count_previous
0                     0
1                     0
2                     0
3                     0
4                    10
5                    14
6                    25
7                    20
8                     0
9                     0

Some comments:

  • I used ['city', 'person_symbol', 'sport_name'] as the reference columns, but feel free to add some more, depending on what exactly you want to achieve.
  • The new column is named person_count_previous, but you can rename it, should it be best for you.
  • By default, when there is no match for the previous count, the column will be NaN. I replaced the values with 0, thanks to fillna.
  • I removed the "temporary" columns using drop, but feel free to keep them.
3kt
  • 2,543
  • 1
  • 17
  • 29
  • @3kt- Thanks for you appreciate help. I have one question like can we do the same thing for weeks as well? If the data is available for each week. Can we calculate previous week counts from that data? – ketan Sep 18 '17 at 15:53
  • @3kt- I did it. Thanks – ketan Sep 18 '17 at 17:15