1

I have one pandas Dataframe which looks like below:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','cricket','cricket','cricket','cricket'],
       'person_name': ['ramesh','ramesh','ramesh','ramesh','mohit','mahesh','mahesh','mahesh','mahesh'],
           'city': ['mumbai', 'mumbai','delhi','delhi','pune','surat','surat','panji','panji'],
    'person_symbol': ['ram','mum','mum','ram','moh','mah','sur','sur','mah'],
    'person_count': ['10','14','25','20','11','34','23','43','34']})
df = df[['sport_name','person_name','city','person_symbol','person_count']]
print df

  sport_name person_name    city person_symbol person_count
0   football      ramesh  mumbai           ram           10
1   football      ramesh  mumbai           mum           14
2   football      ramesh   delhi           mum           25
3   football      ramesh   delhi           ram           20
4   football       mohit    pune           moh           11
5    cricket      mahesh   surat           mah           34
6    cricket      mahesh   surat           sur           23
7    cricket      mahesh   panji           sur           43
8    cricket      mahesh   panji           mah           34

From this Dataframe, I want to create two column dataframe named as "derived_symbol" and "person_count". For creating it I need to focus on some condition like below:

  • derived_symbol need to form for each unique city and person_symbol.
  • person_count is calculated based on what the derived_symbol is.

Example:

Consider First set of Dataframe i.e. sport_name = football and person_name = rakesh Which are four rows:

  sport_name person_name    city person_symbol person_count
0   football      ramesh  mumbai           ram           10
1   football      ramesh  mumbai           mum           14
2   football      ramesh   delhi           mum           25
3   football      ramesh   delhi           ram           20

In this above four rows, If we look at the city and person_symbol column values we can see four different unique values are available which are mumbai, delhi, ram and mum. so for this above four rows our output looks like below:

derived_symbol                     person_count
football.ramesh.TOTAL.mumbai_count  24
football.ramesh.TOTAL.delhi_count   45
football.ramesh.TOTAL.ram_count     30
football.ramesh.TOTAL.mum_count     39

Final Expected Output:

derived_symbol                      person_count
football.ramesh.TOTAL.mumbai_count  24
football.ramesh.TOTAL.delhi_count   45
football.ramesh.TOTAL.ram_count     30
football.ramesh.TOTAL.mum_count     39
football.mohit.TOTAL.pune_count     11
football.mohit.TOTAL.moh_count      11
cricket.mahesh.TOTAL.surat_count    57
cricket.mahesh.TOTAL.panji_count    77
cricket.mahesh.TOTAL.sur_count      66
cricket.mahesh.TOTAL.mah_count      68

Edit : Dataframe has dates too like below:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football'],
       'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh'],
           'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi'],
    'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram'],
    'person_count': ['10','14','25','20','34','23','43','34'],
    'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month']]
print df

  sport_name person_name    city person_symbol person_count       month
0   football      ramesh  mumbai           ram           10  2017-01-23
1   football      ramesh  mumbai           mum           14  2017-01-23
2   football      ramesh   delhi           mum           25  2017-01-23
3   football      ramesh   delhi           ram           20  2017-01-23
4   football      ramesh  mumbai           ram           34  2017-02-26
5   football      ramesh  mumbai           mum           23  2017-02-26
6   football      ramesh   delhi           mum           43  2017-02-26
7   football      ramesh   delhi           ram           34  2017-02-26

Expected output:

derived_symbol              person_count    month
football.ramesh.TOTAL.mumbai_count  24      2017-01-23
football.ramesh.TOTAL.delhi_count   45      2017-01-23
football.ramesh.TOTAL.ram_count     30      2017-01-23
football.ramesh.TOTAL.mum_count     39      2017-01-23
football.ramesh.TOTAL.mumbai_count  57      2017-02-26
football.ramesh.TOTAL.delhi_count   77      2017-02-26
football.ramesh.TOTAL.ram_count     68      2017-02-26
football.ramesh.TOTAL.mum_count     66      2017-02-26

I did following to calculate person_count month wise:

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football'],
           'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh'],
               'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi'],
        'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram'],
        'person_count': ['10','14','25','20','34','23','43','34'],
        'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-26','2017-02-26','2017-02-26','2017-02-26']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month']]

df['person_count'] = df['person_count'].astype(int)

df1=df.set_index(['sport_name','person_name','person_count','month']).stack().reset_index(name='val')

df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

df2 = df1.groupby(['derived_symbol','month'])['person_count'].sum().reset_index(name='person_count')
print (df2)
ketan
  • 2,732
  • 11
  • 34
  • 80

2 Answers2

2

Here's one way

First change the person_count type to numeric

In [2126]: df.person_count = df.person_count.astype(int)

Reshape your data to get city and person_symbol under one level, and then groupby to get the Total count.

In [2127]: dff = (df.melt(id_vars=['sport_name', 'person_name', 'person_count'])
                    .groupby(['sport_name', 'person_name', 'value']).person_count.sum())

In [2128]: dff
Out[2128]:
sport_name  person_name  value
cricket     mahesh       mah       68
                         panji     77
                         sur       66
                         surat     57
football    mohit        moh       11
                         pune      11
            ramesh       delhi     45
                         mum       39
                         mumbai    24
                         ram       30
Name: person_count, dtype: int32

format the index levels with custom format.

In [2129]: dff.index = ['{0}.{1}.TOTAL.{2}_count'.format(*idx) for idx in dff.index]

In [2130]: dff
Out[2130]:
cricket.mahesh.TOTAL.mah_count        68
cricket.mahesh.TOTAL.panji_count      77
cricket.mahesh.TOTAL.sur_count        66
cricket.mahesh.TOTAL.surat_count      57
football.mohit.TOTAL.moh_count        11
football.mohit.TOTAL.pune_count       11
football.ramesh.TOTAL.delhi_count     45
football.ramesh.TOTAL.mum_count       39
football.ramesh.TOTAL.mumbai_count    24
football.ramesh.TOTAL.ram_count       30
Name: person_count, dtype: int32
Zero
  • 74,117
  • 18
  • 147
  • 154
2

Use:

#convert column to int
df['person_count'] = df['person_count'].astype(int)

#reshape for one column from city and person_symbol columns 
df1=df.set_index(['sport_name','person_name','person_count']).stack().reset_index(name='val')
print (df1)
   sport_name person_name  person_count        level_3     val
0    football      ramesh            10           city  mumbai
1    football      ramesh            10  person_symbol     ram
2    football      ramesh            14           city  mumbai
3    football      ramesh            14  person_symbol     mum
4    football      ramesh            25           city   delhi
5    football      ramesh            25  person_symbol     mum
6    football      ramesh            20           city   delhi
7    football      ramesh            20  person_symbol     ram
8    football       mohit            11           city    pune
9    football       mohit            11  person_symbol     moh
10    cricket      mahesh            34           city   surat
11    cricket      mahesh            34  person_symbol     mah
12    cricket      mahesh            23           city   surat
13    cricket      mahesh            23  person_symbol     sur
14    cricket      mahesh            43           city   panji
15    cricket      mahesh            43  person_symbol     sur
16    cricket      mahesh            34           city   panji
17    cricket      mahesh            34  person_symbol     mah

#concatenate columns
a = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

#groupby by Series a and aggregate sum
df2 = df1['person_count'].groupby(a.rename('derived_symbol'), sort=False)
                         .sum()
                         .reset_index(name='person_count')
print (df2)
                       derived_symbol  person_count
0  football.ramesh.TOTAL.mumbai_count            24
1     football.ramesh.TOTAL.ram_count            30
2     football.ramesh.TOTAL.mum_count            39
3   football.ramesh.TOTAL.delhi_count            45
4     football.mohit.TOTAL.pune_count            11
5      football.mohit.TOTAL.moh_count            11
6    cricket.mahesh.TOTAL.surat_count            57
7      cricket.mahesh.TOTAL.mah_count            68
8      cricket.mahesh.TOTAL.sur_count            66
9    cricket.mahesh.TOTAL.panji_count            77
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • @jezrael- What does that level_3 means? I have another 5 columns also so I'm trying stack() on df1=df.set_index(['sport_name','person_name','person_count','another_col1','another_col2','another_col3','another_col4','another_col5']).stack().reset_index(name='val') but seems like returning empty dataframe? – ketan Sep 12 '17 at 09:15
  • If convert Multiindex to columns and some level have no name, default name is added - `level_3` means level 3 have no name. – jezrael Sep 12 '17 at 09:18
  • @jezrael- ohh got it... I forgot one important point these entries. Is has date too. I need to calculate based on dates. Please see edit. – ketan Sep 12 '17 at 09:34
  • @jezrael- Yeah I did it. please see edit and make me correct if I'm wrong. – ketan Sep 12 '17 at 09:54
  • I think your final solution working nice. If dont want sorting by grouby columsn use `df1.groupby(['derived_symbol','month'], sort=False)['person_count'].sum().reset_index(name='person_count')` – jezrael Sep 12 '17 at 10:11
  • @jezrael- Yeah Thanks for your appreciated help. – ketan Sep 13 '17 at 05:38
  • Glad can help, @kit – jezrael Sep 13 '17 at 05:39
  • Could you do me a favor and check out my answer [here](https://stackoverflow.com/a/46192213/2336654) and vote on it if you think it's good. Thanks! – piRSquared Sep 13 '17 at 08:32
  • @jezrael- If we have to add previous month person_count to the same dataframe then what we need to do for that? – ketan Sep 13 '17 at 10:46
  • Can you create new question? Hard answer without data. Thank you. – jezrael Sep 13 '17 at 11:00
  • @jezrael- I added new SO question for that see https://stackoverflow.com/questions/46196090/how-to-keep-track-of-previous-date-record-column-in-pandas-dataframe – ketan Sep 13 '17 at 11:16