Panda groupby on many columns with agg()

Question

I've been asked to analize the DB from a medical record app. So a bunch of record would look like:

So i have to resume more than 3 million records from 2011 to 2014 by PX, i know they repeat since thats the ID for each patient, so a patient should had many visitis to the doctor. How could i group them or resume them by patient.

score 0 · Accepted Answer · edited May 23 '17 at 12:09

I don't know what you mean by "resume", but it looks like all you want to do is only to sort and display data in a nicer way. You can visually group (=order) the records "px- and fecha-wise" like this:

df.set_index(['px', 'fecha'], inplace=True)

EDIT:

When you perform a grouping of the data based on some common property, you have to decide, what kind of aggregation are you going to use on the data in other columns. Simply speaking, once you perform a groupby, you only have one empty field for in each remaining column for each "pacient_id" left, so you must use some aggregation function (e.g. sum, mean, min, avg, count,...) that will return desired representable value of the grouped data.

It is hard to work on your data since they are locked in an image, and it is impossible to tell what you mean by "Age", since this column is not visible, but I hope you can achieve what you want by looking at the following example with dummy data:

import pandas as pd
import numpy as np
from datetime import datetime
import random
from datetime import timedelta

def random_datetime_list_generator(start_date, end_date,n):
    return ((start_date + timedelta(seconds=random.randint(0, int((end_date - start_date).total_seconds())))) for i in xrange(n))

#create random dataframe with 4 sample columns and 50000 rows
rows = 50000
pacient_id = np.random.randint(100,200,rows)
dates = random_datetime_list_generator(pd.to_datetime("2011-01-01"),pd.to_datetime("2014-12-31"),rows)
age = np.random.randint(10,80,rows)
bill = np.random.randint(1,1000,rows)

df = pd.DataFrame(columns=["pacient_id","visited","age","bill"],data=zip(pacient_id,dates,age,bill))

print df.head()

# 1.Only perform statictis of the last visit of each pacient only
stats = df.groupby("pacient_id",as_index=False)["visited"].max()
stats.columns = ["pacient_id","last_visited"]
print stats

# 2. Perform a bit more complex statistics on pacient by specifying desired aggregate function for each column
custom_aggregation = {'visited':{"first visit": 'min',"last visit": "max"}, 'bill':{"average bill" : "mean"}, 'age': 'mean'}

#perform a group by with custom aggregation and renaming of functions
stats = df.groupby("pacient_id").agg(custom_aggregation)
#round floats
stats = stats.round(1)
print stats

Original dummy dataframe looks like so:

   pacient_id             visited  age  bill
0         150 2012-12-24 21:34:17   20   188
1         155 2012-10-26 00:34:45   17   672
2         116 2011-11-28 13:15:18   33   360
3         126 2011-06-03 17:36:10   58   167
4         165 2013-07-15 15:39:31   68   815

First aggregate would look like this:

    pacient_id        last_visited
0          100 2014-12-29 00:01:11
1          101 2014-12-22 06:00:48
2          102 2014-12-26 11:51:41
3          103 2014-12-29 15:01:32
4          104 2014-12-18 15:29:28
5          105 2014-12-30 11:08:29

Second, complex aggregation would look like this:

                       visited                       age         bill
                   first visit          last visit  mean average bill
pacient_id                                                           
100        2011-01-06 06:11:33 2014-12-29 00:01:11  45.2        507.9
101        2011-01-01 20:44:55 2014-12-22 06:00:48  44.0        503.8
102        2011-01-02 17:42:59 2014-12-26 11:51:41  43.2        498.0
103        2011-01-01 03:07:41 2014-12-29 15:01:32  43.5        495.1
104        2011-01-07 18:58:11 2014-12-18 15:29:28  45.9        501.7
105        2011-01-01 03:43:12 2014-12-30 11:08:29  44.3        513.0

This example should get you going. Additionaly, there is a nice SO question about pandas groupby aggregation which may teach you a lot on this topics.

Not exactly what i am looking for. Let me see if i can explain. So i want to "collapse or resume" for example all the consultas from each patient in 2011 so that in the age field i only end with the max value. So by resume i mean like running a return max age function in the age col by year and end only with the age for each patient by year. Am i clear? — alexzaizar09, Jan 07 '17 at 05:19

Panda groupby on many columns with agg()

1 Answers1