1

I'm learning Python and thought working on a project might be the best way to learn it. I have about 200,000 rows of data in which the data shows list of medication for the patient. Here's a sample of the data.

PTID PTNAME     MME   DRNAME       DRUGNAME                    SPLY STR QTY  FACTOR
1   PATIENT, A  2700    DR, A   OXYCODONE HCL 15 MG             30  15  120 1.5
1   PATIENT, A  2700    DR, B   OXYCODONE HCL 15 MG             30  15  120 1.5
2   PATIENT, B  4050    DR, C   MORPHINE SULFATE ER 15 MG       30  15  270 1
2   PATIENT, B  4050    DR, C   MORPHINE SULFATE ER 15 MG       30  15  270 1
2   PATIENT, B   840    DR, A   OXYCODONE-ACETAMINOPHE 10MG-32  14  10  56  1.5
2   PATIENT, B  1350    DR, C   OXYCODONE-ACETAMINOPHE 5 MG-32  15  5   180 1.5
3   PATIENT, C  1350    DR, C   OXYCODONE-ACETAMINOPHE 5 MG-32  15  5   180 1.5
3   PATIENT, C  1800    DR, D   OXYCODONE-ACETAMINOPHE 10MG-32  30  10  120 1.5

I've been thinking about this a lot and have tried many ways but none of the code produce any results or makes any sense. Honestly, I don't even know where to begin. A little help would be highly appreciated.

So, what I want to do is consolidate the data for each patients and calculate the Total MME for each patient. The DRUGNAME should show the one that has higher MME. In other words, the dataframe should only have one row for each patient.

One thing I did try is

groupby_ptname = semp.groupby('PTNAME').apply(lambda x: x.MME.sum())

which shows unique patient names with total MME, but I'm not sure how to add other variables in this new dataframe.

root
  • 32,715
  • 6
  • 74
  • 87
user1828605
  • 1,723
  • 1
  • 24
  • 63
  • please help us by making a [good reproducible pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) I am unable to paste your df into python no matter how I try. – Steven G Oct 17 '16 at 19:42
  • Could you attach the excel sheet with the above data in it . I will be able solve this by converting it to dictionary , pandas not needed actually . – Shekhar Samanta Oct 17 '16 at 19:44
  • Sorry, I made the data up to protect privacy. I'll try to make the data reproducible. please bear with me, since I'm learning, I have to figure out how to do that. Thank you. – user1828605 Oct 17 '16 at 20:02

2 Answers2

1

Have another look at the documentation for the pandas groupby methods.

Here's something that could work for you:

#first get the total MME for each patient and drug combination
total_mme=semp.groupby(['PTNAME','DRUGNAME'])['MME'].sum()
#this will be a series object with index corresponding to PTNAME and DRUGNAME and values corresponding to the total MME
#now get the indices corresponding to the drug with the max MME total
max_drug_indices=total_mme.groupby(level='PTNAME').idxmax()
#index the total MME with these indices
out=total_mme[max_drug_indices]
Victor Chubukov
  • 1,345
  • 1
  • 10
  • 18
  • This is amazing. I think this is exactly what I needed. Will I be able to add any number of variables in the first group_by? – user1828605 Oct 17 '16 at 20:06
  • 1
    Sure. Just remember that if you do calculations on multiple columns, you will have to specify which one you want to take the `idxmax` of. – Victor Chubukov Oct 17 '16 at 20:12
  • Is there a short-cut to use all columns instead of typing each one of them? – user1828605 Oct 17 '16 at 20:57
  • So, turned out, if I add all variables it will be like using groupby('PTNAME') i.e. it shows everything in groups without aggregating it. But I was able to pick only the ones I need and your solution still worked. I started breaking down the code and learnt a lot from it. Thank you. – user1828605 Oct 18 '16 at 13:36
1

IIUC you can do it this way:

In [62]: df.sort_values('MME').groupby('PTNAME').agg({'MME':'sum', 'DRUGNAME':'last'})
Out[62]:
                                  DRUGNAME    MME
PTNAME
PATIENT, A             OXYCODONE HCL 15 MG   5400
PATIENT, B       MORPHINE SULFATE ER 15 MG  10290
PATIENT, C  OXYCODONE-ACETAMINOPHE 10MG-32   3150

or with .reset_index():

In [64]: df.sort_values('MME').groupby('PTNAME').agg({'MME':'sum', 'DRUGNAME':'last'}).reset_index()
Out[64]:
       PTNAME                        DRUGNAME    MME
0  PATIENT, A             OXYCODONE HCL 15 MG   5400
1  PATIENT, B       MORPHINE SULFATE ER 15 MG  10290
2  PATIENT, C  OXYCODONE-ACETAMINOPHE 10MG-32   3150

UPDATE: more fun with agg() function

In [84]: agg_funcs = {
    ...:     'MME':{'MME_max':'last',
    ...:            'MME_total':'sum'},
    ...:     'DRUGNAME':{'DRUGNAME_max_MME':'last'}
    ...: }
    ...:
    ...: rslt = (df.sort_values('MME')
    ...:          .groupby('PTNAME')
    ...:          .agg(agg_funcs)
    ...:          .reset_index()
    ...: )
    ...: rslt.columns = [tup[1] if tup[1] else tup[0] for tup in rslt.columns]
    ...:

In [85]: rslt
Out[85]:
       PTNAME  MME_total  MME_max                DRUGNAME_max_MME
0  PATIENT, A       5400     2700             OXYCODONE HCL 15 MG
1  PATIENT, B      10290     4050       MORPHINE SULFATE ER 15 MG
2  PATIENT, C       3150     1800  OXYCODONE-ACETAMINOPHE 10MG-32
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • This does the same thing as the answer above. This is also a good solution, although I didn't understand what's `last` in the `agg`? – user1828605 Oct 18 '16 at 13:50
  • 1
    It seems to me that this will pick the drug with the single largest MME as opposed to the largest total MME. Not sure anymore which one you need. – Victor Chubukov Oct 18 '16 at 13:52
  • 2
    "Last" means pick the last row in the group. He/she is sorting by the MME value, so that picks the drug with the highest MME. – Victor Chubukov Oct 18 '16 at 13:54
  • Actually, I just realized that too. Turned out this is the way I want it. I put two solutions together. @VictorChubukov solution shows the largest total MME. But Victor's solution helped me fix another problem too. I think this has been a great learning experience for me. I'll look at this more carefully and then select the answer appropriately. – user1828605 Oct 18 '16 at 13:57
  • This is very exciting. Thank you @VictorChubukov and @MaxU, your solutions have helped me learn a lot and challenge further. For this solution, I changed `'sum'` to a custom function by defining a function that calculates a 90 day average MME for each patient. Moving away from C# to Python definitely now. Thank you both. – user1828605 Oct 18 '16 at 14:41
  • @user1828605, `agg` is an aggregation function, which will be applied on results of `groupby`, operation; `last` - picks last value for each group – MaxU - stand with Ukraine Oct 18 '16 at 15:01
  • @VictorChubukov, `this will pick the drug with the single largest MME` - yes, it'll pick one last `DRUGNAME` with a biggest MME (because of sorting) for __each group__. That's how i understood OPs question... – MaxU - stand with Ukraine Oct 18 '16 at 15:03
  • Literally, it's like you just read my mind. I was doing just that to show the last total_mme and the sum of total_mme. – user1828605 Oct 18 '16 at 15:59
  • @user1828605, i just wanted to show the flexibility of `agg()` method... ;) – MaxU - stand with Ukraine Oct 18 '16 at 16:00
  • 1
    I just wanted to say one thing. Because of this solution, I was able to eliminate about 27% of the patients from the list that the vendor inaccurately identified. This made me look good and learnt a lot along the way. So, credit to both MaxU and VictorChubukov – user1828605 Oct 18 '16 at 16:00