0

So I have the following data frame:

In [1]:
import pandas as pd
import numpy as np

df = pd.DataFrame([['0068edf090ceaf1356', '0068edf090ceaf1356', '0068edf090ceaf1356','0068edf090ceaf1356', '0068edf090ceaf1356', '009eft67eaa133cea4', '009eft67eaa133cea4', '009eft67eaa133cea4', '009eft67eaa133cea4'], [-26, -26 -36, 81, 181, -51, -81, 61, 71], [18.0, 27.0, 53.0, 43.6, 12.4, 24.4, 63.0,72.8]], columns = ['uuid', 'prom_completed_date', 'prom_score'])  

In [2]: df
Out[2]: 
         uuid       prom_completed_date        prom_score
0068edf090ceaf1356          -26                    18.0
0068edf090ceaf1356          -26                    18.0
0068edf090ceaf1356          -36                    27.0
0068edf090ceaf1356           81                    53.0
0068edf090ceaf1356          181                    43.6 
009eft67eaa133cea4          -51                    12.4
009eft67eaa133cea4          -81                    24.4
009eft67eaa133cea4           61                    63.0
009eft67eaa133cea4           71                    72.8

Where each patient has multiple entries. Bearing in mind that the first two entries are not duplicates but are the same as there are other columns with different options so rather than the pre-op average being (18+18+27)/3 it should be (18+27)/2.

I want to create a new data frame where each uuid has three new columns:

  • an average PROM score where the values in the prom_completed_date_relative are negative
  • an average PROM score where the values in the prom_completed_date_relative are positive
  • the difference between the two above averages.

I'm not exactly sure how to do the coding for this in python, whilst ensuring that the uuid's are grouped.

I'm looking for something like this:

In [3]:
Out[3]:
      uuid            postop_avgPROM    preop_avgPROM    difference
0068edf090ceaf1356          48.3             22.5           25.8
009eft67eaa133cea4          67.9             18.4           49.5

I have tried the following:

df.query("prom_completed_date_relative">0).groupby("uuid")["prom_score"].mean().reset_index(name="postop_avgPROM_score")

but it does not seem to work, unfortunately.

Thank you.

  • Welcome to stack overflow! Please provide a [mcve] including sample input and sample output as text in the body of your question, not as a picture or external link. See [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Jun 10 '20 at 14:41
  • Thank you! So sorry, I have amended the question now to include the relevant information. – Fromanothersea Jun 10 '20 at 16:16

2 Answers2

0

Here is a solution for what you tried.

This will give you the prom_scoreaverage for each combination of uuid value and prom_completed_date_relative positive/negative.

df_avg = df.groupby(["uuid",df["prom_completed_date_relative"]>=0])["prom_score"].mean().reset_index()

You will need to process it a little more in order to get the columns the way you want.

Using .pivot() on df_avg :

df_avg = df_avg.pivot(index="uuid", columns="prom_completed_date_relative" ,values="prom_score")
AMH
  • 502
  • 2
  • 10
  • Hi, thank you so much for your response! But the averages don't come out with the correct PROM averages as, for example, looking at my above-preferred outcome in the question, the first patient has two prom_completed_date at -26 and the score being the same for both so should have been (18+27)/2 rather than (18+18+27)/3, as the two first entries were recorded on the same date, so count as one score entry. Sorry for the confusion. – Fromanothersea Jun 10 '20 at 16:33
0

There may be a more concise way to achieve your result, but here is a multi-step way that is pretty clear

#get each column
post_op=df[df['prom_completed_date']>0].groupby('uuid').mean()['prom_score']
pre_op=df[df['prom_completed_date']<0].groupby('uuid').mean()['prom_score']
difference=post_op-pre_op

#concat them together
df1=pd.concat([post_op,pre_op,difference], axis=1)

#rename the columns
df1.columns=['postop_avgPROM','preop_avgPROM','difference']

df1
                    postop_avgPROM  preop_avgPROM   difference
uuid            
0068edf090ceaf1356  48.3            22.5            25.8
009eft67eaa133cea4  67.9            18.4            49.5
G. Anderson
  • 5,815
  • 2
  • 14
  • 21
  • Hi, thank you so much for your response! Same as the answer before, unfortunately, the averages don't come out with the correct PROM averages as, for example, looking at my above-preferred outcome in the question, the first patient has two prom_completed_date at -26 and the score being the same for both so should have been (18+27)/2 rather than (18+18+27)/3, as the two first entries were recorded on the same date, so count as one score entry. Very sorry for the confusion. – Fromanothersea Jun 10 '20 at 16:50
  • The confusion, I think, stems from the fact that this output matches your "I'm looking for something like this:" outcome exactly. You didn't describe anything in your question about needing to remove duplicate entries, nor does your expected result reflect that need. – G. Anderson Jun 10 '20 at 16:57
  • Oh no sorry, it's not a duplicate entry, there are other columns in the data frame but are not required which have different entries hence it appears to be a duplicate. Sorry for not explaining properly. – Fromanothersea Jun 10 '20 at 17:04
  • So for our intents and purposes, it would be a duplicate. But again, your question doesn't describe the behavior you are talking about, and your expected output does not reflect it. We can only help with an issue that is explained, so it would be worth going back to your original question and doing an [edit] to make it a true [mcve], or possibly even delete it and ask again with the new complete parameters – G. Anderson Jun 10 '20 at 17:23
  • Ah yes, so sorry I will amend. Thank you so much for your answer. – Fromanothersea Jun 10 '20 at 21:30