1

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:

PregnancyID MotherID    gestationalAgeInWeeks  abdomCirc
0           0           14                     150
0           0           21                     200
1           1           20                     294
1           1           25                     315
1           1           30                     350
2           2           8                      170
2           2           9                      180
2           2           18                     NaN

Following this answer to a previous questions I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):


(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13))
    .drop(columns = 'gestationalAgeInWeeks')
    .groupby(['MotherID', 'PregnancyID','tm'])
    .agg('max')
    .unstack()
    )

This results in the following output:

     tm                    1      2     3
MotherID    PregnancyID         
0           0              NaN    200.0 NaN
1           1              NaN    294.0 350.0
2           2              180.0  NaN   NaN

However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.

I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?

sums22
  • 1,793
  • 3
  • 13
  • 25

1 Answers1

1

Chain that with reset_index:

(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
#    .drop(columns = 'gestationalAgeInWeeks')                 # don't need this
    .groupby(['MotherID', 'PregnancyID','tm'])['abdomCirc']   # change here 
    .max().add_prefix('abdomCirc_')                           # here
    .unstack()
    .reset_index()                                            # and here
    )

Or a more friendly version with pivot_table:

(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
    .pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
                 values= 'abdomCirc', aggfunc='max')
    .add_prefix('abdomCirc_')                        # remove this if you don't want the prefix
    .reset_index()
    )

Output:

tm     MotherID  PregnancyID  abdomCirc_1  abdomCirc_2  abdomCirc_3
0   abdomCirc_0  abdomCirc_0          NaN        200.0          NaN
1   abdomCirc_1  abdomCirc_1          NaN        315.0        350.0
2   abdomCirc_2  abdomCirc_2        180.0          NaN          NaN
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Can't believe it's that simple! Been trying to figure this out for the past few days! Thank you! – sums22 Dec 10 '20 at 16:49