Summarising features with multiple values in Python for Machine Learning model

Question

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:

PregnancyID MotherID    gestationalAgeInWeeks  abdomCirc
0           0           14                     150
0           0           21                     200
1           1           20                     294
1           1           25                     315
1           1           30                     350
2           2           8                      170
2           2           9                      180
2           2           18                     NaN

As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).

I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:

abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks

So my final dataset would look like this:

PregnancyID     MotherID    abdomCirc1st  abdomCirc2nd   abdomCirc3rd
0               0           NaN           200            NaN
1               1           NaN           315            350
2               2           180           NaN            NaN

The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.

But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.

What I want to do is the following:

Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature; abdomCirc1st, abdomCir2nd or abdomCirc3rd.

I think I have to do something along the lines of:

df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')

But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.

Question is actually about `pandas`, not about `scikit-learn` or `machine-learning` (tags edited). — desertnaut, Dec 02 '20 at 10:52

Mighty Diffy · Answer 1 · 2020-12-02T16:05:28.180

There is a magic command called query. This should do your work for now:

abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()

abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()

abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()

If you want something more automatic (and not manually changing the values of your ID's: MotherID and PregnancyID, every time for each different group of rows), you have to combine it with groupby (as you did on your own)

Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Thank you for this suggestion, but I have gone with @piterbarg's answer as it is more complete and I am more familiar with the commands used in it. — sums22, Dec 03 '20 at 14:35

score 1 · Accepted Answer · edited Dec 04 '20 at 10:59

1

You can try this. a bit of a complicated query but it seems to work:

(df.groupby(['MotherID', 'PregnancyID'])
    .apply(lambda d: d.assign(tm = (d['gestationalAgeInWeeks']+ 13 - 1 )// 13))
                        .groupby('tm')['abdomCirc']
                        .apply(max))
    .unstack()
)

produces


     tm                    1      2     3
MotherID    PregnancyID         
0           0              NaN    200.0 NaN
1           1              NaN    294.0 350.0
2           2              180.0  NaN   NaN

Let's unpick this a bit. First we groupby on MontherId, PregnancyID. Then we apply a function to each grouped dataframe (d)

For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby by 'tm' and apply max. For each sub-dataframe d then we obtain a Series which is tm:max(abdomCirc).

Then we unstack() that moves tm to the column names

You may want to rename this columns later, but I did not bother

Solution 2

Come to think of it you can simplify the above a bit:

(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13))
    .drop(columns = 'gestationalAgeInWeeks')
    .groupby(['MotherID', 'PregnancyID','tm'])
    .agg('max')
    .unstack()
    )

similar idea, same output.

edited Dec 04 '20 at 10:59

sums22

1,793
3
13
25

answered Dec 02 '20 at 14:03

piterbarg

8,089
2
6
22

This seems to work, thanks. The only thing I would suggest is to divide by 13 (as each trimester is about 13 weeks) and to use ceiling division rather than floor division, like [this](https://stackoverflow.com/a/33299416/7373787). I am trying to edit the answer to reflect this minor suggestion but the edit queue is full, will try to edit again soon. – sums22 Dec 03 '20 at 14:28
good point. I am not sure why the edit queue is full! I certainly do not have any indication to that effect. So is it `(df['gestationalAgeInWeeks']+12)//13`? if you can't still edit in a while I will put this in if you agree – piterbarg Dec 03 '20 at 14:47
yes exactly. So maybe writing it as ((d['gestationalAgeInWeeks']+ 13 - 1 )// 13) and referencing the answer in my comment above so it is clear to any future readers? – sums22 Dec 03 '20 at 14:58
Any idea how to put the query output in a .csv file with the ids? When I print the dataframe I can see the mother and pregnancy ids, but when I try and put the results in a csv file using df.to_csv('abdomCirc.csv', index=False) I am losing the id columns and am only left with the newly created columns? – sums22 Dec 07 '20 at 14:07

Summarising features with multiple values in Python for Machine Learning model

2 Answers2

Solution 2

Linked