I have a huge data frame storing information about patient encounters and variables related to their vitals and lab tests.
str_df.shape
(278546, 34)
str_df.columns
Index(['hadm_id', 'ce_charttime', 'hr', 'sbp', 'dbp', 'map', 'resp', 'temp',
'spo2', 'glucose', 'base_excess', 'hco3', 'fio2', 'ph', 'paco2', 'sao2',
'ast', 'bun', 'alp', 'calcium', 'chloride', 'creatinine',
'bilirubin_direct', 'bilirubin_total', 'lactic_acid', 'magnesium',
'potassium', 'troponin', 'hematocrit', 'hemoglobin', 'ptt', 'wbc',
'fibrinogen', 'platelets'],
dtype='object')
I want to calculate change statistics for each patient encounter (hadm_id) over a rolling 24-hour window. Please note that this data set is pretty sparse, in the sense that most of the variables are NaN most of the time.
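To illustrate what I mean (with made-up numbers), here is the 24-hour rolling mean of hr for a single encounter:

import pandas as pd

toy = pd.DataFrame({
    'hadm_id': [1, 1, 1],
    'ce_charttime': pd.to_datetime(['2020-01-01 00:00', '2020-01-01 12:00', '2020-01-02 06:00']),
    'hr': [80.0, 90.0, 100.0],
})
print(toy.set_index('ce_charttime').groupby('hadm_id')['hr'].rolling('24h').mean())
# hadm_id  ce_charttime
# 1        2020-01-01 00:00:00    80.0
#          2020-01-01 12:00:00    85.0
#          2020-01-02 06:00:00    95.0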
I have come up with the following code, which I tested on a very small subset of the data and which seems to work:
import pandas as pd
from scipy import stats

def percentile(n):
    def percentile_(x):
        return x.quantile(n)
    percentile_.__name__ = f'percentile_{n*100:2.0f}'
    return percentile_

def get_stats(df, var_cols, statistics):
    # assumes rows are already sorted by hadm_id and ce_charttime
    df.set_index('ce_charttime', inplace=True)  # datetime index so rolling can use '24h'
    stats_dfs = []
    # loop through each var for which change statistics need to be computed
    for var in var_cols:
        stats_df = df.groupby('hadm_id')[var].rolling('24h').agg(statistics).reset_index(drop=True)  # compute stats
        stats_df.columns = [f'{var}_{col}' for col in stats_df.columns]  # create var-dependent names for stats
        stats_df = pd.concat([df[var].reset_index(drop=True), stats_df], axis=1)  # cat the original var value to preserve order
        stats_dfs.append(stats_df)  # add the stats for this particular var to the list
    df.reset_index(inplace=True)  # remove charttime index
    df.drop(var_cols, inplace=True, axis=1)  # drop the original var cols (we've added them back in the for loop)
    return pd.concat([df, *stats_dfs], axis=1)  # cat the list into a df
statistics = ['min', 'mean', 'median', 'std', 'var', 'kurt', 'skew', percentile(0.25), percentile(0.75), stats.iqr, 'max']
var_cols = str_df.columns[2:]
str_df = get_stats(str_df.copy(), var_cols, statistics)
# reorder columns
move = ['hadm_id', 'ce_charttime']
order = move + (str_df.columns.drop(move).tolist())
str_df = str_df[order]
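For clarity, the percentile helper just wraps Series.quantile and gives the aggregate a readable name, so the stat columns come out labelled:

p25 = percentile(0.25)
p25.__name__                          # 'percentile_25'
p25(pd.Series([1.0, 2.0, 3.0, 4.0]))  # 1.75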
Unfortunately, when I try to run this on the entire dataset, it takes far too long and crashes my Jupyter notebook. I suspect the per-variable for loop is to blame. Is there a more pythonic/pandas-idiomatic way of accomplishing this task?
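For what it's worth, I imagined something along the lines of the sketch below, which starts from the original (pre-get_stats) str_df and replaces the per-variable loop with a single groupby/rolling call over all variable columns at once, but I haven't verified that it is actually faster or lighter on memory (the custom callables like stats.iqr presumably still go through rolling.apply):

all_stats = (
    str_df.set_index('ce_charttime')
          .groupby('hadm_id')[list(var_cols)]
          .rolling('24h')
          .agg(statistics)
)
all_stats.columns = [f'{var}_{stat}' for var, stat in all_stats.columns]  # flatten the (variable, statistic) column MultiIndex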
Thanks.