
I have a huge data frame storing information about patient encounters and variables related to their vitals and lab tests.

str_df.shape
(278546, 34)
str_df.columns
Index(['hadm_id', 'ce_charttime', 'hr', 'sbp', 'dbp', 'map', 'resp', 'temp',
       'spo2', 'glucose', 'base_excess', 'hco3', 'fio2', 'ph', 'paco2', 'sao2',
       'ast', 'bun', 'alp', 'calcium', 'chloride', 'creatinine',
       'bilirubin_direct', 'bilirubin_total', 'lactic_acid', 'magnesium',
       'potassium', 'troponin', 'hematocrit', 'hemoglobin', 'ptt', 'wbc',
       'fibrinogen', 'platelets'],
      dtype='object')

I want to calculate change statistics for each patient encounter (`hadm_id`) over a 24-hour window. Please note that this data set is quite sparse, in the sense that most of the variables are NaN most of the time.

I have come up with the following code, which was tested on a very small subset of the data and seems to work:

import pandas as pd
from scipy import stats

def percentile(n):
  def percentile_(x):
    return x.quantile(n)
  percentile_.__name__ = f'percentile_{n*100:2.0f}'
  return percentile_

def get_stats(df, var_cols, statistics):
  df.set_index('ce_charttime', inplace=True) # change index to charttime to use '24h' on rolling
  stats_dfs = []

  # loop through each var for which change statistics need to be computed
  for var in var_cols:
    stats_df = df.groupby('hadm_id')[var].rolling('24h').agg(statistics).reset_index(drop=True) # compute stats
    stats_df.columns = [f'{var}_{col}' for col in stats_df.columns] # create var dependent names for stats
    stats_df = pd.concat([df[var].reset_index(drop=True), stats_df], axis=1) # cat the original var value to preserve order 
    stats_dfs.append(stats_df) # add the stats for this particular var to the list

  df.reset_index(inplace=True) # remove charttime index
  df.drop(var_cols, inplace=True, axis=1) # drop the original var cols (we re-added them inside the loop)
  return pd.concat([df, *stats_dfs], axis=1) # cat the list into a df

statistics = ['min', 'mean', 'median', 'std', 'var', 'kurt', 'skew',
              percentile(0.25), percentile(0.75), stats.iqr, 'max']
var_cols = str_df.columns[2:]

str_df = get_stats(str_df.copy(), var_cols, statistics)

# reorder columns
move = ['hadm_id', 'ce_charttime']
order = move + (str_df.columns.drop(move).tolist())
str_df = str_df[order]
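
As a side note, the `percentile` factory above can be sanity-checked on its own before wiring it into `agg`; here is a minimal check on a toy Series (made-up data, not the real frame):

```python
import pandas as pd

def percentile(n):
  def percentile_(x):
    return x.quantile(n)
  percentile_.__name__ = f'percentile_{n*100:2.0f}'
  return percentile_

s = pd.Series([1, 2, 3, 4, 5])
print(percentile(0.25).__name__)  # percentile_25
print(percentile(0.25)(s))        # 2.0
```

The `__name__` override is what makes `agg` label the output column `percentile_25` instead of the generic `percentile_`.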

Unfortunately, when I try to run this on the entire dataset, it takes far too long and crashes my Jupyter notebook. I suspect the per-variable loop is to blame. Is there a more Pythonic/pandas-idiomatic way of accomplishing this task?

Thanks.

shaun
  • How can you even roll `24h`? My system (pandas 0.24.2, python 3.7) only allows rolling with integer windows. – Quang Hoang Oct 08 '19 at 18:08
  • From here: https://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval. Pandas version '0.25.1'. It is only possible when `pd.datetime` is the index. – shaun Oct 08 '19 at 18:09
  • `stats_df = df.groupby('hadm_id')[var_cols].rolling('24h').agg(statistics)` seems to work fine. – Quang Hoang Oct 08 '19 at 18:33
  • It works, yes, but it's too slow. That was the question I asked: whether there is a faster way to do this over all 32 variables. – shaun Oct 08 '19 at 18:34
  • What I wrote was for 32 variables at once. But sure, it's pretty slow: 6s for 1000 rows, so roughly 1500s for 250k rows. – Quang Hoang Oct 08 '19 at 18:36
  • Did you use the same code that I have shown in my post? I.e., by looping through each variable and computing a dozen statistics? – shaun Oct 08 '19 at 18:37
  • Not really; I'm saying that `get_stats(str_df.copy(), var_cols, statistics)` can be replaced by the code I posted. – Quang Hoang Oct 08 '19 at 18:39
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/200576/discussion-between-shaun-and-quang-hoang). – shaun Oct 08 '19 at 19:30
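
For future readers, the single-call approach suggested in the comments (one groupby/rolling pass over all variable columns at once, instead of one pass per variable) can be sketched on a toy frame. The column names follow the question's schema; the data itself is made up:

```python
import numpy as np
import pandas as pd

# Toy data in the question's schema: two encounters, readings 6h apart.
df = pd.DataFrame({
    'hadm_id': [1, 1, 1, 2, 2, 2],
    'ce_charttime': pd.date_range('2019-01-01', periods=6, freq='6h'),
    'hr': [80, 85, np.nan, 70, 72, 75],
    'sbp': [120, np.nan, 118, 110, 112, 111],
})

# Single pass: select all variable columns, then roll a 24h window per encounter.
stats_df = (df.set_index('ce_charttime')
              .groupby('hadm_id')[['hr', 'sbp']]
              .rolling('24h')
              .agg(['min', 'mean', 'max']))

# Flatten the (variable, statistic) MultiIndex into var_stat column names.
stats_df.columns = [f'{v}_{s}' for v, s in stats_df.columns]
print(stats_df)
```

NaNs are skipped inside each window (the default `min_periods=1` applies), so sparse columns still produce statistics wherever at least one value falls in the window.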

0 Answers