0

I have a dataframe with patient visit-level data (each row is a unique individual's hospital visit). There is a column for admit_timestamp and another for discharge_timestamp. I need to calculate several metrics (e.g., total patients per hour) for each hour in a day (0-23). The final result would be a (24 x C) matrix, where each row is an hour, and C depends on the aggregated metrics I end up calculating. Note that this is different from this question because I can't simply use pd.resample on one column---I have to account for the entire span of time a patient is in the hospital, which may be more than 1 day. I'm trying to find an efficient implementation, since it's an operation that will occur on a few GBs worth of data every few weeks (in batch). I'm hoping this community can point me in the right direction. Consider the following reproducible example:

Say, we had 3 patients, with admit/discharge timestamps as follows:

df = pd.DataFrame({
    'patient_id':[1,2,3],
    'admit_timestamp':['2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 22:00:00'],
    'discharge_timestamp':['2021-01-01 02:00:00', '2021-01-02 00:00:00', '2021-01-01 23:30:00']
})

df.admit_timestamp = pd.to_datetime(df.admit_timestamp)
df.discharge_timestamp = pd.to_datetime(df.discharge_timestamp)

If I wanted to simply calculate the total number of patients per hour (the easiest of the metrics I need), I would expect a table like this:

Hour | Count
0      2
1      2
2      1
.      .
.      .
.      .
23     1.5

I started to play around with combining pd.interval_range to create a list of relevant hours, with pd.explode to melt the data so that each row represents a unique patient-hour, but not sure whether this is an optimal approach.

ricniet
  • 115
  • 6
  • Please have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide a [mcve] with sample input and expected output so that we can better understand how to help – G. Anderson Feb 02 '21 at 15:45
  • You mention that patients could be present for more than 1 day, can you please clarify how you handle that in your counting? Also, you list "1.5" total in hour 23, which is the sum on 1 + 0.5 for two patients... you want to count total hours as a float on admission and discharges that don't occur on the hour? Or are you trying to count if they were present at any point during the hour? A few more examples in your dataframe would help. – Rick M Feb 03 '21 at 13:56

0 Answers0