I have a dataframe with patient visit-level data (each row is a unique individual's hospital visit). There is a column for admit_timestamp
and another for discharge_timestamp
. I need to calculate several metrics (e.g., total patients per hour) for each hour in a day (0-23). The final result would be a (24 x C
) matrix, where each row is an hour, and C
depends on the aggregated metrics I end up calculating. Note that this is different from this question because I can't simply use pd.resample
on one column---I have to account for the entire span of time a patient is in the hospital, which may be more than 1 day. I'm trying to find an efficient implementation, since it's an operation that will occur on a few GBs worth of data every few weeks (in batch). I'm hoping this community can point me in the right direction. Consider the following reproducible example:
Say, we had 3 patients, with admit/discharge timestamps as follows:
df = pd.DataFrame({
'patient_id':[1,2,3],
'admit_timestamp':['2021-01-01 00:00:00', '2021-01-01 00:00:00', '2021-01-01 22:00:00'],
'discharge_timestamp':['2021-01-01 02:00:00', '2021-01-02 00:00:00', '2021-01-01 23:30:00']
})
df.admit_timestamp = pd.to_datetime(df.admit_timestamp)
df.discharge_timestamp = pd.to_datetime(df.discharge_timestamp)
If I wanted to simply calculate the total number of patients per hour (the easiest of the metrics I need), I would expect a table like this:
Hour | Count
0 2
1 2
2 1
. .
. .
. .
23 1.5
I started to play around with combining pd.interval_range
to create a list of relevant hours, with pd.explode
to melt the data so that each row represents a unique patient-hour, but not sure whether this is an optimal approach.