I have a DataFrame (df1
) with patients, where each patient (with unique id
) has an admission timestamp:
admission_timestamp id
0 2020-03-31 12:00:00 1
1 2021-01-13 20:52:00 2
2 2020-04-02 07:36:00 3
3 2020-04-05 16:27:00 4
4 2020-03-21 18:51:00 5
I also have a DataFrame (df2
) with for each patient (with unique id
), data for a specific feature. For example:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
2 1 temperature 2020-04-03 13:04:33 36.51
3 2 temperature 2020-04-02 07:44:12 36.45
4 2 temperature 2020-04-08 08:36:00 36.50
Where effective_timestamp
is of type: datetime64[ns], for both columns. The id
s for both dataframes link to the same patients.
In reality there is a lot more data with +- 1 value per minute. What I want is for each patient, only the data for the first X (say 24) hours after the admission timestamp from df1
. So the above would result in:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
3 2 temperature 2020-04-02 07:44:12 36.45
This would thus include first searching for the admission timestamp, and with this timestamp, drop all rows for that patient where the effective_timestamp is not within X hours from the admission timestamp. Here, X should be variable (could be 7, 24, 72, etc). I could not find a similar question on SO. I tried this using panda's date_range but I don't know how to perform that for each patient, with a variable value for X. Any help is appreciated.
Edit: I could also merge the dataframes together so each row in df2
has the admission_timestamp, and then subtract the two columns to get the difference in time. And then drop all rows where difference > X. But this sounds very cumbersome.