Keep only rows from first X hours with starting point from another dataframe

Question

I have a DataFrame (df1) with patients, where each patient (with unique id) has an admission timestamp:

    admission_timestamp id
0   2020-03-31 12:00:00 1
1   2021-01-13 20:52:00 2
2   2020-04-02 07:36:00 3
3   2020-04-05 16:27:00 4
4   2020-03-21 18:51:00 5

I also have a DataFrame (df2) with for each patient (with unique id), data for a specific feature. For example:

    id name        effective_timestamp  numerical_value
0   1  temperature  2020-03-31 13:00:00 36.47
1   1  temperature  2020-03-31 13:04:33 36.61
2   1  temperature  2020-04-03 13:04:33 36.51
3   2  temperature  2020-04-02 07:44:12 36.45
4   2  temperature  2020-04-08 08:36:00 36.50

Where effective_timestamp is of type: datetime64[ns], for both columns. The ids for both dataframes link to the same patients.

In reality there is a lot more data with +- 1 value per minute. What I want is for each patient, only the data for the first X (say 24) hours after the admission timestamp from df1. So the above would result in:

    id name        effective_timestamp  numerical_value
0   1  temperature  2020-03-31 13:00:00 36.47
1   1  temperature  2020-03-31 13:04:33 36.61
3   2  temperature  2020-04-02 07:44:12 36.45

This would thus include first searching for the admission timestamp, and with this timestamp, drop all rows for that patient where the effective_timestamp is not within X hours from the admission timestamp. Here, X should be variable (could be 7, 24, 72, etc). I could not find a similar question on SO. I tried this using panda's date_range but I don't know how to perform that for each patient, with a variable value for X. Any help is appreciated.

Edit: I could also merge the dataframes together so each row in df2 has the admission_timestamp, and then subtract the two columns to get the difference in time. And then drop all rows where difference > X. But this sounds very cumbersome.

score 1 · Accepted Answer · answered Mar 30 '21 at 18:14

Let's use pd.DateOffset

First get the value of admission_timestamp for a given patient id, and convert it to pandas datetime.

Let's say id = 1

>>admissionTime = pd.to_datetime(df1[df1['id'] == 1]['admission_timestamp'].values[0])
>>admissionTime
Timestamp('2020-03-31 12:00:00')

Now, you just need to use pd.DateOffset to add 24 hours to it.

>>admissionTime += pd.DateOffset(hours=24)

Now, just look for the rows where id=1 and effective_timestamp < admissionTime

>>df2[(df2['id'] == 1) & (df2['effective_timestamp']<admissionTime)]
   id         name effective_timestamp  numerical_value
0   1  temperature 2020-03-31 13:00:00            36.47
1   1  temperature 2020-03-31 13:04:33            36.61

Thanks a lot, I did not know about pd.DateOffset. Solved! – sander Mar 30 '21 at 20:07 — sander, Mar 30 '21 at 20:07

Keep only rows from first X hours with starting point from another dataframe

1 Answers1