I have two dataframe:
df = pd.DataFrame({'ID': ['1','1','1','2','2','3','4','4'], \
'ward': ['icu', 'surgery','icu', 'neurology','neurology','obstetrics','OPD', 'surgery'], \
'start_date': ['2016-10-22 18:19:19', '2016-10-24 10:20:00','2016-10-24 12:41:30', '2016-11-09 19:41:30','2016-11-09 23:20:00','2016-11-08 09:45:00','2016-10-15 09:15:00','2016-10-15 12:15:01'], \
'end_date': ['2016-10-24 10:10:19', '2016-10-24 12:40:30','2016-10-26 11:15:00', '2016-11-09 22:11:00','2016-11-11 13:30:00','2016-11-09 07:25:00','2016-10-15 12:15:00','2016-10-17 17:25:00'] })
df1 = pd.DataFrame({'ID': ['1','2','4'], \
'ward': ['radiology', 'rehabilitation','radiology'], \
'date': ['2016-10-23 10:50:00', '2016-11-24 10:20:00','2016-10-15 18:41:30']})
I want to populate the data shown in df1
into df
by comparing the ID and if the date
in the df1
falls somewhere between the start_date
and end_date
of df
. If both conditions match, I would like to add another row (data taken from df1
) in the df
for that specific ID. Where I add the new row, I also would like to change the date/time on the previous and the next row.
What I want is the following as an end result:
ID ward start_date end_date
0 1 icu 2016-10-22 18:19:19 2016-10-23 10:50:00
1 1 radiology 2016-10-23 10:50:00 2016-10-23 10:50:00
2 1 icu 2016-10-23 10:50:00 2016-10-24 10:10:19
3 1 surgery 2016-10-24 10:20:00 2016-10-24 12:40:30
4 1 icu 2016-10-24 12:41:30 2016-10-26 11:15:00
5 2 neurology 2016-11-09 19:41:30 2016-11-09 22:11:00
6 2 neurology 2016-11-09 23:20:00 2016-11-11 13:30:00
7 3 obstetrics 2016-11-08 09:45:00 2016-11-09 07:25:00
8 4 OPD 2016-10-15 09:15:00 2016-10-15 12:15:00
9 4 hematology 2016-10-15 12:15:00 2016-10-15 18:41:30
10 4 radiology 2016-10-15 18:41:30 2016-10-15 18:41:30
11 4 hematology 2016-10-15 18:41:30 2016-10-17 17:25:00
In this example, ID 1 and ID 4 met the condition in both dataframes. Just explaining the example of ID 1, initially ID 1 moved from icu -> surgery -> icu, but after comparing and populating new row, the final data shows that ID 1 moves from icu -> radiology -> icu -> surgery -> icu. now ID 1 has five row instead of 3 and in every row, start_date and end_date is updated as well.
The dataset (df) is large and includes 1 Million rows and I do not know what method should I use to get the right result efficiently. Any help will be appreciated.