0

I have two dataframe:

df = pd.DataFrame({'ID': ['1','1','1','2','2','3','4','4'], \
                   'ward': ['icu', 'surgery','icu', 'neurology','neurology','obstetrics','OPD', 'surgery'], \
                   'start_date': ['2016-10-22 18:19:19', '2016-10-24 10:20:00','2016-10-24 12:41:30', '2016-11-09 19:41:30','2016-11-09 23:20:00','2016-11-08 09:45:00','2016-10-15 09:15:00','2016-10-15 12:15:01'], \
                   'end_date': ['2016-10-24 10:10:19', '2016-10-24 12:40:30','2016-10-26 11:15:00', '2016-11-09 22:11:00','2016-11-11 13:30:00','2016-11-09 07:25:00','2016-10-15 12:15:00','2016-10-17 17:25:00'] })
df1 = pd.DataFrame({'ID': ['1','2','4'], \
                   'ward': ['radiology', 'rehabilitation','radiology'], \
                   'date': ['2016-10-23 10:50:00', '2016-11-24 10:20:00','2016-10-15 18:41:30']})

I want to populate the data shown in df1 into df by comparing the ID and if the date in the df1 falls somewhere between the start_date and end_date of df. If both conditions match, I would like to add another row (data taken from df1) in the df for that specific ID. Where I add the new row, I also would like to change the date/time on the previous and the next row.

What I want is the following as an end result:

   ID        ward           start_date             end_date
0   1         icu  2016-10-22 18:19:19  2016-10-23 10:50:00
1   1   radiology  2016-10-23 10:50:00  2016-10-23 10:50:00
2   1         icu  2016-10-23 10:50:00  2016-10-24 10:10:19
3   1     surgery  2016-10-24 10:20:00  2016-10-24 12:40:30
4   1         icu  2016-10-24 12:41:30  2016-10-26 11:15:00
5   2   neurology  2016-11-09 19:41:30  2016-11-09 22:11:00
6   2   neurology  2016-11-09 23:20:00  2016-11-11 13:30:00
7   3  obstetrics  2016-11-08 09:45:00  2016-11-09 07:25:00
8   4         OPD  2016-10-15 09:15:00  2016-10-15 12:15:00
9   4  hematology  2016-10-15 12:15:00  2016-10-15 18:41:30
10  4   radiology  2016-10-15 18:41:30  2016-10-15 18:41:30
11  4  hematology  2016-10-15 18:41:30  2016-10-17 17:25:00

In this example, ID 1 and ID 4 met the condition in both dataframes. Just explaining the example of ID 1, initially ID 1 moved from icu -> surgery -> icu, but after comparing and populating new row, the final data shows that ID 1 moves from icu -> radiology -> icu -> surgery -> icu. now ID 1 has five row instead of 3 and in every row, start_date and end_date is updated as well.

The dataset (df) is large and includes 1 Million rows and I do not know what method should I use to get the right result efficiently. Any help will be appreciated.

HT121
  • 431
  • 2
  • 6
  • 14

1 Answers1

1

By interpretting the guidance from here I have the following method:

import pandas as pd

df = pd.DataFrame({'ID': ['1','1','1','2','2','3','4','4'], \
                   'ward': ['icu', 'surgery','icu', 'neurology','neurology','obstetrics','OPD', 'surgery'], \
                   'start_date': ['2016-10-22 18:19:19', '2016-10-24 10:20:00','2016-10-24 12:41:30', '2016-11-09 19:41:30','2016-11-09 23:20:00','2016-11-08 09:45:00','2016-10-15 09:15:00','2016-10-15 12:15:01'], \
                   'end_date': ['2016-10-24 10:10:19', '2016-10-24 12:40:30','2016-10-26 11:15:00', '2016-11-09 22:11:00','2016-11-11 13:30:00','2016-11-09 07:25:00','2016-10-15 12:15:00','2016-10-17 17:25:00'] })
df1 = pd.DataFrame({'ID': ['1','2','4'], \
                   'ward': ['radiology', 'rehabilitation','radiology'], \
                   'date': ['2016-10-23 10:50:00', '2016-11-24 10:20:00','2016-10-15 18:41:30']})

# Converting str datetime to datetime objects
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)
df1.date = pd.to_datetime(df1.date)

# Change the index to intervals
df_temp = df.copy()
df_temp.index = pd.IntervalIndex.from_arrays(df_temp['start_date'],df_temp['end_date'],closed='both')

# Find the interval to split
def find_interval(row):
    try:
        return df_temp.loc[row.date].loc[(df_temp.ID == row.ID)].iloc[0]
    except KeyError:
        # This value does not fall within any interval in df
        return

# These are all the rows to be altered:
to_remove = df1.apply(find_interval, axis=1).dropna()

"""
to_remove
  ID     ward           start_date             end_date
0  1      icu  2016-10-22 18:19:19  2016-10-24 10:10:19
2  4  surgery  2016-10-15 12:15:01  2016-10-17 17:25:00 """


# Create 3 new rows for every matching
def new_rows(row):
    try:
        # Create the new rows by taking information from the existing row
        existing = df_temp.loc[row.date].loc[(df_temp.ID == row.ID)].iloc[0]
        out = pd.DataFrame(dict(
            ID=[row.ID] * 3,
            ward=[existing.ward, row.ward, existing.ward],
            start_date=[existing.start_date, row.date, row.date],
            end_date=[row.date, row.date, existing.end_date]
        ))
        return out
    except KeyError:
        return

to_add = pd.concat(df1.apply(new_rows, axis=1).values)

"""
to_add
  ID       ward          start_date            end_date
0  1        icu 2016-10-22 18:19:19 2016-10-23 10:50:00
1  1  radiology 2016-10-23 10:50:00 2016-10-23 10:50:00
2  1        icu 2016-10-23 10:50:00 2016-10-24 10:10:19
0  4    surgery 2016-10-15 12:15:01 2016-10-15 18:41:30
1  4  radiology 2016-10-15 18:41:30 2016-10-15 18:41:30
2  4    surgery 2016-10-15 18:41:30 2016-10-17 17:25:00 """


# Remove the 'to_remove'
new = pd.concat([df,to_remove]).drop_duplicates(keep=False)

# Add the 'to_add'
new = pd.concat([new, to_add])

# Sort the finished dataframe
new = new.sort_values(['ID', 'start_date']).reset_index(drop=True)


new
   ID        ward           start_date             end_date
0   1         icu  2016-10-22 18:19:19  2016-10-23 10:50:00
1   1   radiology  2016-10-23 10:50:00  2016-10-23 10:50:00
2   1         icu  2016-10-23 10:50:00  2016-10-24 10:10:19
3   1     surgery  2016-10-24 10:20:00  2016-10-24 12:40:30
4   1         icu  2016-10-24 12:41:30  2016-10-26 11:15:00
5   2   neurology  2016-11-09 19:41:30  2016-11-09 22:11:00
6   2   neurology  2016-11-09 23:20:00  2016-11-11 13:30:00
7   3  obstetrics  2016-11-08 09:45:00  2016-11-09 07:25:00
8   4         OPD  2016-10-15 09:15:00  2016-10-15 12:15:00
9   4     surgery  2016-10-15 12:15:01  2016-10-15 18:41:30
10  4   radiology  2016-10-15 18:41:30  2016-10-15 18:41:30
11  4     surgery  2016-10-15 18:41:30  2016-10-17 17:25:00
Dillon
  • 997
  • 4
  • 13
  • Thank you for putting much efforts in it. I tried to copy paste your code and run it but it gave me error at 'to_add = pd.concat(df1.apply(new_rows, axis=1).values)'. ValueError: All objects passed were None. I also printed to_remove but that is an empty series. Any idea? I commented out the parts where you showed the output of every step. – HT121 Feb 15 '19 at 17:00
  • @HT121 Can you try it again, copy and paste everything I have written (I have just tried it and I get the results shown) – Dillon Feb 15 '19 at 17:09
  • still the same error. i am using python 3.5 and pandas version is 0.21.0. – HT121 Feb 15 '19 at 17:14
  • File "/dataMerge.py", line 87, in to_add = pd.concat(df1.apply(new_rows, axis=1).values) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/reshape/concat.py", line 212, in concat copy=copy) File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/reshape/concat.py", line 263, in __init__ raise ValueError('All objects passed were None') ValueError: All objects passed were None – HT121 Feb 15 '19 at 17:18
  • It seems like an update on `IntervalIndex` within pandas at version 0.24.1 [Interval and Period data may now be stored in a Series or DataFrame](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#storing-interval-and-period-data-in-series-and-dataframe) whereas from the last update at version 0.20.0 it is [just an introductory feature](https://pandas.pydata.org/pandas-docs/version/0.20/whatsnew.html#intervalindex) – Dillon Feb 15 '19 at 17:23
  • @HT121 Have you been able to update your pandas and see if this works or is that not an option? – Dillon Feb 18 '19 at 08:57