2

I have data that looks like this:

    id          Date        Time        assigned_pat_loc    prior_pat_loc   Activity
0   45546325    2/7/2011    4:29:38     EIAB^EIAB^6         NaN             Admission
1   45546325    2/7/2011    5:18:22     8W^W844^A           EIAB^EIAB^6     Observation
2   45546325    2/7/2011    5:18:22     8W^W844^A           EIAB^EIAB^6    Transfer to 8W
3   45546325    2/7/2011    6:01:44     8W^W858^A           8W^W844^A      Bed Movement
4   45546325    2/7/2011    7:20:44     8W^W844^A           8W^W858^A      Bed Movement
5   45546325    2/9/2011    18:36:03    8W^W844^A           NaN     Discharge-Observation
6   45666555    3/8/2011    20:22:36    EIC^EIC^5           NaN         Admission
7   45666555    3/9/2011    1:08:04     53^5314^A           EIC^EIC^5   Admission
8   45666555    3/9/2011    1:08:04     53^5314^A           EIC^EIC^5   Transfer to 53
9   45666555    3/9/2011    17:03:38    53^5336^A           53^5314^A   Bed Movement

I need to find where there were multiple patients (identified with id column) are in the same room at the same time, the start and end times for those, the dates, and room number (assigned_pat_loc). assigned_pat_loc is the current patient location in the hospital, formatted as “unit^room^bed”.

So far I've done the following:

# Read in CSV file and remove bed number from patient location
data = pd.read_csv('raw_data.csv')
data['assigned_pat_loc'] = data['assigned_pat_loc'].str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)

# Convert Date column to datetime type
patient_data['Date'] = pd.to_datetime(patient_data['Date'])

# Sort dataframe by date
patient_data.sort_values(by=['Date'], inplace = True)

# Identify rows with duplicate room and date assignments, indicating multiple patients shared room
same_room = patient_data.duplicated(subset = ['Date','assigned_pat_loc'])

# Assign duplicates to new dataframe
df_same_rooms = patient_data[same_room]

# Remove duplicate patient ids but keep latest one
no_dups = df_same_rooms.drop_duplicates(subset = ['id'], keep = 'last')

# Group patients in the same rooms at the same times together
df_shuf = pd.concat(group[1] for group in df_same_rooms.groupby(['Date', 'assigned_pat_loc'], sort=False))

And then I'm stuck at this point:

        id          Date        Time        assigned_pat_loc    prior_pat_loc   Activity
599359  42963403    2009-01-01  12:32:25    11M^11MX           4LD^W463^A       Transfer 
296155  42963484    2009-01-01  16:41:55    11M^11MX                EIC^EIC^2       Transfer
1373    42951976    2009-01-01  15:51:09    11M^11MX            NaN             Discharge
362126  42963293    2009-01-01  4:56:57     11M^11MX               EIAB^EIAB^6      Transfer
362125  42963293    2009-01-01  4:56:57     11M^11MX           EIAB^EIAB^6      Admission
... ... ... ... ... ... ...
268266  46381369    2011-09-09  18:57:31    54^54X         11M^1138^A       Transfer
16209   46390230    2011-09-09  6:19:06     10M^1028               EIAB^EIAB^5      Admission
659699  46391825    2011-09-09  14:28:20    9W^W918            EIAB^EIAB^3      Transfer
659698  46391825    2011-09-09  14:28:20    9W^W918            EIAB^EIAB^3      Admission
268179  46391644    2011-09-09  17:48:53    64^6412            EIE^EIE^3        Admission

Where you can see different patients in the same room at the same time, but I don't know how to extract those intervals of overlap between two different rows for the same room and same times. And then to format it such that the start time and end time are related to the earlier and later times of the transpiring of a shared room between two patients. Below is the desired output.

image

Where r_id is the id of the other patient sharing the same room and length is the number of hours that room was shared.

hulio_entredas
  • 675
  • 1
  • 12
  • groupby with custom grouping function, where you check for overlaps in dates? and then map to desired final format? – Nikolay Jan 26 '23 at 06:18
  • Great, thanks. Could you write up an example? – hulio_entredas Jan 26 '23 at 06:52
  • What exactly does the `Time` column represent, and what do you mean by `start time` and `end time`? If `Time` is just the time of the day at which the patient was moved to the room then the exact period of time a patient stays in a room has to be extracted from the patient's history stored in the column `assigned_pat_loc` and `Activity`, correct? – Timus Jan 26 '23 at 09:48

1 Answers1

1

As suggested, you can use groupby. One more thing you need to take care of is finding the overlapping time. Ideally you'd use datetime which are easy to work with. However you used a different format so we need to convert it first to make the solution easier. Since you did not provide a workable example, I will just write the gist here:

# convert current format to datetime
df['start_datetime'] = pd.to_datetime(df.start_date) + df.start_time.astype('timedelta64[h]')
df['end_datetime'] = pd.to_datetime(df.end_date) + df.end_time.astype('timedelta64[h]')

df = df.sort_values(['start_datetime', 'end_datetime'], ascending=[True, False])

gb = df.groupby('r_id')
for g, g_df in gb:
    g_df['overlap_group'] = (g_df['end_datetime'].cummax().shift() <= g_df['start_datetime']).cumsum()
    print(g_df)

This is a tentative example, and you might need to tweak the datetime conversion and some other minor things, but this is the gist.

The cummax() detects where there is an overlap between the intervals, and cumsum() counts the number of overlapping groups, since it's a counter we can use it as a unique identifier.

I used the following threads:

Group rows by overlapping ranges

python/pandas - converting date and hour integers to datetime

Edit

After discussing it with OP the idea is to take each patient's df and sort it by the date of the event. The first one will be the start_time and the last one would be the end_time.

The unification of the time and date are not necessary for detecting the start and end time as they can sort by date and then by the time to get the same order they would have gotten if they did unify the columns. However for the overlap detection it does make life easier when it's in one column.

gb_patient = df.groupby('id')

patients_data_list = []

for patient_id, patient_df in gb_patient:
   patient_df = patient_df.sort_values(by=['Date', 'Time'])
   patient_data = {
       "patient_id": patient_id,
       "start_time": patient_df.Date.values[0] + patient_df.Time.values[0],
       "end_time": patient_df.Date.values[-1] + patient_df.Time.values[-1]
   }

   patients_data_list.append(patient_data)

new_df = pd.DataFrame(patients_data_list)

After that they can use the above code for the overlaps.

Dr. Prof. Patrick
  • 1,280
  • 2
  • 15
  • 27
  • Thanks for the tips. The start_date and end_date variables do not exist yet. Same with the start_time and end_time. I need to derive those from the dataframe as it is now, with just the Date and Time columns. The screenshot I shared at the bottom of my question is the desired output. So I need to convert the starting point to the desired output. I’m not sure if what I’ve done so far is on the right track, or how to incorporate your suggestions. – hulio_entredas Jan 26 '23 at 15:26
  • if you have the source data you can share it and i'll try to help – Dr. Prof. Patrick Jan 26 '23 at 15:28
  • Sure, thank you. It's a large csv file. Do you know how I can upload that here? – hulio_entredas Jan 26 '23 at 15:31
  • you should edit the original post with a small example of like 4-6 lines just so we get the gist – Dr. Prof. Patrick Jan 26 '23 at 15:32
  • Oh, the first inline code from indices 0-10 is a sample of the original data. It's just that there are a lot of duplicates in the original data because it shows each transfer a single patient had as well. – hulio_entredas Jan 26 '23 at 15:36
  • so how do you infer the end time from that? – Dr. Prof. Patrick Jan 26 '23 at 15:38
  • The larger dataset has 800,000+ rows, so my approach was to find rooms that had multiple patients in them on the same date, but I'm not sure how to take it to the next step where I would identify overlapping times and create a new dataframe with the desired columns. I assumed the end time would be the time that the one of the patients is transferred out of that room or discharged, or (unfortunately) dies. These are shown in the `Activity` column – hulio_entredas Jan 26 '23 at 15:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/251410/discussion-between-dr-prof-patrick-and-hulio-entredas). – Dr. Prof. Patrick Jan 26 '23 at 15:45