I have my code set-up below:
for index_outer, row_outer in df_outer.iterrows():
# here in actual code implementation df_devices would be different for every df_outer row but using a fixed example for each user
for index_inner, row_inner in df_devices.iterrows():
device_id = row_inner['devices']
found = check_dates_containment(row_inner['start_date'], row_inner['end_date'], row_outer['start_date'], row_outer['end_date'])
if found == True:
print("All user data can be grabbed just from device_id: " + device_id, "\n")
print(device_id)
data = df_backend.loc[df_backend['devices']==row_inner['devices'], 'data'].iloc[0]
print("Single device data is: ", data)
break
else:
print("Go to next inner loop iteration")
if index_inner == len(df_devices)-1:
print("Must grab data from all device_ids \n")
data_list = []
for index_inner, row_inner in df_devices.iterrows():
device_id = row_inner['devices']
data = df_backend.loc[df_backend['devices']==row_inner['devices'], 'data'].iloc[0]
data_list.append(data)
final_data = data_list # this is what will be returned
print("All device data is: ", final_data)
Here are examples of the dataframes to use:
df_outer = pd.DataFrame({"user": ['sally', 'mark', 'carmen'], "start_date": [datetime(2021,10,17,1,0,0), datetime(2021,10,14,1,0,0), datetime(2021,10,22,1,0,0)], "end_date": [datetime(2021,10,19,1,0,0), datetime(2021,10,22,1,0,0), datetime(2021,10,25,1,0,0)]})
df_devices = pd.DataFrame({"devices": ['a', 'b', 'c'], "start_date": [datetime(2021,10,20,1,0,0), datetime(2021,10,19,1,0,0), datetime(2021,10,16,1,0,0)], "end_date": [datetime(2021,10,24,1,0,0), datetime(2021,10,25,1,0,0), datetime(2021,10,28,1,0,0)]})
df_backend = pd.DataFrame({"devices": ['a', 'b', 'c'], "data": [[1, 25, 6, 8, 56, 8], [4, 565, 75, 76, 34, 46], [45, 65, 75, 324, 75, 23]]})
Below is the function called used:
def check_dates_containment(start_date_inner: datetime, end_date_inner: datetime, start_date_outer: datetime, end_date_outer: datetime) -> bool:
if start_date_inner < start_date_outer and end_date_inner > end_date_outer:
print("Outer dates are contained in inner dates \n")
return True
else:
print("Outer dates are not contained in inner dates \n")
return False
Where df_devices
has a list of all devices, if a certain criteria is i.e found
is True
where dates in df_outer
are contained in one of the dates in df_devices
it takes data from just that single device. If the inner loop finishes and the criteria is not met then I must reloop through the whole inner data frame to extract data from all the devices.
Important Note: that in the real implementation df_devices
would be unique for every user, I'm just using a fixed example for each but in practice df_devices
would be updated just after the first for loop for each user.
Another thing to note is that the data=
from df_data
- would in reality be a function call to my cloud services to extract the needed data. df_data
just emulates this.
The implementation I have above works fine - however I am not too fond of it and feel that it's a bit messy and/or inefficient.
Also from a performance point of view df_outer
in reality is a huge dataframe with 1000s of expected rows so the quicker the execution the better.
I was wondering if there is a cleaner way to implement something like this?