
I have my code set up below:

for index_outer, row_outer in df_outer.iterrows():

    # in the actual implementation df_devices would be different for every df_outer row,
    # but a fixed example is used here for each user

    for index_inner, row_inner in df_devices.iterrows():

        device_id = row_inner['devices']
        found = check_dates_containment(row_inner['start_date'], row_inner['end_date'], row_outer['start_date'], row_outer['end_date'])

        if found:
            print("All user data can be grabbed just from device_id: " + device_id, "\n")
            print(device_id)
            data = df_backend.loc[df_backend['devices'] == row_inner['devices'], 'data'].iloc[0]
            print("Single device data is: ", data)
            break
        else:
            print("Go to next inner loop iteration")

        # reached the last device without a containment match: fall back to all devices
        if index_inner == len(df_devices) - 1:
            print("Must grab data from all device_ids \n")
            data_list = []
            for index_inner, row_inner in df_devices.iterrows():
                device_id = row_inner['devices']
                data = df_backend.loc[df_backend['devices'] == row_inner['devices'], 'data'].iloc[0]
                data_list.append(data)

            final_data = data_list  # this is what will be returned
            print("All device data is: ", final_data)

Here are examples of the dataframes to use:

import pandas as pd
from datetime import datetime

df_outer = pd.DataFrame({"user": ['sally', 'mark', 'carmen'], "start_date": [datetime(2021,10,17,1,0,0), datetime(2021,10,14,1,0,0), datetime(2021,10,22,1,0,0)], "end_date": [datetime(2021,10,19,1,0,0), datetime(2021,10,22,1,0,0), datetime(2021,10,25,1,0,0)]})

df_devices = pd.DataFrame({"devices": ['a', 'b', 'c'], "start_date": [datetime(2021,10,20,1,0,0), datetime(2021,10,19,1,0,0), datetime(2021,10,16,1,0,0)], "end_date": [datetime(2021,10,24,1,0,0), datetime(2021,10,25,1,0,0), datetime(2021,10,28,1,0,0)]})

df_backend = pd.DataFrame({"devices": ['a', 'b', 'c'], "data": [[1, 25, 6, 8, 56, 8], [4, 565, 75, 76, 34, 46], [45, 65, 75, 324, 75, 23]]})

Below is the function that is called:

def check_dates_containment(start_date_inner: datetime, end_date_inner: datetime, start_date_outer: datetime, end_date_outer: datetime) -> bool:
    
    if start_date_inner < start_date_outer and end_date_inner > end_date_outer:
        print("Outer dates are contained in inner dates \n")
        return True
    else:
        print("Outer dates are not contained in inner dates \n")
        return False

df_devices holds the list of all devices for a user. If the criterion is met, i.e. found is True because the user's dates in df_outer are contained within the date range of one of the devices in df_devices, then data is taken from just that single device. If the inner loop finishes without the criterion ever being met, I must re-loop through the whole inner dataframe and extract data from all the devices.
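For reference, the same single-device-or-all-devices logic can also be written with Python's for/else (the else block runs only when the loop ends without a break); this is just a minimal sketch of the logic above, with get_user_data as a hypothetical wrapper name:

def get_user_data(row_outer, df_devices, df_backend):
    # get_user_data is a hypothetical helper name, not part of the code above
    for _, row_inner in df_devices.iterrows():
        if check_dates_containment(row_inner['start_date'], row_inner['end_date'],
                                   row_outer['start_date'], row_outer['end_date']):
            # one device's date range contains the user's range: grab only its data
            final_data = df_backend.loc[df_backend['devices'] == row_inner['devices'], 'data'].iloc[0]
            break
    else:
        # the loop finished without a break: no single device covers the range,
        # so grab data from every device
        final_data = [df_backend.loc[df_backend['devices'] == d, 'data'].iloc[0]
                      for d in df_devices['devices']]
    return final_data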

Important note: in the real implementation df_devices would be unique for every user. I'm just using a fixed example here, but in practice df_devices would be updated right after the first for loop, for each user.

Another thing to note is that the data = ... lookup against df_backend would in reality be a function call to my cloud services to extract the needed data; df_backend just emulates this.
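For illustration, such a call could be wrapped in a small helper so the loops above never touch df_backend directly; fetch_device_data is a hypothetical name standing in for the real cloud call:

def fetch_device_data(device_id):
    # hypothetical stand-in for the real cloud-service call;
    # here it just reads from the df_backend dataframe that emulates it
    return df_backend.loc[df_backend['devices'] == device_id, 'data'].iloc[0]

# usage inside the loops above:
# data = fetch_device_data(row_inner['devices'])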

The implementation I have above works fine; however, I am not too fond of it and feel that it's a bit messy and/or inefficient.

Also, from a performance point of view, df_outer is in reality a huge dataframe with thousands of rows, so the quicker the execution the better.

I was wondering if there is a cleaner way to implement something like this?

Ossz

1 Answer


The line df_backend.loc[df_backend['devices']==row_inner['devices'], 'data'].iloc[0] is clearly not efficient. Firstly, because Pandas row iteration is very slow and should really be avoided. Secondly, because df_backend['devices']==row_inner['devices'] iterates over all the rows of df_backend and creates a temporary boolean array, resulting in a poor algorithmic complexity.

Instead, you should use tmp = pd.merge(df_devices, df_backend, on='devices') in the first loop so as not to have to look up the value from another dataframe. More generally, please use vectorized Pandas/Numpy calls as much as possible; iterating over rows should be a last resort.

The code should look like the following (without the prints and comments):

for index_outer, row_outer in df_outer.iterrows():
    tmp = pd.merge(df_devices, df_backend, on='devices')

    for index_inner, row_inner in tmp.iterrows():
        found = check_dates_containment(row_inner['start_date'], row_inner['end_date'], row_outer['start_date'], row_outer['end_date'])

        if found:
            data = row_inner['data']
            break

        if index_inner == len(tmp) - 1:
            final_data = list(tmp['data'])

Note that pd.merge is slow for small dataframes. In fact, the overhead of Pandas and Numpy functions (and of Python more generally) is huge for small objects relative to their size.
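If that merge overhead matters, one lighter-weight alternative (a sketch, not part of the original answer, and only applicable while df_backend stays in memory) is to build the device-to-data lookup once and attach it with map instead of pd.merge:

# build the device -> data lookup once, outside the outer loop
backend_lookup = df_backend.set_index('devices')['data']

for index_outer, row_outer in df_outer.iterrows():
    # attach the data column to the (per-user) device dataframe without pd.merge
    tmp = df_devices.assign(data=df_devices['devices'].map(backend_lookup))
    # ... then proceed with the per-user logic from the code above, using tmp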

Jérôme Richard
  • Thanks - you mentioned in your second paragraph to avoid iterating over dataframes and to use vectorized Pandas/Numpy calls. I agree with this and really wanted to avoid using for loops as much as possible. Do you see any way to vectorize this whole implementation instead? I was struggling to see another way to do it outside of for loops. – Ossz Feb 18 '22 at 21:35
  • You can for example compare columns directly in `check_dates_containment` instead of individual items. That being said, the `break` is quite problematic here: depending on your input data, it may or may not be easy to partially vectorize this part. I do not fully understand what the code is meant to compute yet. As for the final `if`, it can be moved outside the second loop. Having objects in a dataframe is not great for vectorization; for example, the lists should likely be replaced with many rows. Vectorization sometimes requires completely rethinking the approach, and sometimes even the input/output format. – Jérôme Richard Feb 18 '22 at 22:30
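Following up on the suggestion in the last comment, a minimal sketch of what comparing the columns directly could look like for the example dataframes above. final_results is just an illustrative container, the strict </> comparisons mirror check_dates_containment, and this is not a drop-in for the real cloud-backed setup:

import pandas as pd

final_results = {}
for _, row_outer in df_outer.iterrows():
    # in the real implementation df_devices would already be the per-user dataframe
    tmp = pd.merge(df_devices, df_backend, on='devices')

    # vectorized containment test over all devices at once
    contains = (tmp['start_date'] < row_outer['start_date']) & (tmp['end_date'] > row_outer['end_date'])

    if contains.any():
        # data from the first device whose date range contains the user's range
        final_results[row_outer['user']] = tmp.loc[contains, 'data'].iloc[0]
    else:
        # no single device covers the range: take data from all devices
        final_results[row_outer['user']] = list(tmp['data'])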