
I have written two nested functions that extend one dataframe (df1) by filtering another dataframe (df2) and transforming it into a list with some logic. The example below is of course only a very small one; df1 and df2 are much bigger in reality. Since this procedure takes a lot of time for many rows, I would like to optimize the script for performance, so that the functions themselves run as fast as possible and can then be parallelized. I have already tried parallelizing with Swifter, but somehow it no longer works. I guess Swifter is not the optimal module for this?

Here are the dataframes:

import pandas as pd

df1 = pd.DataFrame({'name':['10004', '20005', '10003', 'X2'],
                    'group':['1', '2', '3', 'X2'],
                    'code':['H', 'H', 'H', 'R'],
                    'start':[2, 3, 5, 2],
                    'end':[5, 8, 8, 5] })

df2 = pd.DataFrame({'name': 5*['10004'] + 10*['20005'] + 8*['10003'] + 6*['X2'],
                    'group':5*['1'] +     10*['2'] +     8*['3'] +     6*['X2'],
                    'code': 5*['H'] +     10*['H'] +     8*['H'] +     6*['R'],
                    'ID':list(range(1,6)) + 
                         list(range(1,11)) + 
                         list(range(1,9)) + 
                         list(range(1,7)),
                    'ConcFZ':['1', '1,2' , '', '3', '4', 
                          '3,4', '3', '3', '2', '', '2', '', '2,1', '1', '1',
                          '8', '5','6', '', '6', '', '2', '2',
                          '3', '3', '3,2,1', '2', '2', '1'],
                    'NumFZ':[1, 2 , 0, 1, 1, 
                          2, 1, 1, 1, 0, 1, 0, 2, 1, 1,
                          1, 1,1, 0, 1, 0, 1, 1,
                          1, 1, 3, 1, 1, 1]})

And the Functions:


def Filter_df(row, counter=0):
    df_filtered = df2[df2['name'].isin([row['name']])&
                   df2['group'].isin([row['group']])&
                   df2['code'].isin([row['code']])&
                   ~df2['NumFZ'].isin([0])]\
                    .set_index('ID')\
                    .loc[row['start']:row['end']]\
                    .drop_duplicates(subset='ConcFZ', keep='last')[['ConcFZ', 'NumFZ']] 
    
    if df_filtered.size == 0:
        print('No Data at Index:', row.name)
        return []
    
    else:
        return TzToList(df_filtered)

def TzToList(df_filtered):
    
    TWTZ = df_filtered[df_filtered['NumFZ'] == 1]['ConcFZ'].astype(int).tolist()
            
    if df_filtered.shape[0] == 1 and df_filtered.iat[0,1] > 1: 
        tz=[]
        tz=[
            int(df_filtered['ConcFZ'].str.split(',').iat[0][f]) 
                for f in range(0, len(df_filtered['ConcFZ'].str.split(',').iat[0][:]))
            ]
        tz.sort
        TWTZ.append(tz[0])
                
    elif df_filtered.shape[0] == 1 and df_filtered.iat[0,1] == 1:
        pass
            
    elif df_filtered.iat[0,1] == 0:
        print('LRILred.iat[0,1] == 0?: ', df_filtered.iat[0,1])
            
    else:
        df_filtered_g1 = df_filtered[df_filtered['NumFZ'] >1]
        for i in range(0, df_filtered_g1.shape[0]):
            tz=[]
            tz=[
                int(df_filtered_g1['ConcFZ'].str.split(',').iat[i][f]) 
                for f in range(0, len(df_filtered_g1['ConcFZ'].str.split(',').iat[i][:]))
                ]
            tz.sort
                    
            if len(list(set(tz).intersection(TWTZ))) == 0:          
                    TWTZ.append(tz[0])
                        
            else:
                continue
                
    return TWTZ

As you can see, the function Filter_df uses some row values from df1 to filter df2 and returns the output of the function TzToList. TzToList takes the filtered dataframe, simplifies the data even further, and converts the result into a list. This list is to be added to df1 as a list column.

I do this like this:

df1['Filtered'] = df1.apply(Filter_df, axis=1)

My Python version is 3.9.13, my pandas version is 1.5.2, and I use this script in a Jupyter notebook with JupyterLab.

Here is the first version of the Filter_df function, which was slower than the one above:

def Filter_df_1(row, counter=0):
    
    df_filtered = df2[(df2['name']==row['name'])&
               (df2['group']==row['group'])&
               (df2['code']==row['code'])&
               (df2['NumFZ']!=0)]\
               .set_index('ID')\
               .loc[row['start']:row['end']]\
               .drop_duplicates(subset='ConcFZ', keep='last')[['ConcFZ', 'NumFZ']]
    
    if df_filtered.size==0:
        print('No Data at Index:', row.name)
        return []
    else:
        return TzToList(df_filtered)

The parallelization worked with Filter_df_1 and swifter on Win10, as well as with pandarallel on my Linux system. Somehow it no longer works with swifter:

import swifter

df1['Filtered'] = df1.swifter.apply(Filter_df, axis=1)

Anyway, I need this procedure to run on Win10 with a 32-core CPU and 64 threads. What would be the best module to use? Dask? And how would I use it?

1 Answer


The biggest problem with your current implementation is that it runs in quadratic time (O(n**2)): for each row of df1, you traverse the whole df2 dataframe. Quadratic algorithms are inefficient on large datasets.

Let me set the record straight: there is no magical module that can make such a quadratic algorithm fast. You need to improve the complexity first (quasi-linear algorithms are fine). When a better algorithm is available, using distributed computing or multiple cores just wastes significantly more precious resources. Even when the complexity cannot be improved, doing less work is better than throwing more computing resources at the problem.

The key to avoiding a traversal of df2 for every row is to sort the dataframe (by multiple keys, since the condition involves multiple columns). Then you can perform a binary search on the resulting dataframe. Another strategy is to do a group-by so as to pre-split df2 and quickly return the part matching the selected row. Each dataframe group can be put into a dictionary so it can be fetched quickly (see this post for an example with 1 column). Because the condition df2['NumFZ']!=0 is independent of the target row, you can pre-filter the whole df2 dataframe once before doing the group-by/sort. This reduces the cost of the filtering from O(len(df2)) to O(1) per row, with an O(len(df2)) pre-computation. A sketch of this approach is shown below.
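Here is a rough sketch of the group-by strategy (the names df2_nonzero, groups and Filter_df_grouped are only illustrative, and the slicing/deduplication is kept identical to your code):

# Pre-filter df2 once (NumFZ != 0 does not depend on the df1 row),
# then split it by the three key columns and store each part in a dict.
df2_nonzero = df2[df2['NumFZ'] != 0]
groups = {key: part.set_index('ID')
          for key, part in df2_nonzero.groupby(['name', 'group', 'code'])}

def Filter_df_grouped(row):
    key = (row['name'], row['group'], row['code'])
    part = groups.get(key)                     # O(1) lookup instead of scanning df2
    if part is None:
        print('No Data at Index:', row.name)
        return []
    df_filtered = (part.loc[row['start']:row['end']]
                       .drop_duplicates(subset='ConcFZ', keep='last')[['ConcFZ', 'NumFZ']])
    if df_filtered.size == 0:
        print('No Data at Index:', row.name)
        return []
    return TzToList(df_filtered)

df1['Filtered'] = df1.apply(Filter_df_grouped, axis=1)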

The next part could result in a quadratic execution in the worst case, but that is unlikely as long as the ranges remain small. drop_duplicates can be optimized by pre-computing hashes for each list so you only need to compare the lists when their hashes are equal (very rare). The fast removal of duplicates can be done in O(n) time for n items using a dictionary. Sorting is an alternative option typically running in O(n log n) (it should be slower in this case in practice). Bloom filters can be even more efficient here when combined with a JIT compiler (see this related post). In practice, Pandas should have a relatively good complexity for this last operation, but creating a Pandas dataframe has a significant overhead, so it may be better to avoid it (typically by converting the data to NumPy first).
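As a minimal illustration of the dictionary-based removal of duplicates on plain Python lists (the helper name dedup_keep_last is mine), mirroring the effect of drop_duplicates(subset='ConcFZ', keep='last') up to row ordering:

# O(n) deduplication by ConcFZ, keeping the values of the last occurrence.
def dedup_keep_last(conc_fz, num_fz):
    last = {}                        # ConcFZ value -> (ConcFZ, NumFZ) of its last occurrence
    for c, n in zip(conc_fz, num_fz):
        last[c] = (c, n)             # later occurrences overwrite earlier ones
    return list(last.values())

# Example on lists extracted with .tolist():
pairs = dedup_keep_last(['1', '1,2', '1', '3'], [1, 2, 1, 1])
# -> [('1', 1), ('1,2', 2), ('3', 1)]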

I assume the bottleneck was mainly the Filter_df function, since the second function should operate on significantly smaller data thanks to the filtering. That being said, there are a few optimizations to consider:

Pandas operations are expensive, especially on a whole dataframe, so it is often better to convert individual columns to NumPy and to select the column before filtering the rows. For example, df_filtered['ConcFZ'][df_filtered['NumFZ'].to_numpy() == 1].astype(int).tolist() is about 4 times faster than df_filtered[df_filtered['NumFZ'] == 1]['ConcFZ'].astype(int).tolist() on my machine on small dataframes.
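Side by side, in the context of TzToList:

# Slower: filters the whole dataframe, then selects the column
TWTZ = df_filtered[df_filtered['NumFZ'] == 1]['ConcFZ'].astype(int).tolist()

# Faster: select the column first and build the boolean mask on a NumPy array
mask = df_filtered['NumFZ'].to_numpy() == 1
TWTZ = df_filtered['ConcFZ'][mask].astype(int).tolist()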

The CPython interpreter does not optimize replicated expressions, so when an expression is repeated N times, it is recomputed N times. For example, [int(df_filtered['ConcFZ'].str.split(',').iat[0][f]) for f in range(0, len(df_filtered['ConcFZ'].str.split(',').iat[0][:]))] recomputes df_filtered['ConcFZ'].str.split(',').iat[0] N+1 times for no reason, while it could be pre-computed once before the comprehension. Note that the [:] is useless and performs an unnecessary, slow copy. Pre-computing also makes the code more readable and more maintainable (see DRY). In fact this line seems very convoluted to me: you can just write [int(e) for e in df_filtered['ConcFZ'].str.split(',').iat[0]]. One should also keep in mind not to recompute the split df_filtered_g1.shape[0] times in the last loop.
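For instance, a sketch of the last loop of TzToList with the split pre-computed once (the name splits is illustrative):

df_filtered_g1 = df_filtered[df_filtered['NumFZ'] > 1]
splits = df_filtered_g1['ConcFZ'].str.split(',')    # computed once, outside the loop

for i in range(df_filtered_g1.shape[0]):
    tz = sorted(int(e) for e in splits.iat[i])      # sorted() actually sorts, unlike the bare tz.sort
    if not set(tz).intersection(TWTZ):
        TWTZ.append(tz[0])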

There are probably more improvements to make, but this is already a lot of change and maybe enough to get a fast program.

Finally, note that tz.sort does nothing: you need to add the final () so it actually calls the sort method.

Jérôme Richard