I have written two functions that are nested and that extend one dataframe (df1) by filtering another dataframe (df2) and transforming it into a list with some logic. The example is of course only a very small one. df1 and df2 are in reality much bigger. Since this procedure takes a lot of time for many rows, I would like to optimize the script in terms of performance. So that the functions themselves work as fast as possible and can be parallelized afterwards. I have already run a parallelization with Swifter. However, this no longer works somehow. I guess Swifter is not the optimal module for this?
Here are the dataframes:
df1 = pd.DataFrame({'name':['10004', '20005', '10003', 'X2'],
'group':['1', '2', '3', 'X2'],
'code':['H', 'H', 'H', 'R'],
'start':[2, 3, 5, 2],
'end':[5, 8, 8, 5] })
df2 = pd.DataFrame({'name': 5*['10004'] + 10*['20005'] + 8*['10003'] + 6*['X2'],
'group':5*['1'] + 10*['2'] + 8*['3'] + 6*['X2'],
'code': 5*['H'] + 10*['H'] + 8*['H'] + 6*['R'],
'ID':list(range(1,6)) +
list(range(1,11)) +
list(range(1,9)) +
list(range(1,7)),
'ConcFZ':['1', '1,2' , '', '3', '4',
'3,4', '3', '3', '2', '', '2', '', '2,1', '1', '1',
'8', '5','6', '', '6', '', '2', '2',
'3', '3', '3,2,1', '2', '2', '1'],
'NumFZ':[1, 2 , 0, 1, 1,
2, 1, 1, 1, 0, 1, 0, 2, 1, 1,
1, 1,1, 0, 1, 0, 1, 1,
1, 1, 3, 1, 1, 1]})
And the Functions:
def Filter_df(row, counter=0):
df_filtered = df2[df2['name'].isin([row['name']])&
df2['group'].isin([row['group']])&
df2['code'].isin([row['code']])&
~df2['NumFZ'].isin([0])]\
.set_index('ID')\
.loc[row['start']:row['end']]\
.drop_duplicates(subset='ConcFZ', keep='last')[['ConcFZ', 'NumFZ']]
if df_filtered.size == 0:
print('No Data at Index:', row.name)
return []
else:
return TzToList(df_filtered)
def TzToList(df_filtered):
TWTZ = df_filtered[df_filtered['NumFZ'] == 1]['ConcFZ'].astype(int).tolist()
if df_filtered.shape[0] == 1 and df_filtered.iat[0,1] > 1:
tz=[]
tz=[
int(df_filtered['ConcFZ'].str.split(',').iat[0][f])
for f in range(0, len(df_filtered['ConcFZ'].str.split(',').iat[0][:]))
]
tz.sort
TWTZ.append(tz[0])
elif df_filtered.shape[0] == 1 and df_filtered.iat[0,1] == 1:
pass
elif df_filtered.iat[0,1] == 0:
print('LRILred.iat[0,1] == 0?: ', df_filtered.iat[0,1])
else:
df_filtered_g1 = df_filtered[df_filtered['NumFZ'] >1]
for i in range(0, df_filtered_g1.shape[0]):
tz=[]
tz=[
int(df_filtered_g1['ConcFZ'].str.split(',').iat[i][f])
for f in range(0, len(df_filtered_g1['ConcFZ'].str.split(',').iat[i][:]))
]
tz.sort
if len(list(set(tz).intersection(TWTZ))) == 0:
TWTZ.append(tz[0])
else:
continue
return TWTZ
As you can see, the function "Filter_df" uses some row values from df1 to filter df2 and returns the output of function TzToList. TzToList takes the filtered df, simplifies this data even further, and converts the result into a list.This list is to be added to df1 as a list column.
I do this like this:
df1['Filtered'] = df1.apply(Filter_df, axis=1)
My python version is: 3.9.13 My pandas version is: 1.5.2 and I use this script in a jupyter notebook with jupyter-lab
Here is the first version of the Filtered_df function that was slower than the one above:
def Filter_df_1(row, counter=0):
df_filtered = df2[(df2['name']==row['name'])&
(df2['group']==row['group'])&
(df2['code']==row['code'])&
(df2['NumFZ']!=0)]\
.set_index('ID')\
.loc[row['start']:row['end']]\
.drop_duplicates(subset='ConcFZ', keep='last')[['ConcFZ', 'NumFZ']]
if df_filtered.size==0:
print('No Data at Index:', row.name)
return []
else:
return TzToList(df_filtered)
The parallelization worked with Filter_df_1 and swifter on win10 as well as by using pandarallel on my Linux system. Somehow it does not work with swifter any more.
import swifter
df1['Filtered'] = df1.swifter.apply(Filter_df, axis=1)
Anyway I need that procedure run on Win10 with a 32core CPU and 64 Threads. What would be the best Module to use? Dask? And How to use it?