Pandas - speed up the mode computation when applying a mask

Question

I am looking to compute the mode over a dataframe that I previously filter with a mask. To explain the problem, below a sample of how the data look like:

ID,MASK,VALUE
1,[2,3],4
2,[4,1],2
3,[],2
4,[2],3

The result that I would like to obtain is the following:

ID,MASK,VALUE,VALUE_M
1,[2,3],4,2
2,[4,1],2,3
3,[],2,-1
4,[2],3,2

When the mode cannot be determined I would like to have the lowest number. When no MASK is defined, the value will be -1.

The code that I am using now is the following:

for index,row in df.iterrows():
    mask= row['MASK']
    if len(mask)>0:
        df.loc[index,'VALUE_M'] = df.loc[df['ID'].isin(MASK),'VALUE'].value_counts().index[0]
    else: 
        df.loc[index,'VALUE_M'] = -1

As you can see I am cycling over each row, which is highly unrecommended when using pandas, especially when there are a lot of rows (which is my case). I am looking for a more optimized way to obtain the result.

Any idea?

Sorry, I forgot to modify every call to train. Train is df. I will edit the question. — Guido Muscioni, Jun 26 '19 at 22:05

score 1 · Accepted Answer · answered Jun 26 '19 at 22:25

unnesting first then merge with original df, then we can do crosstab with idxmax

s=unnesting(df,['MASK'],axis=1).merge(df[['ID','VALUE']],left_on='MASK',right_on='ID')
pd.crosstab(s.ID_x,s.VALUE_y).idxmax(1).reindex(df.ID,fill_value=-1)
Out[268]: 
ID
1    2
2    3
3   -1
4    2
dtype: int64

def unnesting(df, explode, axis):
    if axis==1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
        return df1.join(df.drop(explode, 1), how='left')
    else :
        df1 = pd.concat([
                         pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, 1), how='left')

score 0 · Answer 2 · answered Jun 26 '19 at 22:26

I'm operating under the assumption that your MASK column is stored as a series of str, therefore I am converting to a list inside of the find_mode function prior to using the isin() method. Not sure if this is faster than @WeNYoBen's answer, but I think it's quite a bit simpler to understand:

def find_mode(row):

    mask = ast.literal_eval(row['MASK'])

    return df[df['ID'].isin(mask)]['VALUE'].mode().values[0] if mask else -1

df['VALUE_M'] = df.apply(find_mode, axis=1)

Yields:

   ID   MASK  VALUE  VALUE_M
0   1  [2,3]      4        2
1   2  [4,1]      2        3
2   3     []      2       -1
3   4    [2]      3        2

Pandas - speed up the mode computation when applying a mask

2 Answers2