0

I am looking to compute the mode over a dataframe that I previously filter with a mask. To explain the problem, below a sample of how the data look like:

ID,MASK,VALUE
1,[2,3],4
2,[4,1],2
3,[],2
4,[2],3

The result that I would like to obtain is the following:

ID,MASK,VALUE,VALUE_M
1,[2,3],4,2
2,[4,1],2,3
3,[],2,-1
4,[2],3,2

When the mode cannot be determined I would like to have the lowest number. When no MASK is defined, the value will be -1.

The code that I am using now is the following:

for index,row in df.iterrows():
    mask= row['MASK']
    if len(mask)>0:
        df.loc[index,'VALUE_M'] = df.loc[df['ID'].isin(MASK),'VALUE'].value_counts().index[0]
    else: 
        df.loc[index,'VALUE_M'] = -1

As you can see I am cycling over each row, which is highly unrecommended when using pandas, especially when there are a lot of rows (which is my case). I am looking for a more optimized way to obtain the result.

Any idea?

Guido Muscioni
  • 1,203
  • 3
  • 15
  • 37

2 Answers2

1

unnesting first then merge with original df, then we can do crosstab with idxmax

s=unnesting(df,['MASK'],axis=1).merge(df[['ID','VALUE']],left_on='MASK',right_on='ID')
pd.crosstab(s.ID_x,s.VALUE_y).idxmax(1).reindex(df.ID,fill_value=-1)
Out[268]: 
ID
1    2
2    3
3   -1
4    2
dtype: int64

def unnesting(df, explode, axis):
    if axis==1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
        return df1.join(df.drop(explode, 1), how='left')
    else :
        df1 = pd.concat([
                         pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, 1), how='left')
BENY
  • 317,841
  • 20
  • 164
  • 234
0

I'm operating under the assumption that your MASK column is stored as a series of str, therefore I am converting to a list inside of the find_mode function prior to using the isin() method. Not sure if this is faster than @WeNYoBen's answer, but I think it's quite a bit simpler to understand:

def find_mode(row):

    mask = ast.literal_eval(row['MASK'])

    return df[df['ID'].isin(mask)]['VALUE'].mode().values[0] if mask else -1

df['VALUE_M'] = df.apply(find_mode, axis=1)

Yields:

   ID   MASK  VALUE  VALUE_M
0   1  [2,3]      4        2
1   2  [4,1]      2        3
2   3     []      2       -1
3   4    [2]      3        2
rahlf23
  • 8,869
  • 4
  • 24
  • 54