0

I have an pandas.DataFrame like that:

expl = {'label':['A','A','A','B','B','B','C','C','C','C','C','A','A'],
'predicted':[1,1,3,2,2,1,3,2,3,4,4,-1,-2]}

        }
myDf = DataFrame(expl)

I wanna to create a new column with the values on 'predicted' column but "renamed" with the most frequently values on 'predicted' from each 'label'.

I'm currently using the following code which works as expected:

for group in myDf['predicted'].unique():
    if group >=0:
        dfg  = myDf.loc[myDf['predicted']==group, ['label']]
        rename = myDf.loc[myDf['predicted']==group, ['label']].groupby(['label']).size().idxmax()
        myDf.loc[myDf['predicted']==group,'renamed']= rename
    else:
        myDf.loc[myDf['predicted']==group,'renamed']= 'NA'

But I don't think it's the best way to do this. Is there any way to do this better?(in speed or memory)

I also think in save the most frequently values on a dict and use myDf['predict'].apply(lambda v: d[v] ) to replace the values, but doesn't seem to be faster.

Note: Negative values will be considered Not Available('NA').

Expected Output:

    label   predicted   renamed
0   A   1   A
1   A   1   A
2   A   3   C
3   B   2   B
4   B   2   B
5   B   1   A
6   C   3   C
7   C   2   B
8   C   3   C
9   C   4   C
10  C   4   C
11  A   -1  NA
12  A   -2  NA 
L.Stefan
  • 341
  • 1
  • 14
  • https://stackoverflow.com/questions/48590268/pandas-get-the-most-frequent-values-of-a-column/48590361 – politinsa Aug 09 '21 at 12:51
  • @politinsa my problem not is just on get the most frequently, but how to create the new column with this values with in a good performance. It's not the same problem as the stack you sent – L.Stefan Aug 09 '21 at 13:07
  • Then look at most common value with groupby: https://stackoverflow.com/questions/15222754/groupby-pandas-dataframe-and-select-most-common-value – politinsa Aug 09 '21 at 13:11

1 Answers1

1

Try groupby()+transform() and calculate the mode and use mask() to create NaN where 'predicted' is negative number:

myDf['renamed']=(myDf.groupby('predicted')['label']
                    .transform(lambda x:x.mode().iloc[0])
                    .mask(myDf['predicted'].le(-1)))
Anurag Dabas
  • 23,866
  • 9
  • 21
  • 41
  • Nice, that way perform at least 10 time faster in my test case. Thank you, I will study these methods further. – L.Stefan Aug 09 '21 at 13:40