-1

I have a problem where I have to find the 'DETAILS' that appear the most number of times(count of 'ID') for a particular 'UNIT' and 'FUNCTION'

so in example below for UNIT = 0011, FUNCTION = 0001, most number of 'ID' are associated with DETAILS = 546545153113131. similiarly - for UNIT = 0011, FUNCTION = 0026, most number of 'ID' are associated with DETAILS = 546545153113132. for UNIT = 0012, FUNCTION = 0001, most number of 'ID' are associated with DETAILS = 546545153113131. for UNIT = 0012, FUNCTION = 0026, most number of 'ID' are associated with DETAILS = 546545153113132.

Dataframe here is quite large containing 5 Million rows. I am reading this data from a text file in a dataframe.

import pandas as pd
  
# initialize list of lists
data = [['000000000001', '0011','0001','546545153113131'],    
        ['000000000002', '0011','0026','546545153113132'],        
        ['000000000004', '0011','0001','546545153113133'],
        ['000000000005', '0011','0001','546545153113131'],    
        ['000000000006', '0012','0001','546545153113131'],    
        ['000000000007', '0012','0026','546545153113132'],
        ['000000000009', '0012','0001','546545153113133'],
        ['000000000010', '0012','0001','546545153113131']

         ]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['ID', 'UNIT', 'FUNCTION', 'DETAILS'])
df.sort_values(by=['ID', 'UNIT', 'FUNCTION'], ascending=[True,True,True])
df.groupby(['UNIT', 'FUNCTION','DETAILS']).count()

above piece of code gave me this-

                            ID
UNIT    FUNCTION    DETAILS 
0011    0001     546545153113131    2
                 546545153113133    1
        0026     546545153113132    1
0012    0001     546545153113131    2
                 546545153113133    1
        0026     546545153113132    1

what I am expecting is dataframe that can give me this -

                            
UNIT    FUNCTION    DETAILS        ID
0011    0001     546545153113131    2
0011    0026     546545153113132    1
0012    0001     546545153113131    2
0012    0026     546545153113132    1

I tried few solutions mentioned - GroupBy pandas DataFrame and select most common value , didn't worked. I am a Python beginner and apologies if what I had written does not makes much sense.

1 Answers1

0

I'd suggest to group again by ['UNIT', 'FUNCTION'] and keep only the first row where ID has the max value:

import pandas as pd
  
data = [['000000000001', '0011','0001','546545153113131'],    
        ['000000000002', '0011','0026','546545153113132'],        
        ['000000000004', '0011','0001','546545153113133'],
        ['000000000005', '0011','0001','546545153113131'],    
        ['000000000006', '0012','0001','546545153113131'],    
        ['000000000007', '0012','0026','546545153113132'],
        ['000000000009', '0012','0001','546545153113133'],
        ['000000000010', '0012','0001','546545153113131']

         ]

df = pd.DataFrame(data, columns=['ID', 'UNIT', 'FUNCTION', 'DETAILS'])
df.sort_values(by=['ID', 'UNIT', 'FUNCTION'], ascending=[True,True,True])

df = df.groupby(['UNIT', 'FUNCTION','DETAILS'], as_index=False).count()

df_out = pd.concat([
    sub_df[sub_df['ID']==sub_df['ID'].max()].iloc[0:1]
    for _, sub_df in df.groupby(['UNIT', 'FUNCTION'])
])

Output:

   UNIT FUNCTION          DETAILS  ID
0  0011     0001  546545153113131   2
2  0011     0026  546545153113132   1
3  0012     0001  546545153113131   2
5  0012     0026  546545153113132   1
Tranbi
  • 11,407
  • 6
  • 16
  • 33