4

I have a dataframe that records the number and type of fruits owned by various people. I'd like to add a column that indicates the top fruit(s) for each person. If a person has 2+ top-ranking fruits (aka, a tie), I want a list (or tuple) of them all.

Input

For example, let's say my input is this dataframe:

# Create all the fruit data
data = [{'fruit0':'strawberry','fruit0_count':23,'fruit1':'orange','fruit1_count':4,'fruit2':'grape','fruit2_count':27},
                  {'fruit0':'apple','fruit0_count':45,'fruit1':'mango','fruit1_count':45,'fruit2':'orange','fruit2_count':12},
                  {'fruit0':'blueberry','fruit0_count':30,'fruit1':'grapefruit','fruit1_count':32,'fruit2':'cherry','fruit2_count':94},
                  {'fruit0':'pineapple','fruit0_count':4,'fruit1':'grape','fruit1_count':4,'fruit2':'lemon','fruit2_count':67}]

# Add people's names as an index 
df = pd.DataFrame(data, index=['Shawn', 'Monica','Jamal','Tracy'])

# Print the dataframe
df

. . . which creates the input dataframe:

        fruit0      fruit0_count    fruit1      fruit1_count    fruit2  fruit2_count
Shawn   strawberry  23              orange      4               grape   27
Monica  apples      45              mango       45              orange  12
Jamal   blueberry   30              grapefruit  32              cherry  94
Tracy   pineapple   4               grape       4               lemon   67

Target output

What I'd like to get is a new column that gives the name of the top fruit for each person. If the person has two (or more) fruits that tied for first, I'd like a list or a tuple of those fruits:

        fruit0      fruit0_count    fruit1      fruit1_count    fruit2  fruit2_count    top_fruit
Shawn   strawberry  23              orange      4               grape   27              grape
Monica  apple       45              mango       45              orange  12              (apple,mango)
Jamal   blueberry   30              grapefruit  32              cherry  94              cherry
Tracy   pineapple   4               grape       4               lemon   67              lemon

My attempt far

The closest I've gotten is based on https://stackoverflow.com/a/38955365/6480859.

Problems:

  1. If there is a tie for top fruit, it only captures one top fruit (Monica's top fruit is only apple.)
  2. It's really complicated. Not really a problem, but if there is a more straightforward path, I'd like to learn it.
# List the columns that contain count numbers
cols = ['fruit0_count', 'fruit1_count', 'fruit2_count']

# Make a new dataframe with just those columns.
only_counts_df=pd.DataFrame()
only_counts_df[cols]=df[cols].copy()

# Indicate how many results you want. Note: If you increase
# this from 1, it gives you the #2, #3, etc. ranking -- it 
# doesn't represent tied results.
nlargest = 1 

# The next two lines are suggested from 
# https://stackoverflow.com/a/38955365/6480859. I don't totally
# follow along . . . 
order = np.argsort(-only_counts_df.values, axis=1)[:, :nlargest]
result = pd.DataFrame(only_counts_df.columns[order], 
                      columns=['top{}'.format(i) for i in range(1, nlargest+1)],
                      index=only_counts_df.index)

# Join the results back to our original dataframe
result = df.join(result).copy()

# The dataframe now reports the name of the column that 
# contains the top fruit. Convert this to the fruit name.
def id_fruit(row):
    if row['top1'] == 'fruit0_count':
        return row['fruit0']
    elif row['top1'] == 'fruit1_count':
        return row['fruit1']
    elif row['top1'] == 'fruit2_count':
        return row['fruit2']
    else:
        return "Failed"
result['top_fruit'] = result.apply(id_fruit,axis=1)
result = result.drop(['top1'], axis=1).copy()
result

. . . which outputs:

        fruit0      fruit0_count    fruit1      fruit1_count    fruit2  fruit2_count    top_fruit
Shawn   strawberry  23              orange      4               grape   27              grape
Monica  apple       45              mango       45              orange  12              apple
Jamal   blueberry   30              grapefruit  32              cherry  94              cherry
Tracy   pineapple   4               grape       4               lemon   67              lemon

Monica's top fruit should be apple and mango.

Any tips are welcome, thanks!

relizt
  • 375
  • 2
  • 11

2 Answers2

3

Idea is filter each pair and unpair column to df1 and df2, then compare values by max and filter with DataFrame.mask, last get non missing values in apply:

df1 = df.iloc[:, ::2]
df2 = df.iloc[:, 1::2]
mask = df2.eq(df2.max(axis=1), axis=0)

df['top'] = df1.where(mask.to_numpy()).apply(lambda x: x.dropna().tolist(), axis=1)
print (df)
            fruit0  fruit0_count      fruit1  fruit1_count  fruit2  \
Shawn   strawberry            23      orange             4   grape   
Monica       apple            45       mango            45  orange   
Jamal    blueberry            30  grapefruit            32  cherry   
Tracy    pineapple             4       grape             4   lemon   

        fruit2_count             top  
Shawn             27         [grape]  
Monica            12  [apple, mango]  
Jamal             94        [cherry]  
Tracy             67         [lemon]  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

Here's what I've come up with:

maxes = df[[f"fruit{i}_count" for i in range(3)]].max(axis=1)
mask = df[[f"fruit{i}_count" for i in range(3)]].isin(maxes)
df_masked = df[[f"fruit{i}" for i in range(3)]][
    mask.rename(lambda x: x.replace("_count", ""), axis=1)
]

df["top_fruit"] = df_masked.apply(lambda x: x.dropna().tolist(), axis=1)

This will return

            fruit0  fruit0_count  ... fruit2_count       top_fruit
Shawn   strawberry            23  ...           27         [grape]
Monica       apple            45  ...           12  [apple, mango]
Jamal    blueberry            30  ...           94        [cherry]
Tracy    pineapple             4  ...           67         [lemon]
ignoring_gravity
  • 6,677
  • 4
  • 32
  • 65