I have a dataframe that records the number and type of fruits owned by various people. I'd like to add a column that indicates the top fruit(s) for each person. If a person has 2+ top-ranking fruits (aka, a tie), I want a list (or tuple) of them all.
Input
For example, let's say my input is this dataframe:
# Create all the fruit data
data = [{'fruit0':'strawberry','fruit0_count':23,'fruit1':'orange','fruit1_count':4,'fruit2':'grape','fruit2_count':27},
{'fruit0':'apple','fruit0_count':45,'fruit1':'mango','fruit1_count':45,'fruit2':'orange','fruit2_count':12},
{'fruit0':'blueberry','fruit0_count':30,'fruit1':'grapefruit','fruit1_count':32,'fruit2':'cherry','fruit2_count':94},
{'fruit0':'pineapple','fruit0_count':4,'fruit1':'grape','fruit1_count':4,'fruit2':'lemon','fruit2_count':67}]
# Add people's names as an index
df = pd.DataFrame(data, index=['Shawn', 'Monica','Jamal','Tracy'])
# Print the dataframe
df
. . . which creates the input dataframe:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count
Shawn strawberry 23 orange 4 grape 27
Monica apples 45 mango 45 orange 12
Jamal blueberry 30 grapefruit 32 cherry 94
Tracy pineapple 4 grape 4 lemon 67
Target output
What I'd like to get is a new column that gives the name of the top fruit for each person. If the person has two (or more) fruits that tied for first, I'd like a list or a tuple of those fruits:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count top_fruit
Shawn strawberry 23 orange 4 grape 27 grape
Monica apple 45 mango 45 orange 12 (apple,mango)
Jamal blueberry 30 grapefruit 32 cherry 94 cherry
Tracy pineapple 4 grape 4 lemon 67 lemon
My attempt far
The closest I've gotten is based on https://stackoverflow.com/a/38955365/6480859.
Problems:
- If there is a tie for top fruit, it only captures one top fruit (Monica's top fruit is only apple.)
- It's really complicated. Not really a problem, but if there is a more straightforward path, I'd like to learn it.
# List the columns that contain count numbers
cols = ['fruit0_count', 'fruit1_count', 'fruit2_count']
# Make a new dataframe with just those columns.
only_counts_df=pd.DataFrame()
only_counts_df[cols]=df[cols].copy()
# Indicate how many results you want. Note: If you increase
# this from 1, it gives you the #2, #3, etc. ranking -- it
# doesn't represent tied results.
nlargest = 1
# The next two lines are suggested from
# https://stackoverflow.com/a/38955365/6480859. I don't totally
# follow along . . .
order = np.argsort(-only_counts_df.values, axis=1)[:, :nlargest]
result = pd.DataFrame(only_counts_df.columns[order],
columns=['top{}'.format(i) for i in range(1, nlargest+1)],
index=only_counts_df.index)
# Join the results back to our original dataframe
result = df.join(result).copy()
# The dataframe now reports the name of the column that
# contains the top fruit. Convert this to the fruit name.
def id_fruit(row):
if row['top1'] == 'fruit0_count':
return row['fruit0']
elif row['top1'] == 'fruit1_count':
return row['fruit1']
elif row['top1'] == 'fruit2_count':
return row['fruit2']
else:
return "Failed"
result['top_fruit'] = result.apply(id_fruit,axis=1)
result = result.drop(['top1'], axis=1).copy()
result
. . . which outputs:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count top_fruit
Shawn strawberry 23 orange 4 grape 27 grape
Monica apple 45 mango 45 orange 12 apple
Jamal blueberry 30 grapefruit 32 cherry 94 cherry
Tracy pineapple 4 grape 4 lemon 67 lemon
Monica's top fruit should be apple and mango.
Any tips are welcome, thanks!