I have a Pandas dataframe that has two key columns, and I want to ensure that the Cartesian product of those keys exist in the table (because I'll have to make a 2D plot containing all combinations). I'm having trouble coming up with a reasonably brief and idiomatic way to do this.
For example, I start with this table giving combinations of fruits and vegetables, and how they taste together:
combo fruit veg
0 tasty apple carrot
1 yucky banana carrot
2 tasty banana lettuce
3 yucky lemon lettuce
I want to end up with this table in which all possible combinations occur:
fruit veg combo
0 apple carrot tasty
1 apple lettuce UNKNOWN
2 banana carrot yucky
3 banana lettuce tasty
4 lemon carrot UNKNOWN
5 lemon lettuce yucky
Here's the best way I've found to do it:
import pandas as pd
# Initial data
df=pd.DataFrame(dict(fruit=['apple','banana','banana','lemon'],
veg=['carrot','carrot','lettuce','lettuce'],
combo=['tasty','yucky','tasty','yucky']))
# Solution starts here
veg=df.groupby('veg').size().reset_index()
fruit=df.groupby('fruit').size().reset_index()
fruit[0] = veg[0] = 0 #use this dummy column for the join to work!
cartesian = pd.merge(fruit, veg, how='outer', on=0)
del cartesian[0]
all_combos = pd.merge(cartesian, df, how='left')
all_combos[ pd.isnull(all_combos.combo) ] = 'UNKNOWN'
I imagine that there's got to be a simpler and less error-prone way to do this... any advice?
I'd especially appreciate it if someone could show me how to do this both with and without a multi-index containing the fruit
and veg
columns, because I am really stumped about how to do this with indexes. Based on my SQL experience, I'd think these are exactly the situations that indexes are intended for.