12

I have a Pandas dataframe that has two key columns, and I want to ensure that the Cartesian product of those keys exist in the table (because I'll have to make a 2D plot containing all combinations). I'm having trouble coming up with a reasonably brief and idiomatic way to do this.

For example, I start with this table giving combinations of fruits and vegetables, and how they taste together:

   combo   fruit      veg
0  tasty   apple   carrot
1  yucky  banana   carrot
2  tasty  banana  lettuce
3  yucky   lemon  lettuce

I want to end up with this table in which all possible combinations occur:

    fruit      veg    combo
0   apple   carrot    tasty
1   apple  lettuce  UNKNOWN
2  banana   carrot    yucky
3  banana  lettuce    tasty
4   lemon   carrot  UNKNOWN
5   lemon  lettuce    yucky

Here's the best way I've found to do it:

import pandas as pd

# Initial data
df=pd.DataFrame(dict(fruit=['apple','banana','banana','lemon'],
                     veg=['carrot','carrot','lettuce','lettuce'],
                     combo=['tasty','yucky','tasty','yucky']))

# Solution starts here
veg=df.groupby('veg').size().reset_index()
fruit=df.groupby('fruit').size().reset_index()
fruit[0] = veg[0] = 0    #use this dummy column for the join to work!
cartesian = pd.merge(fruit, veg, how='outer', on=0)
del cartesian[0]
all_combos = pd.merge(cartesian, df, how='left')
all_combos[ pd.isnull(all_combos.combo) ] = 'UNKNOWN'

I imagine that there's got to be a simpler and less error-prone way to do this... any advice?

I'd especially appreciate it if someone could show me how to do this both with and without a multi-index containing the fruit and veg columns, because I am really stumped about how to do this with indexes. Based on my SQL experience, I'd think these are exactly the situations that indexes are intended for.

Dan Lenski
  • 76,929
  • 13
  • 76
  • 124

1 Answers1

13

Sometime after this answer, I added cartesian_product to pandas, and soon after MultiIndex.from_product was added (following its suggestion in another question). This enables the following simplification which is more efficient:

In [21]: p = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)

In [22]: df1.reindex(p, fill_value='UNKNOWN')
Out[22]:
                  combo
fruit  veg
apple  carrot     tasty
       lettuce  UNKNOWN
banana carrot     yucky
       lettuce    tasty
lemon  carrot   UNKNOWN
       lettuce    yucky

The older answer follows:


If you use fruit and veg as the index, then you could use itertools.product* to create the MultiIndex to reindex by:

In [10]: from itertools import product

In [11]: df
Out[11]:
   combo   fruit      veg
0  tasty   apple   carrot
1  yucky  banana   carrot
2  tasty  banana  lettuce
3  yucky   lemon  lettuce

The tricky part is to grab the right MultiIndex of all the possible fruit/veg:

In [12]: fruit_x_veg = list(product(np.unique(df['fruit']), np.unique(df['veg'])))

In [13]: fruit_x_veg = pd.MultiIndex.from_tuples(fruit_x_veg,
                                                 names=['fruit', 'veg'])

Then you can just reindex by these:

In [14]: df1 = df.set_index(['fruit', 'veg'])

In [15]: df1
Out[15]:
                combo
fruit  veg
apple  carrot   tasty
banana carrot   yucky
       lettuce  tasty
lemon  lettuce  yucky

In [16]: df1.reindex(fruit_x_veg, fill_value='UNKNOWN')
Out[16]:
                  combo
fruit  veg
apple  carrot     tasty
       lettuce  UNKNOWN
banana carrot     yucky
       lettuce    tasty
lemon  carrot   UNKNOWN
       lettuce    yucky

* If itertools.product is not fast enough consider using this numpy implemention

Note: this implementation was extended in the pandas.tools.util.cartesian_product, which now supports more dtypes (and is used under the hood in MultiIndex.from_product).

Community
  • 1
  • 1
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Thanks, Andy, this works well for me. Looks like there is also a `fill_value` optional argument to reindex which can save a step (e.g. `df1.reindex(fruit_x_veg, fill_value='UNKNOWN')`. – Dan Lenski Jun 10 '13 at 18:52
  • 1
    @Dan Excellent! oooh, how neat :) – Andy Hayden Jun 10 '13 at 19:01
  • 1
    @Dan FYI there is a recent [github issue](https://github.com/pydata/pandas/issues/3835) about adding this functionality natively. – Andy Hayden Jun 10 '13 at 19:03
  • Note that in current builds of pandas, `cartesian_product` is in `pandas.core.reshape.util` (not `pandas.tools.util`). – BeingQuisitive Jun 02 '17 at 15:13