
Suppose I have the following df:

import pandas as pd

df = pd.DataFrame({'g1': ['a', 'b', 'c'],
                   'g2': ['x', 'y', 'z'],
                   'val': [1, 2, 3]})
df
  g1 g2  val
0  a  x    1
1  b  y    2
2  c  z    3

whose 'primary key' (using SQL's jargon here) is g1 and g2.

How do I expand the dataframe so that it has a row for every combination of the existing values in g1 and g2? For the rows that are added, leaving NaN (np.nan) under the val column is fine. The result should look like the following:

g1  g2  val
a   x   1.0
a   y   NaN
a   z   NaN
b   x   NaN
b   y   2.0
b   z   NaN
c   x   NaN
c   y   NaN
c   z   3.0

I am currently doing the following, but I am sure it is not the best solution:

df.set_index(['g1','g2']).T.stack().unstack().T.reset_index()

I can also do merges, but when the number of keys is large the code looks cumbersome.
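For reference, the merge-based approach mentioned above might look like the following sketch (using a temporary constant join key so it works on older pandas; newer versions also support `how='cross'`):

```python
import pandas as pd

df = pd.DataFrame({'g1': ['a', 'b', 'c'],
                   'g2': ['x', 'y', 'z'],
                   'val': [1, 2, 3]})

# Cross-join the unique values of each key via a temporary constant
# column, then left-merge the original data back on.
keys = (df[['g1']].drop_duplicates().assign(_k=1)
          .merge(df[['g2']].drop_duplicates().assign(_k=1), on='_k')
          .drop(columns='_k'))
out = keys.merge(df, on=['g1', 'g2'], how='left')
# out has 9 rows; val is NaN for the 6 added combinations.
```

With more key columns, each one needs its own drop_duplicates/merge step, which is the cumbersomeness referred to above.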

GoCurry

2 Answers


Use @senderle's fast cartesian_product for performance:

v = cartesian_product(df.g1, df.g2)
idx = pd.MultiIndex.from_arrays([v[:, 0], v[:, 1]])

df.set_index(['g1', 'g2']).reindex(idx)
     val
a x  1.0
  y  NaN
  z  NaN
b x  NaN
  y  2.0
  z  NaN
c x  NaN
  y  NaN
  z  3.0

%timeit df.set_index(['g1','g2']).T.stack().unstack().T
%%timeit
v = cartesian_product(df.g1, df.g2)
idx = pd.MultiIndex.from_arrays([v[:, 0], v[:, 1]])
df.set_index(['g1', 'g2']).reindex(idx)

14.6 ms ± 840 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.56 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cs95
  • Thank you. This is very powerful. However, the `cartesian_product` function seems to mess up the data types when, for example, g1 is datetime and g2 is str. It requires a lot of manual processing afterwards. – GoCurry Aug 31 '18 at 01:15
  • 1
    Also, we want to pass `df.g1.unique()` and `df.g2.unique()` to `cartesian_product`. – GoCurry Aug 31 '18 at 01:16
  • @GoCurry Thanks for the suggestion. You're right. I'm not sure how to fix that though. Perhaps convert the datetime to str first and then convert it back later? – cs95 Aug 31 '18 at 03:58
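`cartesian_product` itself is defined in senderle's linked answer and is not reproduced here; a rough sketch along the same lines, using an object-dtype array to sidestep the mixed-dtype coercion raised in the comments (a simplification, not the original implementation), might be:

```python
import numpy as np

def cartesian_product(*arrays):
    # Build an open mesh over the inputs and broadcast each input into
    # its own output column. Using object dtype keeps mixed types
    # (e.g. datetime and str) from being coerced to a common dtype,
    # unlike a result_type-based version.
    n = len(arrays)
    out = np.empty([len(a) for a in arrays] + [n], dtype=object)
    for i, a in enumerate(np.ix_(*[np.asarray(a) for a in arrays])):
        out[..., i] = a
    return out.reshape(-1, n)
```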

Answering my own question. The following method should be much faster than the one in my question, but slower than @coldspeed's method. It can be a good alternative when the data types of the index levels are not the same.

import itertools

df.set_index(['g1', 'g2']).reindex(
    pd.MultiIndex.from_tuples(itertools.product(df.g1, df.g2)))

A rough speed comparison:

  • coldspeed: 100%
  • this: 135%
  • the one in the question: 270%
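A closely related variant (not in either answer above, so treat it as a sketch) builds the index directly with `pd.MultiIndex.from_product`, which also keeps each level's dtype separate:

```python
import pandas as pd

df = pd.DataFrame({'g1': ['a', 'b', 'c'],
                   'g2': ['x', 'y', 'z'],
                   'val': [1, 2, 3]})

# from_product takes one iterable per level; use unique values so
# duplicate keys in the data don't produce duplicate index entries.
idx = pd.MultiIndex.from_product([df.g1.unique(), df.g2.unique()],
                                 names=['g1', 'g2'])
out = df.set_index(['g1', 'g2']).reindex(idx).reset_index()
```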
GoCurry