Performant cartesian product (CROSS JOIN) with pandas

Question

The contents of this post were originally meant to be a part of Pandas Merging 101, but due to the nature and size of the content required to fully do justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames;

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})

left

  col1  col2
0    A     1
1    B     2
2    C     3

right

  col1  col2
0    X    20
1    Y    30
2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20
A       1      Y      30
A       1      Z      50
B       2      X      20
B       2      Y      30
B       2      Z      50
C       3      X      20
C       3      Y      30
C       3      Z      50

What is the most performant method of computing this result?

Would you like share your input in Github as well , I think adding `cross join` in pandas is really good to match all the join function in SQL . https://github.com/pandas-dev/pandas/issues/5401 — BENY, Dec 11 '18 at 20:58

cs95 · Accepted Answer · 2022-07-23T10:47:22.483

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

pandas <= 1.1.X

def cartesian_product_basic(left, right):
    return (
       left.assign(key=1).merge(right.assign(key=1), on='key').drop('key', 1))

cartesian_product_basic(left, right)

pandas >= 1.2

left.merge(right, how="cross") # implements the technique above

  col1_x  col2_x col1_y  col2_y
0      A       1      X      20
1      A       1      Y      30
2      A       1      Z      50
3      B       2      X      20
4      B       2      Y      30
5      B       2      Z      50
6      C       3      X      20
7      C       3      Y      30
8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your own risk!

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and

def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

   0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50

np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True

And, along similar lines,

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']
    

left2
   col1  col2
s1    A     1
s2    B     2
s1    C     3

right2
  col1  col2
x    X    20
y    Y    30
y    Z    50

np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left2, right2))
True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

   0  1  2   3  4  5
0  A  1  X  20  A  1
1  A  1  X  20  B  2
2  A  1  X  20  C  3
3  A  1  X  20  D  4
4  A  1  Y  30  A  1

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
               cartesian_product_basic(left2, right2))
True

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Performance Benchmarking Code
This is the timing script. All functions called here are defined above.

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['cartesian_product_basic', 'cartesian_product_generalized', 
              'cartesian_product_multi', 'cartesian_product_simplified'],
       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        # print(f,c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

_{* you are here}

Why do the column names become integers? When I attempt to rename them, `.rename()` runs, but the integers remain. — OverflowingTheGlass, Jan 11 '19 at 21:14
@CameronTaylor did you forget to call rename with axis=1 argument? — cs95, Jan 11 '19 at 21:14
nope...even more dense - I put quotes around the integers - thank you — OverflowingTheGlass, Jan 11 '19 at 21:16
another question. I'm using cartesian_product_simplified, and I'm (predictably) running out of memory when I try to join a 50K row df to a 30K row df. Any tips on getting over the memory issue? — OverflowingTheGlass, Jan 11 '19 at 21:54
@CameronTaylor do the other cartesian_product_* functions also throw a memory error? I would guess you could use cartesian_product_multi here. — cs95, Jan 11 '19 at 22:26
`cartesian_product_multi` is working for me, but all of my uint8 columns seem to have become floats. I can use astype('uint8') to get them back to the smaller data type, but for large datasets, this adds a lot of time. Is there a way for `cartesian_product_multi` to preserve column data types? — Nimai, May 14 '20 at 03:00
Unfortunately when joining on frames with mixed data types, it may not be possible to preserve the type of the original columns. I've tried to highlight this difference in the answer. — cs95, Jan 02 '21 at 01:22
FYI my friend , https://pandas.pydata.org/docs/reference/api/pandas.merge.html, now have cross ~ :-) — BENY, Mar 30 '21 at 13:43
I'm not following how `left.merge(right, how="cross")` plays into this. Obviously this answer was written before cross merge was added natively to Pandas, but then you edited in the native version later, so... is answer still relevant? that is, is the native version still slow? It's not included in the benchmarks and you only mention is once. — wjandrea, Jun 05 '22 at 19:38
@wjandrea the "cross" method internally implements `cartesian_product_basic` which is why I didn't include it again. — cs95, Jul 23 '22 at 10:46

BENY · Answer 2 · 2021-06-29T15:03:03.790

After pandas 1.2.0 merge now have option cross

left.merge(right, how='cross')

Using itertools product and recreate the value in dataframe

import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,[]),l)))
   0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50

Bharath M Shetty · Answer 3 · 2018-12-10T14:00:30.243

Here's an approach with triple concat

m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
       pd.concat([right]*len(left)).reset_index(drop=True) ], 1)

    col1  col2 col1  col2
0     A     1    X    20
1     A     1    Y    30
2     A     1    Z    50
3     B     2    X    20
4     B     2    Y    30
5     B     2    Z    50
6     C     3    X    20
7     C     3    Y    30
8     C     3    Z    50

score 0 · Answer 4 · answered Jun 30 '22 at 00:18

One option is with expand_grid from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor as jn

others = {'left':left, 'right':right}

jn.expand_grid(others = others)

  left      right
  col1 col2  col1 col2
0    A    1     X   20
1    A    1     Y   30
2    A    1     Z   50
3    B    2     X   20
4    B    2     Y   30
5    B    2     Z   50
6    C    3     X   20
7    C    3     Y   30
8    C    3     Z   50

abhygag · Answer 5 · 2021-11-17T05:39:03.223

-1

I think the simplest way would be to add a dummy column to each data frame, do an inner merge on it and then drop that dummy column from the resulting cartesian dataframe:

left['dummy'] = 'a'
right['dummy'] = 'a'

cartesian = left.merge(right, how='inner', on='dummy')

del cartesian['dummy']

edited Nov 17 '21 at 05:39

answered Nov 16 '21 at 18:17

abhygag

1
2

this was already discussed in the accepted answer. But now `left.merge(right, how="cross")` already does this without the need for a second column. – cs95 Nov 19 '21 at 09:43
Somehow cross didn't work for me. May be version issue. – abhygag Nov 20 '21 at 10:05

Performant cartesian product (CROSS JOIN) with pandas

5 Answers5

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Further Simplification

Performance Comparison

Continue Reading

Linked

Related