cartesian product in pandas

Question

I have two pandas dataframes:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})

What is the best practice to get their cartesian product (of course without writing it explicitly like me)?

#df1, df2 cartesian product
df_cartesian = DataFrame({'col1':[1,2,1,2],'col2':[3,4,3,4],'col3':[5,5,6,6]})

From pandas 1.2 you will soon be able to use `left.merge(right, how="cross")` and it will work like magic. See this [github PR](https://github.com/pandas-dev/pandas/pull/37864). — cs95, Nov 26 '20 at 07:33
It improves with readability of the question to show the dataframes in print/display format. — Mehdi Golari, Dec 09 '21 at 04:23

Matti John · Accepted Answer · 2021-03-30T16:49:42.573

178

In recent versions of Pandas (>= 1.2) this is built into merge so you can do:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})    

df1.merge(df2, how='cross')

This is equivalent to the previous pandas < 1.2 answer but is easier to read.

For pandas < 1.2:

If you have a key that is repeated for each row, then you can produce a cartesian product using merge (like you would in SQL).

from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})

merge(df1, df2,on='key')[['col1', 'col2', 'col3']]

Output:

   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

See here for the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html

edited Mar 30 '21 at 16:49

answered Nov 07 '12 at 12:47

Matti John

19,329
7
41
39

13

So to do this properly one has to first find an unused column name, then add dummy columns with that name, merge, and finally drop the column on the result? Creating, as opposed to reading, data with pandas is just a pain – Bananach Oct 10 '19 at 07:31
3

@Bananach woah! Relax my friend, it isn't that bad, they just hadn't gotten to it yet. Remember pandas is still a developing library and they only just released v1 recently. Anyway they are adding support for this in 1.2 inside df.merge(). See [here](https://stackoverflow.com/a/65017552/4909087) for more. – cs95 Dec 16 '20 at 09:08
1

@cs95 thanks, I hadn't noticed this was coming in 1.2. In the future that should be the preferred approach – Matti John Dec 16 '20 at 11:14
If all you want to do is merge two column, you can create df1 and df2 "anonymously" like so: `df[["purple"]].merge(df[["red"]], how="cross")`. Mind the double brackets `[["colname"]]` which makes them DataFrame and not Series. – Samuel Prevost May 11 '21 at 09:55

score 108 · Answer 2 · answered Oct 14 '17 at 11:32

108

Use pd.MultiIndex.from_product as an index in an otherwise empty dataframe, then reset its index, and you're done.

a = [1, 2, 3]
b = ["a", "b", "c"]

index = pd.MultiIndex.from_product([a, b], names = ["a", "b"])

pd.DataFrame(index = index).reset_index()

out:

answered Oct 14 '17 at 11:32

Gijs

10,346
5
27
38

8

I believe this is the most pandas-like way these days for pandas>=0.21 – Shadi Apr 23 '18 at 06:05
9

You have downvotes because you haven't shown how this is going to generalise for anything with more than 1 column. – cs95 Dec 08 '18 at 23:41
This function (https://stackoverflow.com/a/58242079/1840471) generalizes it to an arbitrary number of lists using a dict of args. It's a bit different from the question here, which takes the Cartesian product of two DataFrames (i.e. it's not taking the product of `df1.col1` and `df.col2`). – Max Ghenis Oct 04 '19 at 19:21
1

In fact I don't think `from_product` can be used for this problem. – Max Ghenis Oct 04 '19 at 19:22
@MaxGhenis Don't think that is useful for this case, we aren't talking about cartesian products of multiple arrays, we are talking about 2 or more DataFrames (completely different story). – cs95 Jan 02 '21 at 07:24

score 46 · Answer 3 · answered Oct 23 '17 at 18:15

46

Minimal code needed for this one. Create a common 'key' to cartesian merge the two:

df1['key'] = 0
df2['key'] = 0

df_cartesian = df1.merge(df2, how='outer')

answered Oct 23 '17 at 18:15

A.Kot

7,615
2
22
24

11

+ `df_cartesian = df_cartesian.drop(columns=['key'])` to clean up at the end – StackG Mar 15 '19 at 02:49

Rob Guderian · Answer 4 · 2017-04-18T14:08:48.620

38

This won't win a code golf competition, and borrows from the previous answers - but clearly shows how the key is added, and how the join works. This creates 2 new data frames from lists, then adds the key to do the cartesian product on.

My use case was that I needed a list of all store IDs on for each week in my list. So, I created a list of all the weeks I wanted to have, then a list of all the store IDs I wanted to map them against.

The merge I chose left, but would be semantically the same as inner in this setup. You can see this in the documentation on merging, which states it does a Cartesian product if key combination appears more than once in both tables - which is what we set up.

days = pd.DataFrame({'date':list_of_days})
stores = pd.DataFrame({'store_id':list_of_stores})
stores['key'] = 0
days['key'] = 0
days_and_stores = days.merge(stores, how='left', on = 'key')
days_and_stores.drop('key',1, inplace=True)

edited Apr 18 '17 at 14:08

answered Sep 21 '15 at 16:43

Rob Guderian

528
4
11

33

A bit shorter version: `days_and_stores = pd.merge(days.assign(key=0), stores.assign(key=0), on='key').drop('key', axis=1)` – Eugene Pakhomov Mar 11 '17 at 15:59
You mention crossJoin, but you are using a pandas dataframe, not a spark dataframe. – Bryce Guinta Apr 17 '17 at 04:44
Dang. Wasn't thinking. I use spark + pandas together so often, that when I saw the update to spark I thought about this post. Thanks Bryce. – Rob Guderian Apr 18 '17 at 14:10

score 25 · Answer 5 · answered Mar 16 '18 at 16:53

25

With method chaining:

product = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop("key", axis=1)
)

answered Mar 16 '18 at 16:53

pomber

23,132
10
81
94

cs95 · Answer 6 · 2021-01-02T07:19:45.140

Presenting to you

pandas >= 1.2

`left.merge(right, how='cross')`

import pandas as pd 

pd.__version__
# '1.2.0'

left = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
right = pd.DataFrame({'col3': [5, 6]}) 

left.merge(right, how='cross')

   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

Indexes are ignored in the result.

Implementation wise, this uses the join on common key column method as described in the accepted answer. The upsides of using the API is that it saves you a lot of typing and handles some corner cases pretty well. I'd almost always recommend this syntax as my first preference for cartesian product in pandas unless you're looking for something more performant.

Just checked https://github.com/pandas-dev/pandas/releases/tag/v1.2.0, pandas 1.2 was released December 26, 2020. The cross merge works for me! — Yi Zong Kuang, Jan 02 '21 at 07:17

score 16 · Answer 7 · answered Nov 03 '16 at 14:42

As an alternative, one can rely on the cartesian product provided by itertools: itertools.product, which avoids creating a temporary key or modifying the index:

import numpy as np 
import pandas as pd 
import itertools

def cartesian(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())

    df = pd.DataFrame(left.append(right) for (_, left), (_, right) in rows)
    return df.reset_index(drop=True)

Quick test:

In [46]: a = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

In [47]: b = pd.DataFrame(np.random.rand(5, 3), columns=["d", "e", "f"])    

In [48]: cartesian(a,b)
Out[48]:
           a         b         c         d         e         f
0   0.436480  0.068491  0.260292  0.991311  0.064167  0.715142
1   0.436480  0.068491  0.260292  0.101777  0.840464  0.760616
2   0.436480  0.068491  0.260292  0.655391  0.289537  0.391893
3   0.436480  0.068491  0.260292  0.383729  0.061811  0.773627
4   0.436480  0.068491  0.260292  0.575711  0.995151  0.804567
5   0.469578  0.052932  0.633394  0.991311  0.064167  0.715142
6   0.469578  0.052932  0.633394  0.101777  0.840464  0.760616
7   0.469578  0.052932  0.633394  0.655391  0.289537  0.391893
8   0.469578  0.052932  0.633394  0.383729  0.061811  0.773627
9   0.469578  0.052932  0.633394  0.575711  0.995151  0.804567
10  0.466813  0.224062  0.218994  0.991311  0.064167  0.715142
11  0.466813  0.224062  0.218994  0.101777  0.840464  0.760616
12  0.466813  0.224062  0.218994  0.655391  0.289537  0.391893
13  0.466813  0.224062  0.218994  0.383729  0.061811  0.773627
14  0.466813  0.224062  0.218994  0.575711  0.995151  0.804567
15  0.831365  0.273890  0.130410  0.991311  0.064167  0.715142
16  0.831365  0.273890  0.130410  0.101777  0.840464  0.760616
17  0.831365  0.273890  0.130410  0.655391  0.289537  0.391893
18  0.831365  0.273890  0.130410  0.383729  0.061811  0.773627
19  0.831365  0.273890  0.130410  0.575711  0.995151  0.804567
20  0.447640  0.848283  0.627224  0.991311  0.064167  0.715142
21  0.447640  0.848283  0.627224  0.101777  0.840464  0.760616
22  0.447640  0.848283  0.627224  0.655391  0.289537  0.391893
23  0.447640  0.848283  0.627224  0.383729  0.061811  0.773627
24  0.447640  0.848283  0.627224  0.575711  0.995151  0.804567

I tested this and it works, but it is much slower than above merge answers for large datasets. — MrJ, Feb 12 '18 at 22:36
@MrJ there's no other reason besides the usage of iterrows() here, it absolutely destroys any semblance of efficiency and takes minutes or hours for even a few thousand rows. Not worth it — cs95, Nov 28 '20 at 22:57

score 2 · Answer 8 · answered Mar 25 '13 at 02:58

2

If you have no overlapping columns, don't want to add one, and the indices of the data frames can be discarded, this may be easier:

df1.index[:] = df2.index[:] = 0
df_cartesian = df1.join(df2, how='outer')
df_cartesian.index[:] = range(len(df_cartesian))

answered Mar 25 '13 at 02:58

sergeyk

37
1

2

This looks promising - but I get the error on the first line: `TypeError: '' does not support mutable operations.` I can get around this by adding `, index=[0,0]` to the dataframe definition though. – Racing Tadpole May 22 '14 at 00:19
2

Or using `df1 = df1.set_index([[0]*len(df1)]))` (and similarly for `df2`). – Racing Tadpole May 22 '14 at 00:42
Racing Tadpole's edits made this work for me - thanks! – Sevyns Mar 10 '17 at 21:19

score 2 · Answer 9 · answered May 06 '20 at 09:02

Here is a helper function to perform a simple Cartesian product with two data frames. The internal logic handles using an internal key, and avoids mangling any columns that happen to be named "key" from either side.

import pandas as pd

def cartesian(df1, df2):
    """Determine Cartesian product of two data frames."""
    key = 'key'
    while key in df1.columns or key in df2.columns:
        key = '_' + key
    key_d = {key: 0}
    return pd.merge(
        df1.assign(**key_d), df2.assign(**key_d), on=key).drop(key, axis=1)

# Two data frames, where the first happens to have a 'key' column
df1 = pd.DataFrame({'number':[1, 2], 'key':[3, 4]})
df2 = pd.DataFrame({'digit': [5, 6]})
cartesian(df1, df2)

shows:

   number  key  digit
0       1    3      5
1       1    3      6
2       2    4      5
3       2    4      6

score 1 · Answer 10 · answered Nov 03 '19 at 16:22

You could start by taking the Cartesian product of df1.col1 and df2.col3, then merge back to df1 to get col2.

Here's a general Cartesian product function which takes a dictionary of lists:

def cartesian_product(d):
    index = pd.MultiIndex.from_product(d.values(), names=d.keys())
    return pd.DataFrame(index=index).reset_index()

Apply as:

res = cartesian_product({'col1': df1.col1, 'col3': df2.col3})
pd.merge(res, df1, on='col1')
#  col1 col3 col2
# 0   1    5    3
# 1   1    6    3
# 2   2    5    4
# 3   2    6    4

score 0 · Answer 11 · answered Sep 27 '21 at 20:44

Yet another workaround for the current version of Pandas (1.1.5): this one is particularly useful if you're starting off with a non-dataframe sequence. I haven't timed it. It does not require any artificial index manipulation, but does require that you repeat the second sequence. It relies on a special property of explode, namely that the right-hand index is repeated.

df1 = DataFrame({'col1': [1,2], 'col2': [3,4]})

series2 = Series(
    [[5, 6]]*len(df1),
    name='col3',
    index=df1.index,
)

df_cartesian = df1.join(series2.explode())

This outputs

   col1  col2 col3
0     1     3    5
0     1     3    6
1     2     4    5
1     2     4    6

score 0 · Answer 12 · answered Oct 11 '21 at 21:22

You could use expand_grid from pyjanitor to replicate a cross join; it offers some speed performance for larger datasets (it uses np.meshgrid underneath):

pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
jn.expand_grid(others = {"df1":df1, "df2":df2})

   df1       df2
  col1 col2 col3
0    1    3    5
1    1    3    6
2    2    4    5
3    2    4    6

score -1 · Answer 13 · answered Sep 05 '17 at 06:43

-1

I find using pandas MultiIndex to be the best tool for the job. If you have a list of lists lists_list, call pd.MultiIndex.from_product(lists_list) and iterate over the result (or use it in DataFrame index).

answered Sep 05 '17 at 06:43

Ankur Kanoria

957
8
9

cartesian product in pandas

13 Answers13

pandas >= 1.2

`left.merge(right, how='cross')`

Linked

Related