1

I'd like to create a third column as a result of a cross join between my Columns A and B:

import pandas as pd
import numpy as np
df = pd.read_csv("data.csv", sep=",")
df
#    A    B    
# 0  0  Yes 
# 1  8   No 
# 2  2  Yes 
# 3  4  Maybe
# 4  6  NA

They have the following unique values:

>>> df['A'].drop_duplicates()
0       0
2       8
41      4
119     2
1246    3
1808    1
Name: A, dtype: int64

>>> df['B'].drop_duplicates()
                
0              NA
2           Maybe
320            No
5575          Yes
Name: B, dtype: object

I'd like to have a df['C'] with the combination of all cross joins, thus we should have 6 * 4 = 24 unique values in it:

#Column C should have 6 * 4 classes:

(1,Yes)=1  (1,No)=6  (1, Maybe)=12  (1, NA)=18
(2,Yes)=2  (2,No)=7  (2, maybe)=13    ...
(3,Yes)=3  (3,No)=8  ...
(4,Yes)=4  (4,No)=9
(8,Yes)=5   ...
(0,Yes)=0

Thus we should have the following:

Newdf
#    A    B    C  
# 0  0  Yes    0
# 1  8   No    9
# 2  2  Yes    2
# 3  4  Maybe  15
# 4  8  NA     22

Using this method, I have the following error:

out = df.merge(df[['B']].drop_duplicates().merge(df['A'].drop_duplicates(),how='cross').assign(C=lambda x : x.index+1))

Throws:

"No common columns to perform merge on. "
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

Any help would be appreciated.

Amir
  • 1,348
  • 3
  • 21
  • 44
  • 1
    can't you compute a simple `crosstab`: `pd.crosstab(df['A'], df['B'])`. If not, can you update your question with a self-sufficient and reproducible minimal example with the **exact matching expected output**? – mozway Jun 02 '22 at 13:56
  • @mozway, How do I use the output of crosstab to generate a new column with correct indexing? – Amir Jun 02 '22 at 14:00
  • It depends, as stated above we would need a clear minimal reproducible example to be sure. – mozway Jun 02 '22 at 14:02
  • @mozway, updated the question with the expected output. – Amir Jun 02 '22 at 14:06
  • no, this is not a minimal reproducible example, this is an incomplete description. You need to provide a clear DataFrame object as **input** and the clear **complete** matching output. Please read the [guidelines](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – mozway Jun 02 '22 at 14:07

1 Answers1

1

Why don't you use the good old itertools:

from itertools import product   
cats = list(product(df['A'].unique(), df['B'].unique()))

# merge with this
pd.DataFrame(cats, columns=['A','B']).assign(C=range(len(cats)))

Output for the example data:

    A      B   C
0   0    Yes   0
1   0     No   1
2   0  Maybe   2
3   0    NaN   3
4   8    Yes   4
5   8     No   5
6   8  Maybe   6
7   8    NaN   7
8   2    Yes   8
9   2     No   9
10  2  Maybe  10
11  2    NaN  11
12  4    Yes  12
13  4     No  13
14  4  Maybe  14
15  4    NaN  15
16  6    Yes  16
17  6     No  17
18  6  Maybe  18
19  6    NaN  19
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Thanks Quang Hoang, it gives me the third column with 24 classes, but how do I generate the column C for all my data? Also, is it guaranteed to generate a unique mapping every time I use the product? – Amir Jun 02 '22 at 14:10
  • 1
    See my commnet, `df.merge(that_output, on=['A','B'])`. Yes, `C` is unique mapping because the code has `df['A'].unique()`, not because of `product`. – Quang Hoang Jun 02 '22 at 14:11
  • The `df.merge(that_output, on=['A','B'])` throws an error for `KeyError: 'A'` – Amir Jun 02 '22 at 14:19
  • 1
    `A` and `B` in the merge should be the same with the `A` and `B` in your question? – Quang Hoang Jun 02 '22 at 14:20
  • All good now. I only see my rows reshuffled with the new order of column['C']. Can we avoid this? – Amir Jun 02 '22 at 14:31
  • 1
    should not be happenning. You may have done `that_output.merge(df...`. – Quang Hoang Jun 02 '22 at 14:35
  • no I followed the exact script provided. The rows are reordered with the Column['C'], starting with all rows with 0 to all row with 23 – Amir Jun 02 '22 at 14:56
  • The other thing is, this method will create the product of the classes, no matter if the combination of ['A'] and ['B'] was available in the current DataFrame right? Say, there was no row with 3, No, but still we have a class 8 generated for `(3,No)=8`? – Amir Jun 02 '22 at 15:05