Combination of two dataframes without duplicate and reversion in efficient way | python

Question

I have two dataframes with thousands of rows, I need to combine both into one dataframe without duplicate and reversion. for example:

Dataframe 1

drug1
drug2
drug3

Dataframe 2

disease1
disease2
disease3

So, the output dataframe will be:

output-dataframe

drug1 disease1
drug1 disease2
drug1 disease3
drug2 disease1
drug2 disease2
drug2 disease3 
drug3 disease1
drug3 disease2
drug3 disease3

I don't want the output combination containing something like:

disease1 drug1
drug1 drug1
disease1 disease1

I actually try it using pd.merge but it return duplicate and reversion and also took long time because I have thousands in Dataframes 1 and 2

Any help please ?

This is the [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) and I'm sure is available in `itertools` — Emil, Aug 07 '18 at 20:25
No, it's ok I want drug2 disease2 and drug3 disease3 @ScottBoston — LamaMo, Aug 07 '18 at 20:34
I think what @ScottBoston is getting at is that you have `drug1 disease1` in your "undesired" output. — Emil, Aug 07 '18 at 20:39
ooh sorry I mixed, I mean if I have one combination I don't want the opposite @ScottBoston — LamaMo, Aug 07 '18 at 20:42
Possible duplicate of [cartesian product in pandas](https://stackoverflow.com/questions/13269890/cartesian-product-in-pandas) — Ben.T, Aug 07 '18 at 20:43
If this is Spark 2.0+, why are most of these answers not using SQL? This seems fairly straightforward with just a simple `full outer join`, and I know you have to likely generate a field to join on, but I don't understand why users typically go down to the APIs or even use Pandas... — simplycoding, Aug 07 '18 at 21:41
It's not Spark or SQL, as stated in the topic it's in python @simplycoding — LamaMo, Aug 08 '18 at 06:04

score 2 · Answer 1 · answered Aug 07 '18 at 20:26

One way purely in pandas would be to create a MultiIndex from product, then convert it to a dataframe:

>>> df1
       0
0  drug1
1  drug2
2  drug3
>>> df2
          0
0  disease1
1  disease2
2  disease3

df3 = (pd.MultiIndex.from_product([df1[0],df2[0]])
       .to_frame()
       .reset_index(drop=True))

>>> df3
       0         1
0  drug1  disease1
1  drug1  disease2
2  drug1  disease3
3  drug2  disease1
4  drug2  disease2
5  drug2  disease3
6  drug3  disease1
7  drug3  disease2
8  drug3  disease3

piRSquared · Answer 2 · 2018-08-07T20:37:15.937

1

Setup

df1 = pd.DataFrame(dict(col1=[f"drug{i}" for i in range(1, 4)]))
df2 = pd.DataFrame(dict(col2=[f"disease{i}" for i in range(1, 4)]))

`merge` on assigned column

df1.assign(A=1).merge(df2.assign(A=1)).drop('A', 1)

    col1      col2
0  drug1  disease1
1  drug1  disease2
2  drug1  disease3
3  drug2  disease1
4  drug2  disease2
5  drug2  disease3
6  drug3  disease1
7  drug3  disease2
8  drug3  disease3

comprehension

pd.DataFrame([
    (i, j) for i in df1.col1
           for j in df2.col2
], columns=['col1', 'col2'])

`pandas.concat`

generalized for the cross product of any two dataframes

i = df1.index.repeat(len(df2))
j = np.tile(df2.index, len(df1))

pd.concat([
    df1.loc[i].reset_index(drop=True),
    df2.loc[j].reset_index(drop=True)
], sort=True, axis=1)

edited Aug 07 '18 at 20:37

answered Aug 07 '18 at 20:28

piRSquared

285,575
57
475
624

In setup, my drugs and diseases are not numbered like that, they are any name – LamaMo Aug 07 '18 at 20:33
What do you mean they aren't numbered like that? It seems the same as your input. – piRSquared Aug 07 '18 at 20:37
it just an example, so in reality drugs and diseases are are just like CID00757 DOID_3762 @piRSquared – LamaMo Aug 07 '18 at 20:45
Ok. The solution should generalize to whatever your values are. I used the example you gave. What example should I have used? – piRSquared Aug 07 '18 at 20:46
I mean your solution seems it hard coded, how can I apply it for any names for drugs and diseases ? – LamaMo Aug 07 '18 at 20:48
I use the **Setup** section so that anyone reading can reproduce my results. I don't expect that you (the OP) needs to use it. You should be able to skip right to the solution. However, my solution assumes the column names `'col1'` and `'col2'`. You could replace those with your own column names since you did not provide them. That said, my third solution should work on any dataframes given that their resprective indices are unique. – piRSquared Aug 07 '18 at 20:54

score 1 · Accepted Answer · answered Aug 07 '18 at 20:28

1

Try this solution:

from pandas import DataFrame, merge

df1['key'] = 1
df2['key'] = 1

result = df1.merge(df2, on='key').drop('key', axis=1)

answered Aug 07 '18 at 20:28

Lev Zakharov

2,409
1
10
24

This is what I tried before and return duplicate and reversion :( – LamaMo Aug 07 '18 at 20:30
But it shouldn't. Do you have duplicates in data frames? – Lev Zakharov Aug 07 '18 at 20:32
No, I don't but this took long time also @Lev Zakharov – LamaMo Aug 07 '18 at 20:36

Combination of two dataframes without duplicate and reversion in efficient way | python

3 Answers3

Setup

merge on assigned column

comprehension

pandas.concat

`merge` on assigned column

`pandas.concat`