0

I have two dataframes with thousands of rows, I need to combine both into one dataframe without duplicate and reversion. for example:

Dataframe 1

drug1
drug2
drug3

Dataframe 2

disease1
disease2
disease3

So, the output dataframe will be:

output-dataframe

drug1 disease1
drug1 disease2
drug1 disease3
drug2 disease1
drug2 disease2
drug2 disease3 
drug3 disease1
drug3 disease2
drug3 disease3

I don't want the output combination containing something like:

disease1 drug1
drug1 drug1
disease1 disease1 

I actually try it using pd.merge but it return duplicate and reversion and also took long time because I have thousands in Dataframes 1 and 2

Any help please ?

LamaMo
  • 576
  • 2
  • 8
  • 19
  • This is the [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) and I'm sure is available in `itertools` – Emil Aug 07 '18 at 20:25
  • No, it's ok I want drug2 disease2 and drug3 disease3 @ScottBoston – LamaMo Aug 07 '18 at 20:34
  • I think what @ScottBoston is getting at is that you have `drug1 disease1` in your "undesired" output. – Emil Aug 07 '18 at 20:39
  • 1
    ooh sorry I mixed, I mean if I have one combination I don't want the opposite @ScottBoston – LamaMo Aug 07 '18 at 20:42
  • 2
    Possible duplicate of [cartesian product in pandas](https://stackoverflow.com/questions/13269890/cartesian-product-in-pandas) – Ben.T Aug 07 '18 at 20:43
  • If this is Spark 2.0+, why are most of these answers not using SQL? This seems fairly straightforward with just a simple `full outer join`, and I know you have to likely generate a field to join on, but I don't understand why users typically go down to the APIs or even use Pandas... – simplycoding Aug 07 '18 at 21:41
  • It's not Spark or SQL, as stated in the topic it's in python @simplycoding – LamaMo Aug 08 '18 at 06:04
  • Ah woops, not sure why I thought this was tagged with Spark – simplycoding Aug 08 '18 at 20:49

3 Answers3

2

One way purely in pandas would be to create a MultiIndex from product, then convert it to a dataframe:

>>> df1
       0
0  drug1
1  drug2
2  drug3
>>> df2
          0
0  disease1
1  disease2
2  disease3

df3 = (pd.MultiIndex.from_product([df1[0],df2[0]])
       .to_frame()
       .reset_index(drop=True))

>>> df3
       0         1
0  drug1  disease1
1  drug1  disease2
2  drug1  disease3
3  drug2  disease1
4  drug2  disease2
5  drug2  disease3
6  drug3  disease1
7  drug3  disease2
8  drug3  disease3
sacuL
  • 49,704
  • 8
  • 81
  • 106
1

Setup

df1 = pd.DataFrame(dict(col1=[f"drug{i}" for i in range(1, 4)]))
df2 = pd.DataFrame(dict(col2=[f"disease{i}" for i in range(1, 4)]))

merge on assigned column

df1.assign(A=1).merge(df2.assign(A=1)).drop('A', 1)

    col1      col2
0  drug1  disease1
1  drug1  disease2
2  drug1  disease3
3  drug2  disease1
4  drug2  disease2
5  drug2  disease3
6  drug3  disease1
7  drug3  disease2
8  drug3  disease3

comprehension

pd.DataFrame([
    (i, j) for i in df1.col1
           for j in df2.col2
], columns=['col1', 'col2'])

pandas.concat

generalized for the cross product of any two dataframes

i = df1.index.repeat(len(df2))
j = np.tile(df2.index, len(df1))

pd.concat([
    df1.loc[i].reset_index(drop=True),
    df2.loc[j].reset_index(drop=True)
], sort=True, axis=1)
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • In setup, my drugs and diseases are not numbered like that, they are any name – LamaMo Aug 07 '18 at 20:33
  • What do you mean they aren't numbered like that? It seems the same as your input. – piRSquared Aug 07 '18 at 20:37
  • it just an example, so in reality drugs and diseases are are just like CID00757 DOID_3762 @piRSquared – LamaMo Aug 07 '18 at 20:45
  • Ok. The solution should generalize to whatever your values are. I used the example you gave. What example should I have used? – piRSquared Aug 07 '18 at 20:46
  • I mean your solution seems it hard coded, how can I apply it for any names for drugs and diseases ? – LamaMo Aug 07 '18 at 20:48
  • I use the **Setup** section so that anyone reading can reproduce my results. I don't expect that you (the OP) needs to use it. You should be able to skip right to the solution. However, my solution assumes the column names `'col1'` and `'col2'`. You could replace those with your own column names since you did not provide them. That said, my third solution should work on any dataframes given that their resprective indices are unique. – piRSquared Aug 07 '18 at 20:54
1

Try this solution:

from pandas import DataFrame, merge

df1['key'] = 1
df2['key'] = 1

result = df1.merge(df2, on='key').drop('key', axis=1)
Lev Zakharov
  • 2,409
  • 1
  • 10
  • 24