
I am trying to obtain a dataframe which has all the combinations of two individual columns in two different dataframes. My dataframes look like this:

>>>first_df                          >>>second_df
    id test                                id text   
 0   1  abc                              0 11  uvw
 1   2  def                              1 22  xyz 
 2   3  ghi

From this, I was able to obtain the combinations using this approach:

import itertools
import pandas as pd

df = pd.DataFrame(list(itertools.product(first_df['test'], second_df['text'])),
                  columns=['test', 'text'])
>>>df
    test text
 0  abc  uvw
 1  abc  xyz
 2  def  uvw
 3  def  xyz
 4  ghi  uvw
 5  ghi  xyz

What I am unable to understand is: how do I get the relevant id columns into my dataframe as well, so it looks like:

>>>df
    id test text kid
 0   1 abc  uvw   11
 1   1 abc  xyz   22
 2   2 def  uvw   11
 3   2 def  xyz   22
 4   3 ghi  uvw   11
 5   3 ghi  xyz   22
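The `itertools.product` approach above can also be applied to whole rows instead of single columns, which carries both id columns along automatically; a minimal sketch:

```python
import itertools
import pandas as pd

first_df = pd.DataFrame({'id': [1, 2, 3], 'test': ['abc', 'def', 'ghi']})
second_df = pd.DataFrame({'id': [11, 22], 'text': ['uvw', 'xyz']})

# Take the product of whole rows rather than single columns,
# then flatten each (row_a, row_b) pair into one record.
pairs = itertools.product(first_df.to_numpy(), second_df.to_numpy())
df = pd.DataFrame([list(a) + list(b) for a, b in pairs],
                  columns=['id', 'test', 'kid', 'text'])
```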

I tried making combinations of the id columns separately:

df1 = pd.DataFrame(list(itertools.product(first_df['id'], first_df['id'])),
                   columns=['id', 'id'])
>>>df1
   id    id
0   1     1
1   1     2
2   1     3
3   2     1
4   2     2
5   2     3
6   3     1
7   3     2
8   3     3

df2 = pd.DataFrame(list(itertools.product(second_df['id'], second_df['id'])),
                   columns=['kid', 'kid'])
>>>df2
   kid  kid
0  11   11
1  11   22
2  22   11
3  22   22

I then tried to concatenate these, which obviously failed:

df = pd.concat([df['id'], df2, df1['kid']], axis=1)

>>> df
   id test text   kid
0   1  abc  uvw  11.0
1   1  abc  xyz  22.0
2   1  def  uvw  11.0
3   2  def  xyz  22.0
4   2  ghi  uvw   NaN
5   2  ghi  xyz   NaN
6   3  NaN  NaN   NaN
7   3  NaN  NaN   NaN
8   3  NaN  NaN   NaN

I have a feeling that I can maybe solve this using the apply function of dataframes, but I just can't figure out how. Any leads would be appreciated. Thank you for reading so much :)

sleepophile

1 Answer


You can do a cross join with `merge`, using helper columns that hold the same constant in both dataframes:

first_df['tmp'] = 1
second_df['tmp'] = 1

df = pd.merge(first_df, second_df.rename(columns={'id': 'kid'}), on='tmp').drop('tmp', axis=1)
print(df)
   id test  kid text
0   1  abc   11  uvw
1   1  abc   22  xyz
2   2  def   11  uvw
3   2  def   22  xyz
4   3  ghi   11  uvw
5   3  ghi   22  xyz

A one-line solution, using `assign` to create the helper columns:

df = pd.merge(first_df.assign(tmp=1),
              second_df.assign(tmp=1).rename(columns={'id': 'kid'}), on='tmp').drop('tmp', axis=1)
print(df)
   id test  kid text
0   1  abc   11  uvw
1   1  abc   22  xyz
2   2  def   11  uvw
3   2  def   22  xyz
4   3  ghi   11  uvw
5   3  ghi   22  xyz
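For completeness: on pandas 1.2 and later, `merge` supports `how='cross'` directly, so the helper column is no longer needed:

```python
import pandas as pd

first_df = pd.DataFrame({'id': [1, 2, 3], 'test': ['abc', 'def', 'ghi']})
second_df = pd.DataFrame({'id': [11, 22], 'text': ['uvw', 'xyz']})

# A true cross join: every row of first_df paired with every row of second_df.
# Renaming second_df's 'id' avoids a duplicate column name in the result.
df = pd.merge(first_df, second_df.rename(columns={'id': 'kid'}), how='cross')
```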
jezrael
  • Ohk! wow! That means I don't have to use all the procedure I am working on right now? – sleepophile Aug 27 '17 at 13:34
  • I think no, cross join is what you need. – jezrael Aug 27 '17 at 13:35
  • Would this also work if I had multiple columns? Other than those that I have right now? – sleepophile Aug 27 '17 at 13:36
  • yes, I think it works with multiple columns also, if you need the combination of all rows from one df with all rows of the second df. – jezrael Aug 27 '17 at 13:38
  • Um yes. That is what I have been trying to achieve for the past 2 hours: the combination of one column in one dataframe with others in another dataframe. – sleepophile Aug 27 '17 at 13:39
    Beat me to the one-liner with `assign`! But OP be careful of cross join for very large datasets as you will return the [Cartesian product](https://en.m.wikipedia.org/wiki/Cartesian_product) of A x B of both data frames. – Parfait Aug 27 '17 at 13:40
  • Actually that is what I was worried about. But my join only involves 5 columns at most from both the dataframes. – sleepophile Aug 27 '17 at 13:43
  • Hmmm, columns are not the problem; what is the size of both dfs? If very large, it can be slow. – jezrael Aug 27 '17 at 13:45
  • 300k rows. This is my closest estimate for the size in the near future. Also, can I know why we are renaming the columns? The output is the same without it too. – sleepophile Aug 27 '17 at 13:46
  • in both dfs? what is the size of RAM? in my opinion, if my solution is slow, it is necessary to use an [RDBMS](https://stackoverflow.com/a/29910919/2901002) or check other alternatives for working with large data like spark, blaze, dask... – jezrael Aug 27 '17 at 13:53
  • RAM is quite huge, 128 GB. That should be able to handle it, right? I am guessing that would suffice. – sleepophile Aug 27 '17 at 14:07
  • It should be handled I think. – jezrael Aug 27 '17 at 14:08