Join unique values into new data frame (python, pandas)

Question

I have two DataFrames, from which I extract the unique values of a column into a and b:

a = df1.col1.unique()
b = df2.col2.unique()

Now a and b look something like this:

['a','b','c','d'] #a
[1,2,3] #b

Both are of type numpy.ndarray.

I want to combine them into a DataFrame like this:

   col1  col2
0    a     1
1    a     2
2    a     3
3    b     1
4    b     2
5    b     3
6    c     1
   . . .

Is there a way to do this without using a loop?

@RoseAlejandra - No, I'm asking if a list comprehension is acceptable in order to create the DataFrame. You say without a 'for loop', which list comprehensions use implicitly, but not explicitly. — Akshat Mahajan, Apr 20 '16 at 19:58
@RosaAlejandra, please pay attention at B. M.'s solution - it's __much__ faster — MaxU - stand with Ukraine, Apr 20 '16 at 20:33

score 1 · Answer 1 · answered Apr 20 '16 at 20:19

With NumPy tools:

pd.DataFrame({'col1':np.repeat(a,b.size),'col2':np.tile(b,a.size)})
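
For reference, a self-contained sketch of the same idea. It assumes the inputs are NumPy arrays like those in the question; the sample values below are copied from the question rather than taken from real data. `np.repeat` repeats each element of `a` consecutively, while `np.tile` repeats the whole of `b`, so the two columns line up as every pairing.

```python
import numpy as np
import pandas as pd

# Sample inputs matching the question; in practice these come from .unique()
a = np.array(['a', 'b', 'c', 'd'])
b = np.array([1, 2, 3])

# np.repeat(a, 3) -> a a a b b b ...  (each element repeated b.size times)
# np.tile(b, 4)   -> 1 2 3 1 2 3 ...  (whole array repeated a.size times)
result = pd.DataFrame({
    'col1': np.repeat(a, b.size),
    'col2': np.tile(b, a.size),
})
```

The result has `a.size * b.size` rows, here 12.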

B. M.

MaxU - stand with Ukraine · Accepted Answer · 2016-04-20T20:27:35.707

UPDATE:

B. M.'s solution using NumPy is much faster; I would recommend using his approach:

In [88]: %timeit pd.DataFrame({'col1':np.repeat(aa,bb.size),'col2':np.tile(bb,aa.size)})
10 loops, best of 3: 25.4 ms per loop

In [89]: %timeit pd.DataFrame(list(product(aa,bb)), columns=['col1', 'col2'])
1 loop, best of 3: 1.28 s per loop

In [90]: aa.size
Out[90]: 1000

In [91]: bb.size
Out[91]: 1000
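
The comparison above can be reproduced outside IPython with the standard `timeit` module. This is only a sketch: the answer does not show how `aa` and `bb` were built, so the integer arrays below are hypothetical stand-ins, and they are smaller than 1000 x 1000 so the script finishes quickly. Absolute timings will vary with machine and library versions.

```python
import timeit
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical inputs (not from the answer), kept small so this runs quickly
aa = np.arange(200)
bb = np.arange(200)

def with_numpy():
    # Build both columns as whole arrays with NumPy
    return pd.DataFrame({'col1': np.repeat(aa, bb.size),
                         'col2': np.tile(bb, aa.size)})

def with_product():
    # Build a Python list of tuples first, then hand it to pandas
    return pd.DataFrame(list(product(aa, bb)), columns=['col1', 'col2'])

# Time a few runs of each approach
t_numpy = timeit.timeit(with_numpy, number=5)
t_product = timeit.timeit(with_product, number=5)
print(f"numpy: {t_numpy:.4f} s, product: {t_product:.4f} s")
```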

Try itertools.product (imported with `from itertools import product`):

In [56]: a
Out[56]:
array(['a', 'b', 'c', 'd'],
      dtype='<U1')

In [57]: b
Out[57]: array([1, 2, 3])

In [63]: pd.DataFrame(list(product(a,b)), columns=['col1', 'col2'])
Out[63]:
   col1  col2
0     a     1
1     a     2
2     a     3
3     b     1
4     b     2
5     b     3
6     c     1
7     c     2
8     c     3
9     d     1
10    d     2
11    d     3

Since `itertools.product(a,b)` returns an iterator consisting of tuples, I suspect the additional list comprehension to `[[x[0],x[1]]` was unnecessary. — Akshat Mahajan, Apr 20 '16 at 20:00

score 0 · Answer 3 · edited May 23 '17 at 12:07

You can't do this without a loop running somewhere. The best you can do is hide the loop inside library code, or use a lazy iterator so that the pairs are produced on demand instead of being stored up front.

itertools provides efficient functions for this task that return lazy iterators:

from itertools import product

import pandas

products = product(['a','b','c','d'], [1,2,3])

col1_items, col2_items = zip(*products)

result = pandas.DataFrame({'col1':col1_items, 'col2': col2_items})

itertools.product creates the Cartesian product of two iterables, yielding one tuple per pair. zip(*products) then regroups those pairs into two separate tuples, one per column.
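
To make the `zip(*...)` step concrete, here is a small standard-library sketch with shortened inputs (the two-letter list is chosen only to keep the output readable):

```python
from itertools import product

pairs = list(product(['a', 'b'], [1, 2, 3]))
# pairs == [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)]

# zip(*pairs) groups the first items together and the second items together
col1_items, col2_items = zip(*pairs)
# col1_items == ('a', 'a', 'a', 'b', 'b', 'b')
# col2_items == (1, 2, 3, 1, 2, 3)
```

Note that unpacking with `*` consumes the whole iterator, so all pairs are held in memory at that point.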

score 0 · Answer 4 · answered Apr 20 '16 at 20:08

You can do this with pandas merge and it will be faster than itertools or a loop:

df_a = pd.DataFrame({'a': a, 'key': 1})
df_b = pd.DataFrame({'b': b, 'key': 1})
result = pd.merge(df_a, df_b, how='outer')

result:

    a  key  b
0   a    1  1
1   a    1  2
2   a    1  3
3   b    1  1
4   b    1  2
5   b    1  3
6   c    1  1
7   c    1  2
8   c    1  3
9   d    1  1
10  d    1  2
11  d    1  3

Then, if need be, you can drop the helper column:

del result['key']
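
On newer pandas versions the helper `key` column can be avoided altogether. A minimal sketch, assuming pandas 1.2 or later, where `merge` accepts `how='cross'`; the column names `col1` and `col2` and the sample values are taken from the question:

```python
import numpy as np
import pandas as pd

a = np.array(['a', 'b', 'c', 'd'])
b = np.array([1, 2, 3])

df_a = pd.DataFrame({'col1': a})
df_b = pd.DataFrame({'col2': b})

# how='cross' pairs every row of df_a with every row of df_b (pandas >= 1.2)
result = df_a.merge(df_b, how='cross')
```

On older versions the constant-key merge shown above remains the way to get the same result.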
