I have two DataFrames, and I extract the unique values of one column from each into a and b:

a = df1.col1.unique()
b = df2.col2.unique()

Now a and b look something like this:

['a','b','c','d'] #a
[1,2,3] #b

Both are of type numpy.ndarray.

I want to combine them into a DataFrame containing every pairing of the two, like this:

   col1  col2
0    a     1
1    a     2
2    a     3
3    b     1
4    b     2
5    b     3
6    c     1
   . . .

Is there a way to do this without using a loop?

Rosa Alejandra

4 Answers


With NumPy tools:

pd.DataFrame({'col1':np.repeat(a,b.size),'col2':np.tile(b,a.size)})
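
A self-contained sketch of the same idea, assuming the sample values from the question stand in for df1.col1.unique() and df2.col2.unique(). np.repeat repeats each element of a consecutively, while np.tile repeats the whole of b end to end, so the two columns line up as a Cartesian product:

```python
import numpy as np
import pandas as pd

# Sample values standing in for df1.col1.unique() and df2.col2.unique()
a = np.array(['a', 'b', 'c', 'd'])
b = np.array([1, 2, 3])

# repeat: each element of a appears b.size times in a row -> a a a b b b ...
# tile:   the whole of b is repeated a.size times         -> 1 2 3 1 2 3 ...
result = pd.DataFrame({
    'col1': np.repeat(a, b.size),
    'col2': np.tile(b, a.size),
})

print(result)
```

The result has a.size * b.size rows, so memory use grows with the product of the two lengths.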
B. M.

UPDATE:

B. M.'s NumPy-based solution is much faster; I would recommend using that approach:

In [88]: %timeit pd.DataFrame({'col1':np.repeat(aa,bb.size),'col2':np.tile(bb,aa.size)})
10 loops, best of 3: 25.4 ms per loop

In [89]: %timeit pd.DataFrame(list(product(aa,bb)), columns=['col1', 'col2'])
1 loop, best of 3: 1.28 s per loop

In [90]: aa.size
Out[90]: 1000

In [91]: bb.size
Out[91]: 1000

Original answer: try itertools.product (this needs from itertools import product):

In [56]: a
Out[56]:
array(['a', 'b', 'c', 'd'],
      dtype='<U1')

In [57]: b
Out[57]: array([1, 2, 3])

In [63]: pd.DataFrame(list(product(a,b)), columns=['col1', 'col2'])
Out[63]:
   col1  col2
0     a     1
1     a     2
2     a     3
3     b     1
4     b     2
5     b     3
6     c     1
7     c     2
8     c     3
9     d     1
10    d     2
11    d     3
MaxU - stand with Ukraine
  • Since `itertools.product(a,b)` returns an iterator consisting of tuples, I suspect the additional list comprehension to `[[x[0],x[1]]` was unnecessary. – Akshat Mahajan Apr 20 '16 at 20:00

Some form of iteration is unavoidable for this task. The best you can do is avoid writing an explicit Python for loop, or use a lazy iterator so that the pairs are produced one at a time rather than held in memory all at once.

itertools provides efficient functions for this task that return lazy iterators:

from itertools import product

import pandas

products = product(['a','b','c','d'], [1,2,3])

# Transpose the (col1, col2) tuples into one tuple per column
col1_items, col2_items = zip(*products)

result = pandas.DataFrame({'col1':col1_items, 'col2': col2_items})

itertools.product yields the Cartesian product of two iterables as a sequence of tuples. zip(*products) then transposes that sequence into two separate tuples, one per column.
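
To make the transposition step concrete, here is a small sketch on a shortened input (the two-letter list is only an illustration, not the data from the question). The comments show the intermediate values:

```python
from itertools import product

import pandas as pd

# Materialise the iterator so the pairs can be inspected
pairs = list(product(['a', 'b'], [1, 2, 3]))
# pairs == [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)]

# zip(*pairs) regroups the first items and the second items of every pair
col1_items, col2_items = zip(*pairs)
# col1_items == ('a', 'a', 'a', 'b', 'b', 'b')
# col2_items == (1, 2, 3, 1, 2, 3)

result = pd.DataFrame({'col1': col1_items, 'col2': col2_items})
print(result)
```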

Akshat Mahajan

You can do this with pandas merge by joining the two frames on a constant helper column, and it will be faster than itertools or a loop:

df_a = pd.DataFrame({'a': a, 'key': 1})
df_b = pd.DataFrame({'b': b, 'key': 1})
result = pd.merge(df_a, df_b, how='outer')

result:

    a  key  b
0   a    1  1
1   a    1  2
2   a    1  3
3   b    1  1
4   b    1  2
5   b    1  3
6   c    1  1
7   c    1  2
8   c    1  3
9   d    1  1
10  d    1  2
11  d    1  3

Then, if need be, you can drop the helper column:

del result['key']
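
If your pandas version is 1.2 or later (an assumption about your environment), merge also accepts how='cross', which builds the Cartesian product directly without a helper key column. A minimal sketch using the sample values from the question:

```python
import numpy as np
import pandas as pd

# Sample values standing in for df1.col1.unique() and df2.col2.unique()
a = np.array(['a', 'b', 'c', 'd'])
b = np.array([1, 2, 3])

df_a = pd.DataFrame({'col1': a})
df_b = pd.DataFrame({'col2': b})

# how='cross' needs pandas >= 1.2; every row of df_a is paired with every row of df_b
result = pd.merge(df_a, df_b, how='cross')
print(result)
```

On older pandas versions the constant-key merge above remains the way to get the same result.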
thetainted1