Right now, if I multiply a list, e.g. x = [1,2,3] * 2, I get x as [1,2,3,1,2,3]. But this doesn't work with pandas.

So if I want to duplicate a pandas DataFrame, I have to turn a column into a list and multiply it:

col_x_duplicates = list(df['col_x']) * N

new_df = pd.DataFrame(col_x_duplicates, columns=['col_x'])

Then do a join on the original data:

pd.merge(new_df, df, on='col_x', how='left')

This duplicates the pandas DataFrame N times. Is there an easier way, or even a quicker one?

redrubia

2 Answers

Actually, since you want to duplicate the entire dataframe (and not each element), numpy.tile() may be better:

In [69]: import numpy as np
    ...: import pandas as pd

In [70]: arr = np.array([[1, 2, 3], [4, 5, 6]])

In [71]: arr
Out[71]: 
array([[1, 2, 3],
       [4, 5, 6]])

In [72]: df = pd.DataFrame(np.tile(arr, (5, 1)))

In [73]: df
Out[73]: 
   0  1  2
0  1  2  3
1  4  5  6
2  1  2  3
3  4  5  6
4  1  2  3
5  4  5  6
6  1  2  3
7  4  5  6
8  1  2  3
9  4  5  6

[10 rows x 3 columns]

In [75]: df = pd.DataFrame(np.tile(arr, (1, 3)))

In [76]: df
Out[76]: 
   0  1  2  3  4  5  6  7  8
0  1  2  3  1  2  3  1  2  3
1  4  5  6  4  5  6  4  5  6

[2 rows x 9 columns]
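
To apply the same idea to an existing DataFrame rather than a raw array, one option is to tile the underlying values and rebuild the frame with the original column names. The following is a minimal sketch, assuming all columns share one dtype (with mixed dtypes, `to_numpy()` falls back to `object`). The helper name `tile_rows` is hypothetical.

```python
import numpy as np
import pandas as pd


def tile_rows(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return a new DataFrame with the rows of df repeated n times, in order.

    Hypothetical helper: the result gets a fresh 0..len-1 index, so the
    original index labels are dropped.
    """
    values = np.tile(df.to_numpy(), (n, 1))  # stack n copies vertically
    return pd.DataFrame(values, columns=df.columns)


df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])
tiled = tile_rows(df, 3)
print(tiled.shape)  # (6, 3)
```

Passing `(1, n)` instead of `(n, 1)` would repeat the columns rather than the rows, in which case the original column names no longer fit the result.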
capitalistcuttle
  • Thanks, this is great! Shame it seems so slow when running it on a large pandas df! – redrubia Jan 28 '14 at 19:10
  • Do you know if there's a quicker way? – redrubia Jan 28 '14 at 19:21
  • @redrubia Are you calling tile() several times? It may be slow because you're allocating additional memory each time. If you know the final size (after all duplication), you could try initializing a zeros numpy array of that size, and then fill it in using slicing. – capitalistcuttle Jan 28 '14 at 20:06
  • @redrubia Or, if you don't need to modify the duplicated data, see if you can refactor your code so you're saving the indices somewhere and just accessing the same dataframe repeatedly, instead of creating a new tiled dataframe. That way you don't pay the cost of allocating more memory. This is another way of doing the same thing: http://stackoverflow.com/questions/5564098/repeat-numpy-array-without-replicating-data – capitalistcuttle Jan 28 '14 at 20:11
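
The two suggestions in the comments above can be sketched roughly as follows. This is an illustration only: the function name `tile_preallocated` and the sample sizes are assumptions, and whether preallocation is actually faster than a single `np.tile` call is not established here.

```python
import numpy as np
import pandas as pd


def tile_preallocated(arr: np.ndarray, n: int) -> np.ndarray:
    """Repeat the rows of a 2-D array n times using one up-front allocation."""
    rows, cols = arr.shape
    out = np.zeros((rows * n, cols), dtype=arr.dtype)  # final size known in advance
    for i in range(n):
        out[i * rows:(i + 1) * rows] = arr  # fill one block per copy
    return out


arr = np.array([[1, 2, 3], [4, 5, 6]])
big = tile_preallocated(arr, 5)

# Alternative when the copies are only read: keep one DataFrame and
# repeat positional indices instead of the data itself.
df = pd.DataFrame(arr)
positions = np.tile(np.arange(len(df)), 5)  # 0, 1, 0, 1, ...
row_7 = df.iloc[positions[7]]               # looks up row 1 of the original
```

The index-based variant only stores one small integer per repeated row, so the original data is never copied.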

Here is a one-liner to make a DataFrame with n copies of the DataFrame df:

n_df = pd.concat([df] * n)

Example:

df = pd.DataFrame(
    data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']], 
    columns=['id', 'temp', 'name'], 
    index=pd.Index([1, 2, 3], name='row')
)
n = 4
n_df = pd.concat([df] * n)

Then n_df is the following DataFrame:

    id  temp    name
row         
1   34  null    mark
2   22  null    mark
3   34  null    mark
1   34  null    mark
2   22  null    mark
3   34  null    mark
1   34  null    mark
2   22  null    mark
3   34  null    mark
1   34  null    mark
2   22  null    mark
3   34  null    mark
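
The result above keeps the original index labels, so each label appears n times. If a fresh 0..len-1 index is preferred, `pd.concat` accepts `ignore_index=True`. A short sketch reusing the example data (the variable names `repeated_labels` and `fresh_labels` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']],
    columns=['id', 'temp', 'name'],
    index=pd.Index([1, 2, 3], name='row'),
)
n = 4

repeated_labels = pd.concat([df] * n)                  # index: 1, 2, 3, 1, 2, 3, ...
fresh_labels = pd.concat([df] * n, ignore_index=True)  # index: 0, 1, ..., 11
```

Unlike tiling the raw values, `pd.concat` keeps each column's dtype, which matters for frames that mix numbers and strings as this one does.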
Dr Fabio Gori
  • Please note that this answer leads to a different (i.e. repeating) index labeling than the accepted answer. This may or may not be what you want, depending on your use case. I don't think the OP has expressed any preference with regard to the index labels. – ckrk Feb 05 '22 at 11:55
  • Just to note that the `concat` method is way slower (some ~100x) than the `np.tile` method. – Matti Jan 12 '23 at 08:37
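
The relative speed of the two approaches depends on the data size, dtypes, and library versions, so it is worth measuring on your own data. Below is a rough timing sketch using `timeit`. The test frame (1,000 rows by 3 integer columns, repeated 50 times) and the function names are assumptions chosen for illustration, and no particular ratio is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 1,000 rows x 3 integer columns, repeated 50 times.
df = pd.DataFrame(np.arange(3000).reshape(1000, 3), columns=["a", "b", "c"])
n = 50


def with_concat():
    return pd.concat([df] * n, ignore_index=True)


def with_tile():
    return pd.DataFrame(np.tile(df.to_numpy(), (n, 1)), columns=df.columns)


# Check that both approaches produce the same values before timing them.
assert np.array_equal(with_concat().to_numpy(), with_tile().to_numpy())

print("concat:", timeit.timeit(with_concat, number=20))
print("tile:  ", timeit.timeit(with_tile, number=20))
```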