6

I have a pandas.DataFrame as follows:

df1 = 
    a    b
0   1    2
1   3    4

I'd like to repeat each row three times to get:

df2 =
    a    b
0   1    2
0   1    2
0   1    2
1   3    4
1   3    4
1   3    4

I currently build df2 with a loop, but that is not efficient.

How can I get df2 from df1 with a faster, vectorized approach?

Stephen Rauch
李博洋
  • *"one by one"* doesn't say whether you mean by row or by column. You want to duplicate each **row** n times. – smci Dec 08 '19 at 01:08

5 Answers

5

Build a one-dimensional indexer and use it to slice both the values array and the index. You must handle the index as well to get your desired result.

  • use np.repeat on an np.arange to get the indexer
  • construct a new dataframe using this indexer on both values and the index

# positional indexer: [0, 0, 0, 1, 1, 1]
r = np.arange(len(df)).repeat(3)
# apply the same indexer to the values and to the index
pd.DataFrame(df.values[r], df.index[r], df.columns)

   a  b
0  1  2
0  1  2
0  1  2
1  3  4
1  3  4
1  3  4
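
For anyone who wants to run the two steps above end to end, here is a minimal self-contained sketch. The sample frame is the one from the question, and the helper name `repeat_rows` is only a label I made up for illustration, not a pandas function.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame([[1, 2], [3, 4]], columns=list("ab"))

def repeat_rows(frame, n):
    """Repeat each row of `frame` n times, keeping the original index labels."""
    # Positional indexer: [0, 0, 0, 1, 1, 1] for two rows and n=3.
    r = np.arange(len(frame)).repeat(n)
    # Index both the values and the index with the same positions.
    return pd.DataFrame(frame.values[r], index=frame.index[r], columns=frame.columns)

df2 = repeat_rows(df, 3)
print(df2)
```

One caveat: `frame.values` converts the data to a single common dtype, so a frame with mixed column types would come back with object columns.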
piRSquared
3

You can use np.repeat

df = pd.DataFrame(np.repeat(df.values, [3, 3], axis=0), columns=df.columns)

You get

    a   b
0   1   2
1   1   2
2   1   2
3   3   4
4   3   4
5   3   4

Time testing:

%timeit pd.DataFrame(np.repeat(df.values,[3,3], axis = 0))
1000 loops, best of 3: 235 µs per loop

%timeit pd.concat([df] * 3).sort_index()
best of 3: 1.26 ms per loop

NumPy is roughly five times faster in this test, which is no surprise.

EDIT: I am not sure whether you want the index values repeated as well, but in case you do:

pd.DataFrame(np.repeat(df.values, 3, axis=0), index=np.repeat(df.index, 3), columns=df.columns)
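
The `[3,3]` form earlier in this answer passes one repeat count per row, while the scalar `3` applies the same count to every row. A small sketch of the difference, using the question's frame and a made-up `counts` list of uneven repeats as an assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("ab"))

# A scalar count repeats every row the same number of times.
same = np.repeat(df.values, 3, axis=0)

# A list gives one count per row; here row 0 twice and row 1 three times.
counts = [2, 3]  # hypothetical counts, chosen only for illustration
uneven = pd.DataFrame(
    np.repeat(df.values, counts, axis=0),
    index=np.repeat(df.index.to_numpy(), counts),
    columns=df.columns,
)
print(uneven)
```

The list must have exactly one entry per row, otherwise `np.repeat` raises an error.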
Vaishali
2

I do not know if it is more efficient than your loop, but it is easy enough to construct:

Code:

pd.concat([df] * 3).sort_index()

Test Code:

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
print(pd.concat([df] * 3).sort_index())

Results:

   a  b
0  1  2
0  1  2
0  1  2
1  3  4
1  3  4
1  3  4
Stephen Rauch
2

You can use numpy.repeat with a scalar repeat count of 3, then pass the columns parameter to the DataFrame constructor:

df = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print(df)
   a  b
0  1  2
1  1  2
2  1  2
3  3  4
4  3  4
5  3  4

If you really want a duplicated index (which can complicate some pandas functions, such as reindex, which fails on duplicate labels):

r = np.repeat(np.arange(len(df.index)), 3)
df = pd.DataFrame(df.values[r], df.index[r], df.columns)
print(df)
   a  b
0  1  2
0  1  2
0  1  2
1  3  4
1  3  4
1  3  4
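
To illustrate the reindex problem mentioned above, here is a small sketch under the assumption that you start from the question's two-row frame. The names `dup` and `unique` are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("ab"))
r = np.repeat(np.arange(len(df.index)), 3)
dup = pd.DataFrame(df.values[r], df.index[r], df.columns)

# Reindexing a frame whose index has duplicate labels raises ValueError.
try:
    dup.reindex([0, 1, 2])
    reindex_failed = False
except ValueError:
    reindex_failed = True
print(reindex_failed)

# Dropping the duplicated labels restores a unique 0..n-1 index.
unique = dup.reset_index(drop=True)
print(unique.index.is_unique)
```

If you need the repeated labels only temporarily, `reset_index(drop=True)` gets you back to the first variant of this answer.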
jezrael
0

Not the fastest (not the slowest either) but the shortest solution so far.

# Build an index array and use it to extract the rows for the new df. This handles index and data all at once.
# Note: the repeated index values are used as positions here, so this relies on the default 0..n-1 index.
df.iloc[np.repeat(df.index,3)]

Out[270]:
   a  b
0  1  2
0  1  2
0  1  2
1  3  4
1  3  4
1  3  4
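
Since `iloc` is positional, the one-liner above depends on the index being the default 0..n-1 range. A hedged sketch of two variants that should also work with other labels, using a hypothetical string-labelled version of the question's frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string labels instead of the default 0..n-1 index.
df = pd.DataFrame([[1, 2], [3, 4]], columns=list("ab"), index=["x", "y"])

# Positional: repeat row positions, independent of the labels.
by_position = df.iloc[np.repeat(np.arange(len(df)), 3)]

# Label-based: repeat the labels themselves (assumes the labels are unique).
by_label = df.loc[df.index.repeat(3)]

print(by_position.equals(by_label))
```

Both keep the column dtypes, because neither goes through `df.values`.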
Allen Qin