Solutions:
Version 1:
Try using np.repeat
:
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
print(newdf)
The above code will output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
np.repeat
repeats the values of df
, 3
times.
Then we add the columns with assigning new_df.columns = df.columns
.
Version 2:
You could also assign the column names in the first line, like below:
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Version 3:
You could shorten it with loc
and only repeat the index, like below:
newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
I use reset_index
to replace the index with monotonic indexes (0, 1, 2, 3, 4...
).
Version 4:
You could use the built-in pd.Index.repeat
function, like the below:
newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Remember to add reset_index
to line-up the index
.
Version 5:
Or by using concat
with sort_index
, like below:
newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Version 6:
You could also use loc
with Python list
multiplication and sorted
, like below:
newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
Timings:
Timing with the following code:
import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame({'Person': {0: 12345, 1: 32917, 2: 18273}, 'ID': {0: 882, 1: 271, 2: 552}, 'ZipCode': {0: 38182, 1: 88172, 2: 90291}, 'Gender': {0: 'Female', 1: 'Male', 2: 'Female'}})
def version1():
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
def version2():
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
def version3():
newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)
def version4():
newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)
def version5():
newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)
def version6():
newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)
print('Version 1 Speed:', timeit.timeit('version1()', 'from __main__ import version1', number=20000))
print('Version 2 Speed:', timeit.timeit('version2()', 'from __main__ import version2', number=20000))
print('Version 3 Speed:', timeit.timeit('version3()', 'from __main__ import version3', number=20000))
print('Version 4 Speed:', timeit.timeit('version4()', 'from __main__ import version4', number=20000))
print('Version 5 Speed:', timeit.timeit('version5()', 'from __main__ import version5', number=20000))
print('Version 6 Speed:', timeit.timeit('version6()', 'from __main__ import version6', number=20000))
Output:
Version 1 Speed: 9.879425965991686
Version 2 Speed: 7.752138633004506
Version 3 Speed: 7.078321029010112
Version 4 Speed: 8.01169377300539
Version 5 Speed: 19.853051771002356
Version 6 Speed: 9.801617017001263
We can see that Versions 2 & 3 are faster than the others, the reason for this is because they both use the np.repeat
function, and numpy
functions are very fast because they are implemented with C.
Version 3 wins against Version 2 marginally due to the usage of loc
instead of DataFrame
.
Version 5 is significantly slower because of the functions concat
and sort_index
, since concat
copies DataFrame
s quadratically, which takes longer time.
Fastest Version: Version 3.