How can I replicate rows of a Pandas DataFrame?

Question

My pandas dataframe looks like this:

   Person  ID   ZipCode   Gender
0  12345   882  38182     Female
1  32917   271  88172     Male
2  18273   552  90291     Female

I want to replicate every row 3 times and reset the index to get:

   Person  ID   ZipCode   Gender
0  12345   882  38182     Female
1  12345   882  38182     Female
2  12345   882  38182     Female
3  32917   271  88172     Male
4  32917   271  88172     Male
5  32917   271  88172     Male
6  18273   552  90291     Female
7  18273   552  90291     Female
8  18273   552  90291     Female

I tried solutions such as:

pd.concat([df[:5]]*3, ignore_index=True)

And:

df.reindex(np.repeat(df.index.values, df['ID']), method='ffill')

But none of them worked.

I think the index is auto generated. No way to change that unless you make it a field of your dataframe. Anyway it's an index. Got to be unique. — J...S, Jun 10 '18 at 22:37
`pd.concat([df[:5]]*3, ignore_index=True)` is working for me, can you show your `df.index`? If there's something up with your index, the solutions below might not work. — pyeR_biz, Jun 10 '18 at 22:42
Sorry, I'll clarify: `pd.concat([df[:5]]*3, ignore_index=True)` works, but it adds the rows to the end of the dataframe instead of having 3 duplicate lines one after the other. — DasVisual, Jun 10 '18 at 22:48

score 117 · Accepted Answer · edited Feb 24 '23 at 21:13

Solutions:

Use `np.repeat`:

Version 1:

Try using np.repeat:

newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
print(newdf)

The above code will output:

  Person   ID ZipCode  Gender
0  12345  882   38182  Female
1  12345  882   38182  Female
2  12345  882   38182  Female
3  32917  271   88172    Male
4  32917  271   88172    Male
5  32917  271   88172    Male
6  18273  552   90291  Female
7  18273  552   90291  Female
8  18273  552   90291  Female

np.repeat with axis=0 repeats each row of df.values 3 times.

Then we restore the column names by assigning newdf.columns = df.columns.

Version 2:

You could also assign the column names in the first line, like below:

newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0  12345  882   38182  Female
1  12345  882   38182  Female
2  12345  882   38182  Female
3  32917  271   88172    Male
4  32917  271   88172    Male
5  32917  271   88172    Male
6  18273  552   90291  Female
7  18273  552   90291  Female
8  18273  552   90291  Female
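One side effect of Versions 1 and 2 is worth checking before using them: df.values is a single NumPy array, so the integer columns come back with the object dtype. The following is a minimal sketch of that effect and of one possible way to cast the columns back; the `restored` name and the astype step are illustrative additions, not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Person": [12345, 32917, 18273],
    "ID": [882, 271, 552],
    "ZipCode": [38182, 88172, 90291],
    "Gender": ["Female", "Male", "Female"],
})

# df.values is one NumPy array shared by all columns, so the integer
# columns are stored as generic Python objects after the round trip.
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print(newdf.dtypes)

# Illustrative fix: cast every column back to the dtype it had in df.
restored = newdf.astype(df.dtypes.to_dict())
print(restored.dtypes)
```

The loc-based versions further down index the original frame directly, so they should not need this extra cast.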

Version 3:

You could shorten it with loc and only repeat the index, like below:

newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0  12345  882   38182  Female
1  12345  882   38182  Female
2  12345  882   38182  Female
3  32917  271   88172    Male
4  32917  271   88172    Male
5  32917  271   88172    Male
6  18273  552   90291  Female
7  18273  552   90291  Female
8  18273  552   90291  Female

I use reset_index(drop=True) to replace the repeated index labels with a fresh monotonic index (0, 1, 2, 3, 4...).

Without `np.repeat`:

Version 4:

You could use the built-in pd.Index.repeat method, like below:

newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0  12345  882   38182  Female
1  12345  882   38182  Female
2  12345  882   38182  Female
3  32917  271   88172    Male
4  32917  271   88172    Male
5  32917  271   88172    Male
6  18273  552   90291  Female
7  18273  552   90291  Female
8  18273  552   90291  Female

Remember to add reset_index to line-up the index.

Version 5:

Or by using concat with sort_index, like below:

newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0  12345  882   38182  Female
1  12345  882   38182  Female
2  12345  882   38182  Female
3  32917  271   88172    Male
4  32917  271   88172    Male
5  32917  271   88172    Male
6  18273  552   90291  Female
7  18273  552   90291  Female
8  18273  552   90291  Female

Version 6:

You could also use loc with Python list multiplication and sorted, like below:

newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0  12345  882   38182  Female
1  12345  882   38182  Female
2  12345  882   38182  Female
3  32917  271   88172    Male
4  32917  271   88172    Male
5  32917  271   88172    Male
6  18273  552   90291  Female
7  18273  552   90291  Female
8  18273  552   90291  Female
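Versions 5 and 6 group the copies by sorting on the index labels, so they keep the original row order only when the index is already sorted. Here is a small sketch of the difference, assuming a hypothetical frame whose index is [2, 0, 1]; the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with an unsorted index.
df = pd.DataFrame(
    {"Person": [12345, 32917, 18273], "Gender": ["Female", "Male", "Female"]},
    index=[2, 0, 1],
)

# Index.repeat keeps the rows in their original order.
keeps_order = df.loc[df.index.repeat(3)].reset_index(drop=True)

# concat + sort_index reorders the rows by index label (0, 1, 2).
sorted_by_label = pd.concat([df] * 3).sort_index().reset_index(drop=True)

print(keeps_order["Person"].tolist())
print(sorted_by_label["Person"].tolist())
```

If the original row order matters and the index may be unsorted, the repeat-based versions are the safer choice.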

Timings:

Timing with the following code:

import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame({'Person': {0: 12345, 1: 32917, 2: 18273}, 'ID': {0: 882, 1: 271, 2: 552}, 'ZipCode': {0: 38182, 1: 88172, 2: 90291}, 'Gender': {0: 'Female', 1: 'Male', 2: 'Female'}})

def version1():
    newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
    newdf.columns = df.columns
    
def version2():
    newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)

    
def version3():
    newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)

    
def version4():
    newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)

    
def version5():
    newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)

    
def version6():
    newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)
    
print('Version 1 Speed:', timeit.timeit('version1()', 'from __main__ import version1', number=20000))
print('Version 2 Speed:', timeit.timeit('version2()', 'from __main__ import version2', number=20000))
print('Version 3 Speed:', timeit.timeit('version3()', 'from __main__ import version3', number=20000))
print('Version 4 Speed:', timeit.timeit('version4()', 'from __main__ import version4', number=20000))
print('Version 5 Speed:', timeit.timeit('version5()', 'from __main__ import version5', number=20000))
print('Version 6 Speed:', timeit.timeit('version6()', 'from __main__ import version6', number=20000))

Output:

Version 1 Speed: 9.879425965991686
Version 2 Speed: 7.752138633004506
Version 3 Speed: 7.078321029010112
Version 4 Speed: 8.01169377300539
Version 5 Speed: 19.853051771002356
Version 6 Speed: 9.801617017001263

In this run, Versions 2 and 3 are faster than the others. Both use np.repeat, and NumPy functions are fast because they are implemented in C. Version 1 uses np.repeat as well, but presumably pays for the separate column assignment.

Version 3 beats Version 2 marginally, since it indexes the existing frame with loc instead of constructing a new DataFrame from an array.

Version 5 is significantly slower because it calls both concat and sort_index: concat copies every frame in the list into a new one, and sort_index then has to reorder the result.

Fastest Version: Version 3.
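The timings above were taken on a 3-row frame, so they mostly reflect fixed per-call overhead and may rank differently on larger data. Below is a rough sketch for re-running the comparison on a bigger frame; the 30,000-row size, the `big` and `candidates` names, and number=5 are arbitrary assumptions, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame({
    "Person": [12345, 32917, 18273],
    "ID": [882, 271, 552],
    "ZipCode": [38182, 88172, 90291],
    "Gender": ["Female", "Male", "Female"],
})
# Hypothetical larger input: the sample rows stacked up to 30,000 rows.
big = pd.concat([small] * 10_000, ignore_index=True)

candidates = {
    "Version 2 (values + np.repeat)": lambda: pd.DataFrame(
        np.repeat(big.values, 3, axis=0), columns=big.columns
    ),
    "Version 3 (loc + np.repeat)": lambda: big.loc[
        np.repeat(big.index, 3)
    ].reset_index(drop=True),
    "Version 4 (loc + Index.repeat)": lambda: big.loc[
        big.index.repeat(3)
    ].reset_index(drop=True),
    "Version 5 (concat + sort_index)": lambda: pd.concat(
        [big] * 3
    ).sort_index().reset_index(drop=True),
}

timings = {}
for label, func in candidates.items():
    # number=5 is an arbitrary choice that keeps the run short.
    timings[label] = timeit.timeit(func, number=5)
    print(f"{label}: {timings[label]:.3f} s")
```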

So you offered a +100 bounty to get votes for the winter bash hat, but eventually you never awarded the bounty to anyone? That's so cynical :D — mozway, Jan 07 '23 at 09:56
Versions 1 and 2 lose the dtypes: all columns get converted to `object`. IIRC, it's because NumPy arrays need to have a single dtype. Why use those versions anyway if version 3 is faster? — wjandrea, Feb 24 '23 at 21:19
Versions 5 and 6 will only work if the input index is already in sorted order, right? — wjandrea, Feb 24 '23 at 21:23

score 18 · Answer 2 · answered Jun 10 '18 at 22:53

These will repeat the indices and preserve the columns, as the OP demonstrated.

`iloc` version 1

df.iloc[np.arange(len(df)).repeat(3)]

`iloc` version 2

df.iloc[np.arange(len(df) * 3) // 3]
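Because iloc selects by position, these two versions do not depend on what the index labels look like. As a hedged illustration, the sketch below uses a hypothetical frame with a duplicated label ("b") and compares the positional approach with a label-based loc lookup; the frame and variable names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index contains the label "b" twice.
df = pd.DataFrame(
    {"Person": [12345, 32917, 18273], "Gender": ["Female", "Male", "Female"]},
    index=["b", "a", "b"],
)

# Positional: each of the 3 rows is taken exactly 3 times.
by_position = df.iloc[np.arange(len(df)).repeat(3)]
print(len(by_position))

# Label-based: every "b" lookup matches both "b" rows, so extra rows appear.
by_label = df.loc[df.index.repeat(3)]
print(len(by_label))
```

With a unique index the two approaches select the same rows; the difference only shows up when labels repeat.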

piRSquared

This works like a charm for Dataframes with MultiIndex values, which did not seem to be the case with the accepted solution. The latter could not handle MultiIndexing. – HarryS Jun 08 '21 at 08:31

score 16 · Answer 3 · edited Jan 03 '21 at 04:20

Using concat:

pd.concat([df]*3).sort_index()
Out[129]: 
   Person   ID  ZipCode  Gender
0   12345  882    38182  Female
0   12345  882    38182  Female
0   12345  882    38182  Female
1   32917  271    88172    Male
1   32917  271    88172    Male
1   32917  271    88172    Male
2   18273  552    90291  Female
2   18273  552    90291  Female
2   18273  552    90291  Female

edited Jan 03 '21 at 04:20 by ppwater · answered Jun 11 '18 at 00:27 by BENY

score 7 · Answer 4 · answered Mar 26 '22 at 14:50

I'm not sure why this was never proposed, but you can easily use df.index.repeat in conjunction with .loc:

new_df = df.loc[df.index.repeat(3)]

Output:

>>> new_df
   Person   ID  ZipCode  Gender
0   12345  882    38182  Female
0   12345  882    38182  Female
0   12345  882    38182  Female
1   32917  271    88172    Male
1   32917  271    88172    Male
1   32917  271    88172    Male
2   18273  552    90291  Female
2   18273  552    90291  Female
2   18273  552    90291  Female
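Index.repeat also accepts one count per row, which covers the case where each row should be replicated a different number of times. This is a minimal sketch, assuming a hypothetical "copies" column that holds the per-row counts; it is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": [12345, 32917, 18273],
    "copies": [2, 0, 3],  # hypothetical per-row repeat counts
})

# Pass an array of counts instead of a single integer; a count of 0 drops the row.
new_df = df.loc[df.index.repeat(df["copies"].to_numpy())].reset_index(drop=True)
print(new_df)
```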

score 4 · Answer 5 · edited Aug 20 '21 at 18:15

You can try the following code:

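df = df.iloc[df.index.repeat(3), :].reset_index(drop=True)

df.index.repeat(3) creates an Index in which each index value is repeated 3 times, and df.iloc[df.index.repeat(3), :] builds a DataFrame from the rows at those positions. Since iloc is positional, this relies on the default 0..n-1 index; with any other index, use loc instead. drop=True keeps reset_index from adding the old index as a column.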

edited Aug 20 '21 at 18:15 by Flair · answered Aug 20 '21 at 14:58 by mahesha sahoo

score 3 · Answer 6 · answered Jun 10 '18 at 22:41

You can do it like this.

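import pandas as pd
import numpy as np

def do_things(df, n_times):
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    # The original rows are kept, so each name ends up n_times + 1 times.
    repeats = pd.DataFrame({'name': np.repeat(df['name'].values, n_times)})
    ndf = pd.concat([df, repeats])
    ndf = ndf.sort_values(by='name')
    ndf = ndf.reset_index(drop=True)
    return ndf

if __name__ == '__main__':
    df = pd.DataFrame({'name': ['Peter', 'Quill', 'Jackson']})
    n_times = 3
    print(do_things(df, n_times))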

And with explanation...

import pandas as pd
import numpy as np

n_times = 3
df = pd.DataFrame({'name' : ['Peter', 'Quill', 'Jackson']})
#       name
# 0    Peter
# 1    Quill
# 2  Jackson

#   Duplicating data.
#   (DataFrame.append was removed in pandas 2.0, so pd.concat is used here.)
df = pd.concat([df, pd.DataFrame({'name' : np.repeat(df.name.values, n_times) })])
#       name
# 0    Peter
# 1    Quill
# 2  Jackson
# 0    Peter
# 1    Peter
# 2    Peter
# 3    Quill
# 4    Quill
# 5    Quill
# 6  Jackson
# 7  Jackson
# 8  Jackson

#   The DataFrame is sorted by 'name' column.
df = df.sort_values(by=['name'])
#       name
# 2  Jackson
# 6  Jackson
# 7  Jackson
# 8  Jackson
# 0    Peter
# 0    Peter
# 1    Peter
# 2    Peter
# 1    Quill
# 3    Quill
# 4    Quill
# 5    Quill

#   Resetting the index.
#   You can play with drop=True and drop=False, as parameter of `reset_index()`
df = df.reset_index()
#     index     name
# 0       2  Jackson
# 1       6  Jackson
# 2       7  Jackson
# 3       8  Jackson
# 4       0    Peter
# 5       0    Peter
# 6       1    Peter
# 7       2    Peter
# 8       1    Quill
# 9       3    Quill
# 10      4    Quill
# 11      5    Quill

score 1 · Answer 7 · answered Sep 14 '22 at 20:51

If you need to index your repeats (e.g. for a multi-index) and also base the number of repeats on a value in a column, you can do this:

someDF["RepeatIndex"] = someDF["RepeatBasis"].fillna(value=0).apply(lambda x: list(range(int(x))) if x > 0 else [])
superDF = someDF.explode("RepeatIndex").dropna(subset="RepeatIndex")

This gives a DataFrame in which each record is repeated however many times is indicated in the "RepeatBasis" column. The DataFrame also gets a "RepeatIndex" column, which you can combine with the existing index to make into a multi-index, preserving index uniqueness.

If anyone's wondering why you'd want to do such a thing, in my case it's when I get data in which frequencies have already been summarized and for whatever reason, I need to work with singular observations. (think of reverse-engineering a histogram)
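To make the two lines above concrete, here is a small worked sketch on made-up data. The "value" column and its contents are hypothetical, the set_index call is one possible way to build the multi-index described above, and subset is passed as a list so the call also works on older pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical summarized data: "RepeatBasis" holds a frequency per row.
someDF = pd.DataFrame({
    "value": ["a", "b", "c"],
    "RepeatBasis": [2, np.nan, 3],
})

# Build a list [0, 1, ..., n-1] per row; missing or zero counts give [].
someDF["RepeatIndex"] = someDF["RepeatBasis"].fillna(value=0).apply(
    lambda x: list(range(int(x))) if x > 0 else []
)
# explode turns each list element into its own row; empty lists become NaN
# and are dropped again.
superDF = someDF.explode("RepeatIndex").dropna(subset=["RepeatIndex"])

# Combine the original index with RepeatIndex into unique (row, copy) pairs.
superDF = superDF.set_index("RepeatIndex", append=True)
print(superDF)
```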

FirefoxMetzger · Answer 8 · 2022-12-13T08:31:26.037

This question doesn't have enough answers yet! Here are some more ways to do this that are still missing and that allow chaining :)

# SQL-style cross-join
# (one line and counts replicas)
(
    data
    .join(pd.DataFrame(range(3), columns=["replica"]), how="cross")
    .drop(columns="replica")  # remove if you want to count replicas
)

# DataFrame.apply + Series.repeat
# (most readable, but potentially slow)
(
    data
    .apply(lambda x: x.repeat(3))
    .reset_index(drop=True)
)

# DataFrame.explode
# (fun to have explosions in your code)
(
    data
    .assign(replica=lambda df: [[x for x in range(3)]] * len(df))
    .explode("replica", ignore_index=True)
    .drop(columns="replica")  # or keep if you want to know which copy it is
)

(Edit: On a more serious note, using explode is useful if you need to count replicas and have a dynamic replica count per row. For example, if you have per-customer usage data with a start and end date, you can use the above to transform the data into monthly per-customer usage data.)

And of course here is the snippet to create the data for testing:

data = pd.DataFrame([
        [12345, 882, 38182, "Female"],
        [32917, 271, 88172, "Male"],
        [18273, 552, 90291, "Female"],
    ],
    columns=["Person", "ID", "ZipCode", "Gender"]
)

score 1 · Answer 9 · answered Dec 27 '22 at 13:01

Use pd.concat: build a list of three copies of the DataFrame and concatenate them. It takes very little code:

df = pd.concat([df]*3, ignore_index=True)

print(df)

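   Person   ID  ZipCode  Gender
0   12345  882    38182  Female
1   32917  271    88172    Male
2   18273  552    90291  Female
3   12345  882    38182  Female
4   32917  271    88172    Male
5   18273  552    90291  Female
6   12345  882    38182  Female
7   32917  271    88172    Male
8   18273  552    90291  Female

Note: ignore_index=True makes the index reset. The copies are stacked one after another, so each row is not directly followed by its duplicates. If you need that order, sort by the original index before resetting it, e.g. pd.concat([df]*3).sort_index().reset_index(drop=True).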

score 1 · Answer 10 · answered Dec 28 '22 at 19:23

Could also use np.tile()

df.loc[np.tile(df.index,3)].sort_index().reset_index(drop=True)

Output:

   Person   ID  ZipCode  Gender
0   12345  882    38182  Female
1   12345  882    38182  Female
2   12345  882    38182  Female
3   32917  271    88172    Male
4   32917  271    88172    Male
5   32917  271    88172    Male
6   18273  552    90291  Female
7   18273  552    90291  Female
8   18273  552    90291  Female
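The sort_index call is needed here because np.tile repeats the whole sequence, while np.repeat repeats each element in place. A short sketch of the difference on a plain array (the variable names are illustrative):

```python
import numpy as np

index = np.array([0, 1, 2])

tiled = np.tile(index, 3)       # whole sequence repeated
repeated = np.repeat(index, 3)  # each element repeated in place

print(tiled.tolist())     # [0, 1, 2, 0, 1, 2, 0, 1, 2]
print(repeated.tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```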
