43

I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.

import pandas as pd

what_i_have = pd.DataFrame(data={
  'id': ['A', 'B', 'C'],
  'n' : [  1,   2,   3],
  'v' : [ 10,  13,   8]
})

what_i_want = pd.DataFrame(data={
  'id': ['A', 'B', 'B', 'C', 'C', 'C'],
  'v' : [ 10,  13,  13,   8,   8,   8]
})

Is this possible?

Mersenne Prime
  • 565
  • 1
  • 4
  • 5
  • Is there any reason to do so? I think data duplication is something best avoided. – greole Nov 06 '14 at 11:06
  • Well this is an intermediate step- I am generating the "v" column according to a probability distribution, and then I will add another column by randomly selecting rows from another dataset. – Mersenne Prime Nov 06 '14 at 11:34
  • Still I don't see a reason not to do it directly. But I would need more information on what you are actually trying to achieve. – greole Nov 06 '14 at 11:59
  • 1
    Not sure what you mean by "do it directly" here? The accepted answer is exactly what I am looking for, though. – Mersenne Prime Nov 06 '14 at 12:09
  • I am using this intermediate step before converting factors to dummy variables. – user2165 Dec 04 '15 at 06:51

4 Answers4

51

You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:

df2 = df.loc[df.index.repeat(df.n)]

  id  n   v
0  A  1  10
1  B  2  13
1  B  2  13
2  C  3   8
2  C  3   8
2  C  3   8

Or you could use np.repeat to get the repeated indices and then use that to index into the frame:

df2 = df.loc[np.repeat(df.index.values, df.n)]

  id  n   v
0  A  1  10
1  B  2  13
1  B  2  13
2  C  3   8
2  C  3   8
2  C  3   8

After which there's only a bit of cleaning up to do:

df2 = df2.drop("n", axis=1).reset_index(drop=True)

  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8

Note that if you might have duplicate indices to worry about, you could use .iloc instead:

df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)

  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8

which uses the positions, and not the index labels.

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
DSM
  • 342,061
  • 65
  • 592
  • 494
2

You could use set_index and repeat

In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8

Details

In [1058]: df
Out[1058]:
  id  n   v
0  A  1  10
1  B  2  13
2  C  3   8
Zero
  • 74,117
  • 18
  • 147
  • 154
1

It's something like the uncount in tidyr:

https://tidyr.tidyverse.org/reference/uncount.html

I wrote a package (https://github.com/pwwang/datar) that implements this API:

from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount

what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,   8
)
what_i_have >> uncount(f.n)

Output:

  id   v
0  A  10
1  B  13
1  B  13
2  C   8
2  C   8
2  C   8
Panwen Wang
  • 3,573
  • 1
  • 18
  • 39
0

Not the best solution, but I want to share this: you could also use pandas.reindex() and .repeat():

df.reindex(df.index.repeat(df.n)).drop('n', axis=1)

Output:


   id   v
0   A   10
1   B   13
1   B   13
2   C   8
2   C   8
2   C   8

You can further append .reset_index(drop=True) to reset the .index.

Chenglong Ma
  • 345
  • 3
  • 13