Replicating rows in a pandas data frame by a column value

Question

I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.

import pandas as pd

what_i_have = pd.DataFrame(data={
  'id': ['A', 'B', 'C'],
  'n' : [  1,   2,   3],
  'v' : [ 10,  13,   8]
})

what_i_want = pd.DataFrame(data={
  'id': ['A', 'B', 'B', 'C', 'C', 'C'],
  'v' : [ 10,  13,  13,   8,   8,   8]
})

Is this possible?

Is there any reason to do so? I think data duplication is something best avoided. — greole, Nov 06 '14 at 11:06
Well this is an intermediate step- I am generating the "v" column according to a probability distribution, and then I will add another column by randomly selecting rows from another dataset. — Mersenne Prime, Nov 06 '14 at 11:34
Still I don't see a reason not to do it directly. But I would need more information on what you are actually trying to achieve. — greole, Nov 06 '14 at 11:59
Not sure what you mean by "do it directly" here? The accepted answer is exactly what I am looking for, though. — Mersenne Prime, Nov 06 '14 at 12:09
I am using this intermediate step before converting factors to dummy variables. — user2165, Dec 04 '15 at 06:51

score 51 · Accepted Answer · edited Nov 20 '21 at 04:03

51

You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:

df2 = df.loc[df.index.repeat(df.n)]

  id  n   v
0  A  1  10
1  B  2  13
1  B  2  13
2  C  3   8
2  C  3   8
2  C  3   8

Or you could use np.repeat to get the repeated indices and then use that to index into the frame:

df2 = df.loc[np.repeat(df.index.values, df.n)]

  id  n   v
0  A  1  10
1  B  2  13
1  B  2  13
2  C  3   8
2  C  3   8
2  C  3   8

After which there's only a bit of cleaning up to do:

df2 = df2.drop("n", axis=1).reset_index(drop=True)

  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8

Note that if you might have duplicate indices to worry about, you could use .iloc instead:

df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)

  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8

which uses the positions, and not the index labels.

edited Nov 20 '21 at 04:03

Henry Ecker

34,399
18
41
57

answered Nov 06 '14 at 11:42

DSM

342,061
65
592
494

10

With newer version, can be `df.loc[df.index.repeat(df.n)]` – Zero Sep 09 '17 at 18:37
@zero: this should be the new accepted answer – 576i Sep 21 '21 at 10:40
If you get error `ValueError: repeats may not contain negative values.` then try `df.loc[df.index.repeat(df.n.fillna(value=0))]` (repeat doesn't like nan) – James_SO Sep 14 '22 at 20:31

score 2 · Answer 2 · answered Sep 09 '17 at 18:29

You could use set_index and repeat

In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8

Details

In [1058]: df
Out[1058]:
  id  n   v
0  A  1  10
1  B  2  13
2  C  3   8

score 1 · Answer 3 · answered Mar 06 '21 at 21:52

It's something like the uncount in tidyr:

https://tidyr.tidyverse.org/reference/uncount.html

I wrote a package (https://github.com/pwwang/datar) that implements this API:

from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount

what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,   8
)
what_i_have >> uncount(f.n)

Output:

score 0 · Answer 4 · answered Jul 19 '20 at 10:44

Not the best solution, but I want to share this: you could also use pandas.reindex() and .repeat():

df.reindex(df.index.repeat(df.n)).drop('n', axis=1)

Output:

You can further append .reset_index(drop=True) to reset the .index.

Replicating rows in a pandas data frame by a column value

4 Answers4

Linked

Related