I know that replicating rows is usually bad for performance, which is why most answers on Stack Overflow suggest alternatives instead of explaining how to do it. For my use case, though, I actually need to replicate the rows.

I have a table with replication weights,

   id   some_value weight
    1            2      5
    2            A      2
    3            B      1
    4            3      3

where I need to repeat each row as many times as its weight value. Think of a huge data frame. What would be a very efficient way to achieve this?

Expected output:

   id   some_value weight
    1            2      5
    1            2      5
    1            2      5
    1            2      5
    1            2      5
    2            A      2
    2            A      2
    3            B      1
    4            3      3
    4            3      3
    4            3      3
FooBar
  • Can you shed more light on what you really want to do? This isn't enough. You could show the code you've tried. – Yax Nov 08 '14 at 21:51
  • You should be able to use `loc` and `np.repeat`, as done [here](http://stackoverflow.com/questions/26777832/replicating-rows-in-a-pandas-data-frame-by-a-column-value/26778637#26778637). Could you confirm that I'm reading your goal correctly? If so, I can close as a dup. – DSM Nov 08 '14 at 21:54
  • @DSM I am aware of that post, but I am asking for the most (very) efficient way. I thought that perhaps there was a way of generating a second df with the correct new index and then fill it up somehow which would make this process faster. – FooBar Nov 11 '14 at 19:44

3 Answers

Here are two ways

1) Using set_index and repeat

In [1070]: df.set_index(['id', 'some_value'])['weight'].repeat(df['weight']).reset_index()
Out[1070]:
    id some_value  weight
0    1          2       5
1    1          2       5
2    1          2       5
3    1          2       5
4    1          2       5
5    2          A       2
6    2          A       2
7    3          B       1
8    4          3       3
9    4          3       3
10   4          3       3

2) Using .loc and .repeat

In [1071]: df.loc[df.index.repeat(df.weight)].reset_index(drop=True)
Out[1071]:
    id some_value  weight
0    1          2       5
1    1          2       5
2    1          2       5
3    1          2       5
4    1          2       5
5    2          A       2
6    2          A       2
7    3          B       1
8    4          3       3
9    4          3       3
10   4          3       3

Details

In [1072]: df
Out[1072]:
   id some_value  weight
0   1          2       5
1   2          A       2
2   3          B       1
3   4          3       3
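
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt from the question; `by_loc` follows approach 2 above, while `by_iloc` is a positional variant added here as an assumption (it is not part of the original answer) that may help when the index is not unique. The names `by_loc`, `by_iloc`, and `positions` are illustrative only, and no timing claims are made.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "some_value": [2, "A", "B", 3],
    "weight": [5, 2, 1, 3],
})

# Approach 2: repeat each index label `weight` times, then select by label.
by_loc = df.loc[df.index.repeat(df["weight"])].reset_index(drop=True)

# Positional variant (assumption, not from the answer): repeat row positions
# with np.repeat and select with iloc, which does not depend on index labels.
positions = np.repeat(np.arange(len(df)), df["weight"].to_numpy())
by_iloc = df.iloc[positions].reset_index(drop=True)

print(by_loc)
```

Both results should have `df["weight"].sum()` rows, here 11.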
Zero

Perhaps treat it like a weighted array:

def weighted_array(arr, weights):
    # Repeat each element of arr as many times as its matching weight.
    weighted_arr = []
    for value, weight in zip(arr, weights):
        weighted_arr.extend([value] * weight)
    return weighted_arr

The returned weighted_arr contains each element of arr repeated as many times as the corresponding entry in weights.
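
Since the question asks about large data, the same idea can probably be expressed without a Python-level loop. The sketch below uses `np.repeat`; the function name `weighted_array_np` is hypothetical, and `dtype=object` is an assumption made so that mixed values such as `2` and `'A'` keep their original types.

```python
import numpy as np


def weighted_array_np(arr, weights):
    """Repeat arr[i] weights[i] times using np.repeat (no Python loop)."""
    # dtype=object keeps mixed int/str values as-is instead of casting to str.
    values = np.asarray(arr, dtype=object)
    return np.repeat(values, weights)


result = weighted_array_np([2, "A", "B", 3], [5, 2, 1, 3])
print(result.tolist())
```

For a purely numeric column, dropping `dtype=object` lets NumPy use a native dtype, which is likely to be faster.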

user308827

This is similar to `uncount` in tidyr:

https://tidyr.tidyverse.org/reference/uncount.html

I wrote a package (https://github.com/pwwang/datar) that implements this API:

from datar import f
from datar.tibble import tibble
from datar.tidyr import uncount

df = tibble(
    id=range(1, 5),
    some_value=[2, 'A', 'B', 3],
    weight=[5, 2, 1, 3],
)
df >> uncount(f.weight, _remove=False)

Output:

   id some_value  weight
0   1          2       5
0   1          2       5
0   1          2       5
0   1          2       5
0   1          2       5
1   2          A       2
1   2          A       2
2   3          B       1
3   4          3       3
3   4          3       3
3   4          3       3
Panwen Wang