Pandas idiomatic way to do this custom duplicating rows function?

Question

I have got a pd.DataFrame, with each line representing a group of people. They have got an id (several columns in my dataframe, but here summarized by the "id" column of my example dataframe). Each of this group represent several people (column "size").

I am trying to divide theses groups into smaller groups of a maximal size "max_size". For example, if max_size = 5a row with "id" = "foo" and "size" = 13 should be replaced by three rows, all with "id" = "foo" and respective "size" = 5, "size" = 5, and "size" = 3

I have coded an usable function, but am looking for a more pandas idiomatic way to do it, if it exists.

my function is

def custom_duplicating_function(df):
    def aux_custom_duplicating_function(row, max_size=5):
        row = row.to_dict()
        size = row["size"]
        L = [row.copy() for i in range((size // max_size + 1))]
        for i in range(len(L) - 1):
            L[i]["size"] = max_size 
        L[-1]["size"] = size%max_size
        return(pd.DataFrame.from_dict(L))

    temp = df.apply(aux_custom_duplicating_function, axis=1)
    result = pd.concat([temp[i] for i in range(len(temp.index))])
    return(result)

The following dataframe

test = pd.DataFrame.from_dict([{"id":"foo", "size":13},
                     {"id":"bar", "size":17},
                     {"id":"baz", "size":3}])
************
    id  size
0  foo    13
1  bar    17
2  baz     3
************

should be transformed in

    id  size
0  foo     5
1  foo     5
2  foo     3
0  bar     5
1  bar     5
2  bar     5
3  bar     2
0  baz     3

score 3 · Accepted Answer · answered Sep 05 '19 at 10:25

3

Use explode for pandas >=0.25

test['size'] = test['size'].apply(lambda x:[5]*(x//5)+[(x%5)])

test.explode('size')

answered Sep 05 '19 at 10:25

Mark Wang

2,623
7
15

2

i like this answer, clever .. :) +1 – anky Sep 05 '19 at 10:46

score 0 · Answer 2 · answered Sep 05 '19 at 10:24

We could nest the items via apply and then unnest them using e.g. the code from this answer.

import pandas as pd
max_size=5
test = pd.DataFrame.from_dict([{"id":"foo", "size":13},
                     {"id":"bar", "size":17},
                     {"id":"baz", "size":3}])

test['size'] = test['size'].apply(lambda x: [max_size]*(x//max_size)+[x%max_size])
test2 = test.apply(lambda x: pd.Series(x['size']),axis=1).stack().reset_index(level=1, drop=True)
test2.name = 'size'
test.drop('size', axis=1).join(test2)

    id  size
0  foo   5.0
0  foo   5.0
0  foo   3.0
1  bar   5.0
1  bar   5.0
1  bar   5.0
1  bar   2.0
2  baz   3.0

Pandas idiomatic way to do this custom duplicating rows function?

2 Answers2