Expand the dataset based on the column values

Question

I have a dataframe df1:

Date_1     Date_2       i_count c_book
01/09/2019  02/08/2019  2       204
01/09/2019  03/08/2019  2       211
01/09/2019  04/08/2019  2       218
01/09/2019  05/08/2019  2       226
01/09/2019  06/08/2019  2       234
01/09/2019  07/08/2019  2       242
01/09/2019  08/08/2019  2       251
01/09/2019  09/08/2019  2       259
01/09/2019  10/08/2019  3       269
01/09/2019  11/08/2019  3       278
01/09/2019  12/08/2019  3       288
01/09/2019  13/08/2019  3       298
01/09/2019  14/08/2019  3       308
01/09/2019  15/08/2019  3       319
01/09/2019  16/08/2019  4       330
01/09/2019  17/08/2019  4       342
01/09/2019  18/08/2019  4       354
01/09/2019  19/08/2019  4       366
01/09/2019  20/08/2019  4       379
01/09/2019  21/08/2019  5       392
01/09/2019  22/08/2019  5       406
01/09/2019  23/08/2019  6       420
01/09/2019  24/08/2019  6       435
01/09/2019  25/08/2019  7       450
01/09/2019  26/08/2019  8       466
01/09/2019  27/08/2019  9       483
01/09/2019  28/08/2019  10      500
01/09/2019  29/08/2019  11      517
01/09/2019  30/08/2019  12      535
01/09/2019  31/08/2019  14      554

I want to expand the dataset based on i_count. i_count is the count of rows to be replicated. so lets say if i_count = 2 implies that 2 rows need to be replicated for the same.

Also, I want to create a new column c_book_i such that c_book should be divided within the entries in the dataset. for example, if i_count = 2, signifies that new dataframe should have 2 entries and c_book_i should have 2 entries such that sum(c_book_i) = c_book. The last constraint is that I want to have c_book_i > 10 in all the cases.

so Far :

def f(x):
    i = np.random.random(len(x))
    j = i/sum(i) * x
    return j

joined_df2 = df1.reindex(df1.index.repeat(df1['i_count']))
joined_df2['c_book_i'] = joined_df2.groupby(['Date_1','Date_2'])['c_book'].transform(f)

This provides me the same but without the check that the c_book should be greater than 10. there are a lot of values coming less than 10.

Can anyone help with the same.

Thanks

splash58 · Accepted Answer · 2019-11-10T21:39:27.170

1

On base of the solution

def f(x):
    total = x.iloc[0].astype(int)
    minimum = 10
    dividers = sorted(random.sample(range(minimum, total-minimum, minimum), len(x) - 1))
    return [a - b for a, b in zip(dividers + [total], [0] + dividers)]

How it works. Let the total is 12, and we want to split it to 4 parts with minimum 2. We get a range from 2 to 12-2 with step 2 => [2,4,6,8,10]. Then get any 3 numbers, f.e. 2,4,8 and add borders, so, [0,2,4,8,12]. Now differences between items of that list [2,2,4,4] will get sum 12 (diff between borders) and any from them can't be less than 2

edited Nov 10 '19 at 21:39

answered Nov 08 '19 at 13:36

splash58

26,043
3
22
34

could you explain this once, why did you use sorted with dividers. I would be grateful if you could explain this once to me. – dper Nov 10 '19 at 19:05
have you seen the link? – splash58 Nov 10 '19 at 19:09
Also I am getting this error while applying this function : `TypeError: 'numpy.float64' object cannot be interpreted as an integer` – dper Nov 10 '19 at 19:21
Can you please take a look at the error and suggest changes based on the same. – dper Nov 10 '19 at 19:27
Sure, please edit the answer once you get the time, I am getting the error as I posted in the previous comment. any help would be greatly appreciated. Thanks a lot. – dper Nov 10 '19 at 19:32
Seems, it depends on pandas version. It works for me. I add head() to the answer. At what line you get the error? – splash58 Nov 10 '19 at 20:20
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202127/discussion-between-dper-and-splash58). – dper Nov 10 '19 at 20:36

hirolau · Answer 2 · 2019-11-08T13:06:32.547

0

What about:

def distribute_randomly(array):

    # This is the minimum to give each:
    minimum = 10

    # This means we have to reserve this amount:
    min_value_sum = len(array)*minimum

    # The rest we can distribute:
    to_distribute = array.sum() - min_value_sum

    # Get random values that all sum up to 1:
    random_values = numpy.random.rand(len(array))
    random_values = random_values/random_values.sum()

    # Return the minimum + a part of what is left to distribute
    return random_values*to_distribute + minimum

# Expand rows based on length of i_count:
df1 = df1.join(df1['i_count'].apply(lambda x: range(x)).explode().rename('dummy'))

# transform cbook_ to randomize
df1['c_book_2'] = df1.groupby('i_count')['c_book'].transform(distribute_randomly)

# Finally make sure they are not below 10:
df1['c_book_i'] = df1['c_book_2'].where(df1['c_book_2']>10, 10)

# If needed:
df1 = df1.reset_index()

Edit: Added "random" distribute function.

edited Nov 08 '19 at 13:06

answered Nov 08 '19 at 12:38

hirolau

13,451
8
35
47

Thank you for your answer, however, I donot want to do x/len(x) I want to make it random so that the values are different and not same for c_book_i – dper Nov 08 '19 at 12:39
Is there any way that it can be randomly allocated instead of doing x/len(x) – dper Nov 08 '19 at 12:40
Sure, but you would need some kind of distribution of how you want to randomize it. – hirolau Nov 08 '19 at 12:43
I am not sure if your code would take into account that sum(c_book_i) for a (date_1, date_2) would be equal to c_b for (date_1,date_2). Also any distribution works in that sense, since the analysis is on the aggregated c_b and the values present satisfies that. So I am not very particular about the distribution. it would be great if you could provide some distribution formats in your answer so that it can be achieved. As per my understanding, the lambda needs to be changed only. Also I could not understand the second step explanation... `cbook_` is not present with us. – dper Nov 08 '19 at 12:48
I am not following what you want exactly, but have tried to understand. – hirolau Nov 08 '19 at 13:00
my apologies, I mentioned too much in a single comment, let me divide it – dper Nov 08 '19 at 13:02
1. So there is a constraint that after expanding, for a particular `(date_1, date_2) ` `sum(c_booked_i)` should be equal to c_booked. for example - if there are 2 entries, sum of `c_booked_i` for those 2 entries should be equal to `c_booked` – dper Nov 08 '19 at 13:04
2. I could not understand the second step explanation... `cbook_` is not present with us. – dper Nov 08 '19 at 13:05
What about now? – hirolau Nov 08 '19 at 13:08
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/202053/discussion-between-dper-and-hirolau). – dper Nov 08 '19 at 13:08

Expand the dataset based on the column values

2 Answers2