4

I'd like to divide a certain number of items over multiple rows. Every row should get at least 1, but the rest according to their required share, until all items have been distributed. Lets say we have 6 available, I'd like to get the result as follows.

Using max(1, factor * available) doesn't necessarily make up to the total number of available items.

Is there a way? I have the data in a spark environment, so the coolest method would be a pyspark or even pandas/numpy solution. It can quite easily be done in a python loop obviously.

Input: Total available 6

+---+-------------+
| c1|       factor|
+---+-------------+
|  A|        0.001|
|  B|          0.2|
|  C|          0.2|
|  D|          0.2|
|  E|          0.3|
+---+-------------+

Expected output:

+---+-------------+---------+
| c1|       factor|   result|
+---+-------------+---------+
|  A|        0.001|        1|
|  B|          0.2|        1|
|  C|          0.2|        1|
|  D|          0.2|        1|
|  E|          0.3|        2|
+---+-------------+---------+
  • Why do you say that it wouldn't work? Is the input supposed to be a user input? Or where does it come from? Simply defining the column doesn't work? `df['factor'] = max(1,df['factor'] * available)` and `available = int(input('available = '))`. You can rescale `factor` so that it sums always `1` hence assuring your `available` get's fully distributed and proportional to the `factor` value. – Celius Stingher Jan 17 '20 at 14:09

2 Answers2

1

This is how I would approach the issue if I am understanding it correctly:

import numpy as np
import pandas as pd 

data = {'c1':['A','B','C','D','E'],'factor':[0.001,0.2,0.2,0.2,0.3]}
df = pd.DataFrame(data)
df['factor_rescaled'] = df['factor'] / df['factor'].sum()
available = int(input('Available = '))
df['result'] = np.where(round(df['factor_rescaled']*available,0) < 1,1,round(df['factor_rescaled']*available,0))
print(df)

Passing a value of 6 yields the following result:

  c1  factor  factor_rescaled  result
0  A   0.001         0.001110     1.0
1  B   0.200         0.221976     1.0
2  C   0.200         0.221976     1.0
3  D   0.200         0.221976     1.0
4  E   0.300         0.332963     2.0

The rescaling is done to ensure that if the factors are too high or low, the available value is fully distributed and no over/under distributed. Finally I'm not sure where this available number comes from so I just placed it as an input, which for the example is 6.

Final edit: Using np.where because of Why is np.where faster than pd.apply

Celius Stingher
  • 17,835
  • 6
  • 23
  • 53
0

You can do it this way using pandas:

>>> df
  c1  factor
0  A   0.001
1  B   0.200
2  C   0.200
3  D   0.200
4  E   0.300
>>> available, s = 6, df.factor.sum()
>>> df['result'] = df.factor.apply(lambda x: round(max(1, (x/s)*available)))
>>> df
  c1  factor  result
0  A   0.001     1.0
1  B   0.200     1.0
2  C   0.200     1.0
3  D   0.200     1.0
4  E   0.300     2.0
Vicrobot
  • 3,795
  • 1
  • 17
  • 31
  • The problem with this is that factors don't sum 1, so if you use available = 1000 then it won't get fully distributed. df['result'].sum() will be `901` and not `1000` as it supposed. – Celius Stingher Jan 17 '20 at 14:30
  • @CeliusStingher thanks for mentioning the problem. I modified the solution. – Vicrobot Jan 17 '20 at 14:53