
I currently have a dataset where I'm trying to group rows based on one column and sum the columns whose values are integers.

However, the catch is that I would like to start a new row once the sum has reached a certain threshold.

For example, in the dataframe below, I am trying to group the rows by company name and sum up the weights; however, I do not want any summed weight to exceed 100.

Input dataframe:

Company  Weight
a        30
b        45
a        27
a        40
b        57
a        57
b        32
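
For reference, the input can be built like this (a minimal reproducible snippet, assuming pandas):

import pandas as pd

df = pd.DataFrame(
    {"Company": ["a", "b", "a", "a", "b", "a", "b"],
     "Weight": [30, 45, 27, 40, 57, 57, 32]}
)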

Output dataframe:

Company  Weight
a        97
a        57
b        89
b        45

I have tried using groupby and sum; however, they cannot detect whether the running total has reached the maximum.

Is there any way I can achieve this?

Any help would be greatly appreciated!

2 Answers


I think loops are necessary here, so to improve performance use Numba (a modified version of a solution from Divakar). Call the function per group via GroupBy.transform, then aggregate with sum:

import numpy as np
from numba import njit

@njit
def make_groups(x, target):
    # Assign a sub-group id to each value, opening a new sub-group
    # once the running total reaches the target.
    result = np.empty(len(x), dtype=np.uint64)
    total = 0
    group = 0
    for i, x_i in enumerate(x):
        total += x_i
        if total >= target:
            group += 1
            total = x_i  # the current value opens the new sub-group's total
        result[i] = group
    return result

# per-company sub-group ids, splitting at a running total of 100
g = df.groupby("Company")["Weight"].transform(lambda x: make_groups(x.to_numpy(), 100))

df1 = (df.groupby(by=["Company", g])
         .sum()
         .reset_index(1, drop=True)
         .sort_values(["Company", "Weight"], ascending=[True, False])
         .reset_index())
print(df1)
  Company  Weight
0       a      97
1       a      57
2       b      89
3       b      45
jezrael
  • I did something similar, but it does not match the order of rows in the OP's output. If you are allowed to reorder rows, you can get some weights closer to 100, which is what the OP wants. – hilberts_drinking_problem Jun 03 '21 at 07:10

Well, it depends: as stated, you're asking an NP-hard problem (essentially bin packing), unless you don't need each group's weight to be optimally close to 100. There are a few algorithms you can use,

but none of them are O(n), which is what groupby and sum are. Say you iterate with iterrows() (try to avoid that): could you do it in one pass? If you are not looking for an optimal solution (each group as close to 100 as possible), there is an option.

For every company, sort its weights in increasing order, then iterate, opening a new row every time the running sum reaches 100; collect the new rows in a side variable and replace the originals at the end (see the sketch below).

There isn't a standard pandas/NumPy solution that I know of.
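
A minimal sketch of that greedy approach (greedy_rows is a hypothetical helper; it assumes the question's Company/Weight columns and a cap of 100, and because weights are sorted first, the splits can differ from the OP's expected output):

import pandas as pd

def greedy_rows(df, cap=100):
    # For each company, sort the weights ascending and fill output rows
    # greedily, starting a new row whenever adding a weight would exceed the cap.
    rows = []
    for company, weights in df.groupby("Company")["Weight"]:
        total = 0
        for w in sorted(weights):
            if total and total + w > cap:
                rows.append((company, total))
                total = 0
            total += w
        if total:
            rows.append((company, total))
    return pd.DataFrame(rows, columns=["Company", "Weight"])

print(greedy_rows(df))  # df as built in the question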

masasa
  • That's actually what I'm trying right now! Thank you for the suggestion! I was mainly hoping for a faster option in terms of time. – ChrisHo1341 Jun 03 '21 at 07:06
  • If you don't have a memory issue, and you know that, say, the minimum weight is 10 (so each sum is capped at 10 terms), maybe you can build 10 arrays, each shifted by one element, and sum them with NumPy, rolling between solutions; on large-scale data that might be more efficient. But that only works if you have such a cap. – masasa Jun 03 '21 at 07:10