
I have a dataframe:

id col1  col2
0  1000   250
1  2000   750
2  1500   350
3  3000   800
4  4500  2500
5  8500  4450
6  6300  1250

I'm trying to find the set of rows that maximizes the sum of the col2 values, subject to the total of those rows' col1 values being <= 15000.

What would be the easiest way to do so?

user9996043
  • The keyword you're looking for is **knapsack problem**. – Stef Aug 24 '21 at 20:53
  • @Stef Knapsack seems to be memory heavy, but yes, it does seem to be what I'm looking for. Any specific implementation you would recommend? – user9996043 Aug 25 '21 at 15:38
  • Yes, you can use a dynamic programming approach, which gives a [pseudo-polynomial time](https://en.wikipedia.org/wiki/Pseudo-polynomial_time) algorithm. The idea is to build an array where cell `(w, i)` tells you what value you can achieve using only items up to id `i` and with a weight at most `w`. Cells `(0, i)` and `(w, 0)` are easy to fill; other cells can be filled with a recursive formula using cells you have already filled. The complexity will be proportional to the number of cells of the array, which in your example would be 8*75, since you only need consider weights multiple of 500. – Stef Aug 26 '21 at 08:57
  • I don't know of an already-implemented module for knapsack in python, although there most certainly exist several. You can look them up in google if you don't want to implement it yourself (although it is an interesting exercise). I suggest dividing all the weights in `col1` by 500 if you are going to use someone else's implementation. – Stef Aug 26 '21 at 08:59
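For reference, here is a minimal sketch of the dynamic-programming approach Stef describes, using the standard one-dimensional rolling-array variant of the (w, i) table from the comment. The function name knapsack_max_col2 and the GCD-based weight scaling are illustrative assumptions, not something given in the thread:

from functools import reduce
from math import gcd

import pandas as pd

def knapsack_max_col2(df, limit):
    """0/1 knapsack: choose rows maximizing the col2 total, subject to the
    col1 total being <= limit. Pseudo-polynomial DP over scaled weights;
    dividing by the common divisor shrinks the table, in the spirit of
    Stef's divide-by-500 suggestion."""
    weights = df['col1'].tolist()
    values = df['col2'].tolist()
    step = reduce(gcd, weights + [limit])   # 100 for the sample data
    weights = [w // step for w in weights]
    cap = limit // step

    # best[c] = (max col2 total achievable with scaled capacity c, row labels used)
    best = [(0, [])] * (cap + 1)
    for label, w, v in zip(df.index, weights, values):
        for c in range(cap, w - 1, -1):     # iterate downward so each row is used at most once
            cand = best[c - w][0] + v
            if cand > best[c][0]:
                best[c] = (cand, best[c - w][1] + [label])
    return best[cap]

# total, rows = knapsack_max_col2(df, 15000)

With the roughly 3000 rows mentioned later in the thread this stays tractable (on the order of rows × capacity updates), whereas a powerset enumeration like the one in the answer below grows as 2^n.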

1 Answer


As suggested in the comments this may be a knapsack problem, but here is an implementation of what I understood from your requirement:

Using powerset from itertools together with pd.concat.

from itertools import chain, combinations

import pandas as pd

def powerset(iterable):
    """powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"""
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Enumerate every subset of row labels, keep the subsets whose col1 total
# is within the limit, and tag each kept subset with its group number.
# (1500 is used here so the demo output stays small; the question's limit is 15000.)
df_groups = pd.concat([df.reindex(list(l)).assign(grp=n)
                       for n, l in enumerate(powerset(df.index))
                       if df.loc[list(l), 'col1'].sum() <= 1500])

print(df_groups)

Output:

   id  col1  col2  grp
0   0  1000   250    1
2   2  1500   350    3

Explanation:

We use the dataframe's index with the powerset function to generate every possible group of rows. enumerate gives each group a number, and assign stores that number in a new grp column. The result is the set of groups that satisfy the condition, i.e. those whose col1 values sum to at most the limit (1500 in this demo).
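Note that df_groups holds every feasible group, not yet the best one. A short continuation (assuming the df_groups built above) picks the group with the largest col2 total:

best_grp = df_groups.groupby('grp')['col2'].sum().idxmax()
print(df_groups[df_groups['grp'] == best_grp])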

Reference: stackoverflow.com/questions/58119575

Priya
  • Nice bruteforce answer, but you should probably mention that it becomes incredibly slow if there are too many rows in the dataset (how many rows?) – Stef Aug 25 '21 at 09:15
  • Yeah, this does seem to be a bit slow. I have a bit less than 3000 rows. – user9996043 Aug 25 '21 at 15:36
  • @Stef Yeah, it becomes very slow, but this is the only idea that came to my mind! Maybe it could help the OP derive some insight. – Priya Aug 25 '21 at 16:19