0

I have following data in CSV:

zipcode,name,type,counts
1234,alex,update,23
2134,harry,update,17
1135,tom,update,78
1136,helen,update,82
1175,jack,update,98
6987,ben,update,110
7123,peter,update,109
1126,jill,update,82
1195,parker,update,98
6587,edward,update,110
7013,dani,update,109
1013,jeff,update,88

What I am trying to achieve is set a batch number 1 to 5 to each row such that we will have near about equal sum of counts in each batch. What should be the optimal approach?

Result output:

zipcode,name,type,counts,batch
1234,alex,update,23,b1
2134,harry,update,17,b1
1135,tom,update,78,b1
1136,helen,update,82, b1
1175,jack,update,98, b2
6987,ben,update,110,b2
7123,peter,update,109,b3
1126,jill,update,82,b3
1195,parker,update,98,b4
6587,edward,update,110,b4
7013,dani,update,109,b5
1013,jeff,update,88,b5
martineau
  • 119,623
  • 25
  • 170
  • 301
Vivek Kumar
  • 53
  • 1
  • 5
  • Sorry, I see what you mean. Deleted the misleading comment. How about something like this? Keep a running total count per batch, which you update with every row you process. Add each row to the batch with the lowest running total count. That doesn't necessarily get you the 100% optimal solution, but is it good enough? – slothrop Jul 14 '22 at 16:57
  • If you want something with potentially better results, but slower: https://stackoverflow.com/questions/22264417/greedy-algorithm-for-dividing-a-region-of-numbers-into-partitions-with-the-most – slothrop Jul 14 '22 at 17:01

0 Answers0