I am trying to come up with a way to group together a list of [name, size] pairs. The idea is to group them so that each group's total size is as close as possible to a maximum_group_size, but without exceeding a maximum_group_length cutoff.
For example, if maximum_group_size = 10 and maximum_group_length = 3, then the following simple dataset could be grouped like so:
data = [['A', 3], ['B', 6], ['C', 7], ['D', 1]]
grouped: [[['A', 3], ['B', 6]], [['C', 7], ['D', 1]]]
But this data set would end up as:
data = [['A', 1], ['B', 1], ['C', 1], ['D', 1]]
grouped: [[['A', 1], ['B', 1], ['C', 1]], [['D', 1]]]
The input data are unsorted, but could easily be sorted first. My intuition was not to sort, because sorting seemed likely to produce the largest number of under-filled groups.
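For concreteness, the grouping described above can be sketched as a single greedy pass (group_pairs, max_size, and max_length are names invented for this sketch, not part of my code; note it lets a group sum exactly to max_size, whereas the >= test in my implementation below closes a bin just before num_bases is reached):

```python
def group_pairs(pairs, max_size=10, max_length=3):
    """Greedy sketch: keep appending pairs until the next one would
    push the running total past max_size or the group past max_length."""
    group, total = [], 0
    for name, size in pairs:
        # Close the current group if this pair would overshoot either cap.
        if group and (total + size > max_size or len(group) == max_length):
            yield group
            group, total = [], 0
        group.append([name, size])
        total += size
    if group:
        yield group  # final, possibly under-filled group
```

Running list(group_pairs(data)) on the two datasets above reproduces both expected groupings.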
How could I go about (prettily) doing this? My current implementation is not very elegant:
def group_references(sam_handle, num_bases=10 ** 7, max_seqs=100):
    """
    Group up references by num_bases, unless that exceeds max_seqs.
    """
    # sam_handle.references and sam_handle.lengths are parallel sequences
    name_iter = zip(sam_handle.references, sam_handle.lengths)
    name, size = next(name_iter)
    this_bin = [[name, size]]
    bin_base_count = size
    num_seqs = 1
    for name, size in name_iter:
        bin_base_count += size
        num_seqs += 1
        if bin_base_count >= num_bases or num_seqs > max_seqs:
            yield this_bin
            this_bin = [[name, size]]
            bin_base_count = size
            num_seqs = 1
        else:
            this_bin.append([name, size])
    yield this_bin  # emit the final, partially filled bin
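To sanity-check the generator without a real BAM file, a minimal stand-in object exposing references and lengths attributes is enough. SimpleNamespace, the small demo thresholds, and the group_refs_demo name are all invented for this sketch; the loop body is inlined (with a trailing yield so the last bin is kept) to make the snippet self-contained:

```python
from types import SimpleNamespace

def group_refs_demo(sam_handle, num_bases=10, max_seqs=3):
    """Same greedy loop as above, inlined for a standalone demo;
    the trailing yield keeps the final, partially filled bin."""
    name_iter = zip(sam_handle.references, sam_handle.lengths)
    name, size = next(name_iter)
    this_bin = [[name, size]]
    bin_base_count = size
    for name, size in name_iter:
        bin_base_count += size
        if bin_base_count >= num_bases or len(this_bin) + 1 > max_seqs:
            yield this_bin
            this_bin = [[name, size]]
            bin_base_count = size
        else:
            this_bin.append([name, size])
    yield this_bin

# Stand-in exposing the two attributes the function reads; a real
# pysam AlignmentFile provides the same parallel tuples.
handle = SimpleNamespace(references=('A', 'B', 'C', 'D'),
                         lengths=(3, 6, 7, 1))
print(list(group_refs_demo(handle)))
# -> [[['A', 3], ['B', 6]], [['C', 7], ['D', 1]]]
```

This reproduces the first example grouping with num_bases=10 and max_seqs=3 in place of the real defaults.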