I am trying to come up with a way to group together a list of [name, size] pairs. The idea is to group them so that each group's total size is as close as possible to a maximum_group_size, but without exceeding a maximum_group_length cutoff.
For example, if maximum_group_size = 10 and maximum_group_length = 3, then the following simple dataset could be grouped like so:
data = [['A', 3], ['B', 6], ['C', 7], ['D', 1]]
grouped: [[['A', 3], ['B', 6]], [['C', 7], ['D', 1]]]
But this data set would end up as:
data = [['A', 1], ['B', 1], ['C', 1], ['D', 1]]
grouped: [[['A', 1], ['B', 1], ['C', 1]], [['D', 1]]]
The input data are unsorted, but could easily be sorted first. My intuition was not to sort, because sorting seemed likely to produce the largest number of under-filled groups.
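For concreteness, the grouping described above can be sketched as a single greedy pass (group_pairs, max_size, and max_length are names invented for this sketch, not part of my code; note it lets a group sum exactly to max_size, whereas the >= test in my implementation below closes a bin just before num_bases is reached):

```python
def group_pairs(pairs, max_size=10, max_length=3):
    """Greedy sketch: keep appending pairs until the next one would
    push the running total past max_size or the group past max_length."""
    group, total = [], 0
    for name, size in pairs:
        # Close the current group if this pair would overshoot either cap.
        if group and (total + size > max_size or len(group) == max_length):
            yield group
            group, total = [], 0
        group.append([name, size])
        total += size
    if group:
        yield group  # final, possibly under-filled group
```

Running list(group_pairs(data)) on the two datasets above reproduces both expected groupings.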
How could I go about (prettily) doing this? My current implementation is not very elegant:
def group_references(sam_handle, num_bases=10 ** 7, max_seqs=100):
    """
    Group up references by num_bases, unless that exceeds max_seqs.
    """
    # sam_handle.references and sam_handle.lengths are parallel sequences
    name_iter = zip(sam_handle.references, sam_handle.lengths)
    name, size = next(name_iter)
    this_bin = [[name, size]]
    bin_base_count = size
    num_seqs = 1
    for name, size in name_iter:
        bin_base_count += size
        num_seqs += 1
        if bin_base_count >= num_bases or num_seqs > max_seqs:
            yield this_bin
            this_bin = [[name, size]]
            bin_base_count = size
            num_seqs = 1
        else:
            this_bin.append([name, size])
    yield this_bin  # emit the final, partially filled bin
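To sanity-check the generator without a real BAM file, a minimal stand-in object exposing references and lengths attributes is enough. SimpleNamespace, the small demo thresholds, and the group_refs_demo name are all invented for this sketch; the loop body is inlined (with a trailing yield so the last bin is kept) to make the snippet self-contained:

```python
from types import SimpleNamespace

def group_refs_demo(sam_handle, num_bases=10, max_seqs=3):
    """Same greedy loop as above, inlined for a standalone demo;
    the trailing yield keeps the final, partially filled bin."""
    name_iter = zip(sam_handle.references, sam_handle.lengths)
    name, size = next(name_iter)
    this_bin = [[name, size]]
    bin_base_count = size
    for name, size in name_iter:
        bin_base_count += size
        if bin_base_count >= num_bases or len(this_bin) + 1 > max_seqs:
            yield this_bin
            this_bin = [[name, size]]
            bin_base_count = size
        else:
            this_bin.append([name, size])
    yield this_bin

# Stand-in exposing the two attributes the function reads; a real
# pysam AlignmentFile provides the same parallel tuples.
handle = SimpleNamespace(references=('A', 'B', 'C', 'D'),
                         lengths=(3, 6, 7, 1))
print(list(group_refs_demo(handle)))
# -> [[['A', 3], ['B', 6]], [['C', 7], ['D', 1]]]
```

This reproduces the first example grouping with num_bases=10 and max_seqs=3 in place of the real defaults.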