
I'm trying to solve a problem with splitting a set into subsets.

The input data is a list and an integer. The task is to divide a set into N-element subsets whose sums of elements are almost equal. As this is an NP-hard problem, I tried two approaches: a) iterate over all possibilities and distribute them with mpi4py across many machines (for a list above 100 elements and 20-element subsets this takes too long); b) use mpi4py to send the list to nodes with different seeds, but in this case I potentially evaluate the same subsets many times. For instance, with 100 numbers and 5 subsets of 20 elements each, within 60 s my result could easily be beaten by a human simply looking at the table.

Finally, I'm looking for a heuristic algorithm that can run in a distributed system and create N-element subsets from a bigger set whose sums are almost equal.

a = list(range(1, 13))
k = 3

One possible solution:

[1,2,11,12] [3,4,9,10] [5,6,7,8] 

because the sums are 26, 26, 26.

It is not always possible to make the sums or the numbers of elements exactly equal. The difference between the maximum and minimum number of elements across the subsets can be 0 (if len(a)/k is an integer) or 1.
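For illustration, a minimal sketch of that size constraint (the names n, base, and extra are mine):

# Sketch: sizes of k subsets built from n elements differ by at most 1.
n, k = 13, 3                              # 13/3 is not an integer
base, extra = divmod(n, k)                # base=4, extra=1
sizes = [base + 1] * extra + [base] * (k - extra)
print(sizes)                              # [5, 4, 4] -> max - min == 1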

edit 1:

I'm investigating two options: 1. The parent generates all iterations and then sends them to the parallel algorithm (but this is too slow for me). 2. The parent sends the list, and each node generates its own subsets and calculates the subset sums within a restricted time, then sends its best result to the parent. The parent receives these results and chooses the one that minimizes the difference between the subset sums. I think the second option has the potential to be faster.
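Option 2 could look roughly like this with mpi4py (a rough sketch; the 10 s time budget, the rank-based seed, and the assumption that len(a) is divisible by k are illustrative choices, not a tested implementation):

# Each rank searches random partitions for a fixed time;
# the root keeps the partition with the smallest spread of subset sums.
import random
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

a = list(range(1, 13)) if rank == 0 else None
a = comm.bcast(a, root=0)          # parent sends the list to every node
k = 3
sub_len = len(a) // k              # assumes len(a) % k == 0 for brevity

random.seed(rank)                  # a different seed per node
best_spread, best_split = float('inf'), None
deadline = time.time() + 10        # restricted time per node (assumed 10 s)
while time.time() < deadline:
    shuffled = a[:]
    random.shuffle(shuffled)
    split = [shuffled[i * sub_len:(i + 1) * sub_len] for i in range(k)]
    sums = [sum(s) for s in split]
    spread = max(sums) - min(sums)
    if spread < best_spread:
        best_spread, best_split = spread, split

# parent receives the per-node bests and picks the overall winner
results = comm.gather((best_spread, best_split), root=0)
if rank == 0:
    print(min(results, key=lambda r: r[0]))

Run with e.g. mpiexec -n 4 python script.py; the per-rank seeds reduce (but do not eliminate) duplicated work across nodes.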

Best regards, Szczepan

    Possible duplicate of [How do you split a list into evenly sized chunks?](https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks) – code_dredd Jan 18 '19 at 06:39
  • This differs because my algorithm needs to run in a distributed system and ultimately minimizes the sum difference. As I wrote, I tried to a) iterate all possibilities and distribute them using mpi4py to many machines (a list above 100 elements with 20-element subsets runs too long) – Szczepan G Jan 18 '19 at 06:45
  • Your question is not clear to me. Are you trying to use a parallel algorithm to split your set, or do you need to split your set in order to send the subsets over to a parallel algorithm?.. My understanding is that you want the latter, hence the apparent duplicate. – code_dredd Jan 18 '19 at 06:47
  • The task is to split the set into N-element subsets with (almost) equal sums of elements. To make it faster I would like to parallelize the computation. – Szczepan G Jan 18 '19 at 06:50
  • That does not answer my question. – code_dredd Jan 18 '19 at 06:51
  • To be honest, I'm investigating two options: 1. The parent generates all iterations and then sends them to the parallel algorithm (but this is too slow for me). 2. The parent sends the list, and each node generates its own subsets and calculates the subset sums within a restricted time, then sends its best result to the parent. The parent receives these results and chooses the one that minimizes the difference between subset sums. I think the second option has the potential to be faster. – Szczepan G Jan 18 '19 at 07:00

1 Answer


I think you're trying to do something more complicated than necessary. Do you actually need an exact solution (a global optimum)? Regarding a heuristic solution, I had to do something along these lines in the past, so here's my take on it:

Reformulate the problem as follows: you have a vector with a given mean (the 'global mean') and you want to break it into chunks such that the mean of each individual chunk is as close as possible to the 'global mean'.

Just divide it into chunks randomly, then iteratively swap elements between the chunks until you get an acceptable result. You can experiment with different ways to do this; here I simply reshuffle the elements of the two chunks with the minimum and maximum 'chunk mean'.

In general, the bigger the chunks are, the easier it becomes, because the first random split already gives you a not-so-bad solution (think of sample means).

How big is your input list? I tested this with a 100000-element input (uniformly distributed integers). With 50 chunks of 2000 elements each you get the result instantly; with 2000 chunks of 50 elements each you need to wait <1 min.

import numpy as np

my_numbers = np.random.randint(10000, size=100000)
chunks = 50
iter_limit = 10000
desired_mean = my_numbers.mean()
acceptable_range = 0.1

# np.array_split returns a list of arrays whose lengths differ by at most 1
split = np.array_split(my_numbers, chunks)

for i in range(iter_limit):
    split_means = np.array([array.mean() for array in split])  # this can be optimized, some of the means are known
    current_min = split_means.min()
    current_max = split_means.max()
    mean_diff = split_means.ptp()
    if i % 100 == 0 or mean_diff <= acceptable_range:
        print("Iter: {}, Desired: {}, Min {}, Max {}, Range {}".format(i, desired_mean, current_min, current_max, mean_diff))
    if mean_diff <= acceptable_range:
        print('Acceptable solution found')
        break
    # merge the chunks with the lowest and highest mean;
    # pop the larger index first so the second index stays valid
    min_index = split_means.argmin()
    max_index = split_means.argmax()
    if max_index < min_index:
        merged = np.hstack((split.pop(min_index), split.pop(max_index)))
    else:
        merged = np.hstack((split.pop(max_index), split.pop(min_index)))
    reshuffle_range = mean_diff + 1
    while reshuffle_range > mean_diff:
        # this while just ensures that you're not getting a worse split, either the same or better
        np.random.shuffle(merged)
        modified_arrays = np.array_split(merged, 2)
        reshuffle_range = np.array([array.mean() for array in modified_arrays]).ptp()
    split += modified_arrays
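As a quick sanity check after the loop, you can print each chunk's size, sum, and mean (split is the list produced above):

for chunk in split:
    print(len(chunk), chunk.sum(), chunk.mean())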
pieca