Python code optimization for uneven subset generation

Question

I would like help optimizing some code. I have a list of, say, 26 elements:

indata = [0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42, 30, 16, 14, 85, 44, 89, 26, 0, 67, 67, 23, 0, 0]

A note on terminology: by "subset" I mean a part of the data, not the set data type. I'm looking for sub-lists.

I'm writing a function that performs further calculations on subsets of that list. The problem is that when the list length does not divide evenly, the same element sometimes ends up in two or more subsets of the same split. The subsets I'm looking for are:

subset 1 => raw data
subsets 2 - 3 => first and second half of the data
subsets 4 - 7 => first, second, third and fourth quarter of the data
subsets 8 - 15 => each successive 1/8 of the data

I came up with a rather long and sloppy solution inside the function body, which goes like this:

for i in iterate:
    if i == 0:
        subset = indata
    elif i == 1:
        subset = indata[0:int(len(indata)/2)]
    elif i == 2:
        subset = indata[int(len(indata)/2):]
    elif i == 3:
        subset = indata[0:int(len(indata)/4)]
    elif i == 4:
        subset = indata[int(len(indata)/4):int(round((len(indata)/4)*2,0))]
    elif i == 5:
        subset = indata[int(round((len(indata)/4)*2,0)):int(round((len(indata)/4)*3,0))]
    elif i == 6:
        subset = indata[int(round((len(indata)/4)*3,0)):]
    elif i == 7:
        subset = indata[0:int(len(indata)/8)]        
    elif i == 8:
        subset = indata[int(len(indata)/8):int(round((len(indata)/8)*2,0))]        
    elif i == 9:
        subset = indata[int(len(indata)/8)*2:int(round((len(indata)/8)*3,0))]        
    elif i == 10:
        subset = indata[int((len(indata)/8)*3+0.25):int(round((len(indata)/8)*4,0))]        
    elif i == 11:
        subset = indata[int((len(indata)/8)*4+0.25):int(round((len(indata)/8)*5,0))]        
    elif i == 12:
        subset = indata[int((len(indata)/8)*5+0.25):int(round((len(indata)/8)*6,0))]
    elif i == 13:
        subset = indata[int((len(indata)/8)*6+0.5):int(round((len(indata)/8)*7,0))]
    elif i == 14:
        subset = indata[int((len(indata)/8)*7+0.5):]
    else:
        subset = indata[int((len(indata)/8)*7+0.5):] 

- further instructions on the subset go here, then the loop repeats.

It does what it should (the 0.25 and 0.5 offsets are there to avoid putting the same element into two or more subsets when, say, the subset length is 3.25). However, there must be a better way to do this. I don't mind uneven subsets, for example two 7-element lists and two 6-element lists when dividing by 4, as long as no element appears in more than one of them.

Thank you for your help.

A first improvement would be to calculate a dictionary with keys: subset number, and values:(start_index, end_index), where the indices are calculated from the length of the indata — Stefan, Dec 14 '21 at 17:51
Maybe the itertools library can be used - there is a method called grouper, but I haven't used this myself — Stefan, Dec 14 '21 at 17:53
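
The dictionary idea from the first comment could look roughly like the sketch below. The function name `subset_bounds` and the `levels` parameter are made up for illustration, and the boundary formula `p * n // parts` is one possible choice rather than anything stated in the thread. Because both ends of every slice come from the same expression, the end of one piece is always the start of the next, so no element is shared and none is lost.

```python
def subset_bounds(n, levels=(1, 2, 4, 8)):
    """Map a 1-based subset number to a (start, end) slice pair.

    Hypothetical helper: n is the list length, levels are the numbers
    of pieces to split into (whole, halves, quarters, eighths).
    """
    bounds = {}
    number = 1
    for parts in levels:
        for p in range(parts):
            # Same integer expression for both ends, so adjacent
            # pieces share a boundary and never overlap.
            bounds[number] = (p * n // parts, (p + 1) * n // parts)
            number += 1
    return bounds


indata = [0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42, 30, 16, 14, 85,
          44, 89, 26, 0, 67, 67, 23, 0, 0]

bounds = subset_bounds(len(indata))
# Build the actual sub-lists from the precomputed index pairs.
subsets = {number: indata[start:end]
           for number, (start, end) in bounds.items()}
```

With this formula the piece sizes still differ by at most one, but the longer pieces are not necessarily the first ones (for 26 items in 4 parts the sizes come out as 6, 7, 6, 7).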

Alain T. · Accepted Answer · 2021-12-14T22:06:29.183

You can use a list comprehension to obtain these subsets:

indata = [0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42, 30, 16, 14, 85, 
         44, 89, 26, 0, 67, 67, 23, 0, 0]

subsets = [indata[p*size:(p+1)*size] 
           for parts in (1,2,4,8) 
           for size in [len(indata)//parts] 
           for p in range(parts)]

Output:

for i,subset in enumerate(subsets,1): print(i,subset)

1 [0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42, 30, 16, 14, 85, 44, 
   89, 26, 0, 67, 67, 23, 0, 0]

2 [0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42]
3 [30, 16, 14, 85, 44, 89, 26, 0, 67, 67, 23, 0, 0]

4 [0, 0, 50, 0, 32, 35]
5 [151, 163, 9, 1, 3, 3]
6 [42, 30, 16, 14, 85, 44]
7 [89, 26, 0, 67, 67, 23]

8 [0, 0, 50]
9 [0, 32, 35]
10 [151, 163, 9]
11 [1, 3, 3]
12 [42, 30, 16]
13 [14, 85, 44]
14 [89, 26, 0]
15 [67, 67, 23]

Note that this will drop items when the size of the list is not a multiple of the number of partitions (e.g. 26/4 and 26/8). There are several ways to handle this issue (more subsets, larger chunks, varying subset sizes to spread items evenly or randomly, add to the 1st subset, add to the last,...) but you would have to specify which one you want.

For example, this variant spreads the extra items over the first few sets (no more than 1 extra item per set):

subsets = [indata[p*size+min(p,spread):(p+1)*size+min(p+1,spread)]
           for parts in (1,2,4,8)
           for size,spread in [divmod(len(indata),parts)]
           for p in range(parts)]

for i,subset in enumerate(subsets,1): print(i,subset,len(subset))

1 [0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42, 30, 16, 14, 
   85, 44, 89, 26, 0, 67, 67, 23, 0, 0] 26

2 [0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42] 13
3 [30, 16, 14, 85, 44, 89, 26, 0, 67, 67, 23, 0, 0] 13

4 [0, 0, 50, 0, 32, 35, 151] 7
5 [163, 9, 1, 3, 3, 42, 30] 7
6 [16, 14, 85, 44, 89, 26] 6
7 [0, 67, 67, 23, 0, 0] 6

8 [0, 0, 50, 0] 4
9 [32, 35, 151, 163] 4
10 [9, 1, 3] 3
11 [3, 42, 30] 3
12 [16, 14, 85] 3
13 [44, 89, 26] 3
14 [0, 67, 67] 3
15 [23, 0, 0] 3
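
To see why the second comprehension never repeats or drops an item, it may help to unroll the index arithmetic into a loop. This is only a restatement of the same formula; the function name `split_spread` is invented here. `divmod` gives the base `size` and the number of leftover items `spread`, and `min(p, spread)` counts how many extra items have already been handed out before piece `p`.

```python
def split_spread(data, parts):
    """Split data into `parts` consecutive pieces; the first
    len(data) % parts pieces get one extra item each."""
    size, spread = divmod(len(data), parts)
    pieces = []
    for p in range(parts):
        # min(p, spread) extra items were given to earlier pieces,
        # so the start shifts right by that amount.
        start = p * size + min(p, spread)
        # This equals the start of piece p + 1, so pieces are adjacent.
        end = (p + 1) * size + min(p + 1, spread)
        pieces.append(data[start:end])
    return pieces


indata = [0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42, 30, 16, 14, 85,
          44, 89, 26, 0, 67, 67, 23, 0, 0]

for parts in (1, 2, 4, 8):
    pieces = split_spread(indata, parts)
    # Joining the pieces back together must reproduce the input exactly.
    assert [x for piece in pieces for x in piece] == indata
    # Piece lengths differ by at most one.
    sizes = [len(piece) for piece in pieces]
    assert max(sizes) - min(sizes) <= 1
```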

Thank you, the second solution with distributing the extra elements over first subsets is exactly what I was looking for! — Chris, Dec 15 '21 at 16:24

score 0 · Answer 2 · answered Dec 14 '21 at 19:21

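def divide_data(data, chunks):
    """Yield `chunks` consecutive slices of `data` whose lengths differ by at most one."""
    idx = 0
    # The first len(data) % chunks slices each get one extra item.
    sizes = [len(data) // chunks + int(x < len(data) % chunks) for x in range(chunks)]
    for size in sizes:
        yield data[idx:idx+size]
        idx += size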

data = list(range(26))  # or whatever, e.g. [0, 0, 50, ...]
for num_subsets in (1, 2, 4, 8):
    print(f'num subsets: {num_subsets}')
    for subset in divide_data(data, num_subsets):
        print(subset)

num subsets: 1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
num subsets: 2
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
[13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
num subsets: 4
[0, 1, 2, 3, 4, 5, 6]
[7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24, 25]
num subsets: 8
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9, 10]
[11, 12, 13]
[14, 15, 16]
[17, 18, 19]
[20, 21, 22]
[23, 24, 25]

Credit to this answer for inspiration

score 0 · Answer 3 · answered Dec 14 '21 at 20:49

You can use np.array_split with a list comprehension:

import numpy as np
sublists = [arr.tolist() for num in [1,2,4,8] for arr in np.array_split(np.array(indata), num)]

Output:

[[0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42, 30, 16, 14, 85, 44, 89, 26, 0, 67, 67, 23, 0, 0],
 [0, 0, 50, 0, 32, 35, 151, 163, 9, 1, 3, 3, 42],
 [30, 16, 14, 85, 44, 89, 26, 0, 67, 67, 23, 0, 0],
 [0, 0, 50, 0, 32, 35, 151],
 [163, 9, 1, 3, 3, 42, 30],
 [16, 14, 85, 44, 89, 26],
 [0, 67, 67, 23, 0, 0],
 [0, 0, 50, 0],
 [32, 35, 151, 163],
 [9, 1, 3],
 [3, 42, 30],
 [16, 14, 85],
 [44, 89, 26],
 [0, 67, 67],
 [23, 0, 0]]
