
I need to split the elements of the iterator tot = itertools.combinations(dict1.keys(), 2) into 3 parts.

The size of dict1 is 285056, so the total number of combinations is 285056 * 285055 / 2 ≈ 40.6 billion.

My goal is to divide these 40 billion combinations into 3 parts of roughly 13.5 billion elements each and process them in parallel on different processors. At the moment I am naively iterating over all 40 billion and dumping a pickle file each time I reach 13.5 billion elements, which isn't efficient: each 13.5-billion-element pickle is 160 GB on disk (and much larger when loaded into memory).

So is there any way to iterate up to the 13.5 billionth element in one script, have a second script start directly from the next element, and so on, without iterating from the beginning like I do now? Below is the code I use to take a fixed number of elements from the combinations iterable.

import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

for first_chunk in grouper(13_500_000_000, tot):
    ...  # process one chunk of ~13.5 billion combinations
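
For example (with small, made-up sizes), grouper chunks a combinations iterator like this:

import itertools

small = itertools.combinations(range(6), 2)   # 15 combinations in total
for chunk in grouper(4, small):
    print(chunk)
# ((0, 1), (0, 2), (0, 3), (0, 4))
# ((0, 5), (1, 2), (1, 3), (1, 4))
# ((1, 5), (2, 3), (2, 4), (2, 5))
# ((3, 4), (3, 5), (4, 5))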
skynaive
  • A better question might be: why do you want to store data that can so easily be generated on the fly? And apart from that: are you sure that you really need to generate and manipulate all these combinations? – Thierry Lathuille Dec 27 '21 at 08:52
  • In general, there is no way to "split up" an iterator like that. – juanpa.arrivillaga Dec 27 '21 at 08:54
  • @ThierryLathuille I do not want to store it, but I could not find a way to iterate the same iterator on 3 different machines at given offsets. Machine 1 starts at 0 and goes to 13.5 billion; machine 2 should start directly after 13.5 billion. – skynaive Dec 27 '21 at 09:04

2 Answers


This kind of split is easy to create with itertools. Given the set of elements, we can test whether the first element of a generated combination belongs to the computation for machine i.

In the code below, I show a crude solution for this, with the code in the final for loop intended to be split over 3 machines. Machine i runs the code for the i-th segment of the keys as the first element of the combination, combined with the full key set for the second element.

The combinations are meant to be processed in the line where cnt2 is calculated; replace that with whatever loop you use to process your combinations.

Compared with generating and storing all combinations, this solution stores none, but each machine will (internally) still generate all of them. But what's a couple of billion combinations between friends?

import itertools

def is_not_for_machine(i, t):
    """Return False if combination t is assigned to machine i,
    i.e. if its first element falls in the range my_set_prefix[i]."""
    if my_set_prefix[i][0] <= t[0] < my_set_prefix[i][1]:
        return False
    return True

my_keys = range(12)  # stand-in for dict1.keys()

# Partition the key indices into 3 roughly equal segments.
my_set_prefix = []
for i in range(3):
    my_set_prefix.append((len(my_keys)*i//3, len(my_keys)*(i+1)//3))
print(f"== partition: {my_set_prefix}")

all_combos = itertools.combinations(my_keys, 2)
cnt = sum(1 for _ in all_combos)
print(f"== total set size {cnt}")

for i in range(3):
    all_combos = itertools.combinations(my_keys, 2)
    cnt2 = sum(1 for _ in itertools.filterfalse(
        lambda t: is_not_for_machine(i, t), all_combos))
    print(f"== set size for prefix {my_set_prefix[i]}: {cnt2}")

The output shows that some load balancing might be necessary, since this partition is "triangular descending": the first segment gets the largest share of the combinations and the last segment the smallest. A more evenly balanced striped alternative is sketched after the output.

== partition: [(0, 4), (4, 8), (8, 12)]
== total set size 66
== set size for prefix (0, 4): 38
== set size for prefix (4, 8): 22
== set size for prefix (8, 12): 6
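
For example, a striped partition that assigns a combination to machine i when its first element satisfies t[0] % 3 == i spreads the work more evenly; a minimal sketch with the same toy keys:

import itertools

my_keys = range(12)

# Striped split: machine i takes combinations whose first
# element is congruent to i modulo 3.
for i in range(3):
    combos = itertools.combinations(my_keys, 2)
    cnt = sum(1 for t in combos if t[0] % 3 == i)
    print(f"== set size for stripe {i}: {cnt}")
# == set size for stripe 0: 26
# == set size for stripe 1: 22
# == set size for stripe 2: 18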
VirtualScooter

Why not directly use the math.comb function to get the number of combinations?

Just go to the question there.
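
For the sizes in the question, a minimal example (math.comb requires Python 3.8+):

import math

# Number of unordered pairs drawn from 285056 dictionary keys
print(math.comb(285056, 2))  # 40628319040, roughly 40.6 billion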

Fang WU