
Is switching to C an adequate approach to get the combinations faster than in Python, or am I on the wrong track? I only "speak" Python and hope for some guidance to decide on the next programming steps for my self-chosen learning project. I am working on a Data Science project, and based on your answers I will either recommend inviting a computer scientist to the project or drop my approach.

I have a list of 69 strings from which I need all possible combinations of 8 elements. At the moment I can do this in Python with itertools.combinations():

for i in itertools.combinations(DictAthleteObjects.keys(), 8):
    # do stuff here on instances of classes

In Python, itertools.combinations works perfectly fine for a few combinations, but due to the huge number of combinations it is not time-efficient, and it sometimes crashes (I think because of too little memory) when I don't break the loop after a few iterations. Generally, the time complexity is very large.
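For scale, a quick check with math.comb (available since Python 3.8) shows how big the search space is:

# How many 8-element combinations do 69 items produce?
import math
print(math.comb(69, 8))  # 8361453672 -- roughly 8.4 billion combinations

That count explains both the runtime and the memory pressure if the results are ever collected into a list rather than consumed one at a time.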

According to this StackOverflow discussion, it could be a valid approach to generate the combinations in C, and also to move the rest of the Python code to C, because it's much faster.

On the other hand, I have received a comment that itertools.combinations is itself implemented in C. But I cannot find any sources on that.

  • You can find the C source at [itertoolsmodule.c](https://github.com/python/cpython/blob/main/Modules/itertoolsmodule.c). It's in C, but does quite a bit of work with the Python C API. You could get a little better speed because your code doesn't have to be as generic, but I doubt it would save enough to be worth your while. – tdelaney Aug 30 '22 at 22:30
  • You could check out [cython](https://cython.org) that can accelerate python code and also has a python-like syntax for even more accelerated code. – tdelaney Aug 30 '22 at 22:32
  • "do stuff here" may be the more interesting thing to focus on. If the inputs and outputs are relatively small and the processing is relatively large, you could spawn multiple processes for the work. – tdelaney Aug 30 '22 at 22:33
  • @tdelaney Thank you sooo much for your answers after a frustrating evening and night. "do stuff here" is simply a call of a method and storing a placement. I want to simulate athletes racing against each other in groups of 8. So from your feedback I understand that it could be a good approach to move the code to another language and/or use multiple processes. I imagined something like this and really needed this confirmation and guidance, because I am far, far away from being a computer scientist. – TheArmbreaker Aug 30 '22 at 22:38
  • Profile code before attempting to optimize it. Python provides easy-to-use [profilers](https://docs.python.org/3/library/profile.html) which provide information about where the code is spending time. – Michael Ruth Aug 30 '22 at 22:39
  • Often you shouldn't use brute force like that at all but some better algorithm. Impossible to tell here without knowing the task details. – Kelly Bundy Aug 30 '22 at 23:09
  • Actually, *that* is rather what the discussion you linked to shows. The C version wasn't that much faster because it did the work faster but because it didn't do the work. Instead it got the result in a different way. – Kelly Bundy Aug 30 '22 at 23:17

1 Answer


The comments you've received so far basically answer your question (profiling, minor gains for lots of C hassle, code redesign), but in the past handful of days I went through a similar dilemma on a home project and wanted to throw in my thoughts.

On the profiling front, I just used Python's time module and a global start-time variable to get basic benchmarks as my program ran. For highly complex scenarios I recommend using one of the Python profilers mentioned in the comments instead.

import time

start_time = time.process_time()
# ... the code being measured ...
print(f'runtime(sec): {time.process_time() - start_time}')
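If you want more than ad-hoc timing, here is a minimal sketch of the standard-library cProfile (one of the profilers linked in the comments); the work() function is just a hypothetical stand-in for your own code:

# Minimal cProfile usage; work() is a placeholder workload
import cProfile

def work():
    return sum(i * i for i in range(10**6))

cProfile.run('work()')  # prints call counts and cumulative time per function

The output breaks runtime down per function call, which makes it much easier to see where the expensive subtasks actually are.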

This allowed me to get the 10,000-ft view of how long my code was taking to do various things. Then I found a workable size of input data that didn't take too long to run but was a good representation of the larger dataset, and tried to make incremental improvements.

After a lot of messing around I figured out which of the required operations were the most expensive, and at the top of the list was generating those unique combinations. So what I ended up doing was splitting things up into a pipeline of sorts, which allowed me to cut the total runtime down to the time it takes to perform the most expensive piece of work.

In the case of itertools.combinations, it doesn't actually generate the output values up front, so it returns insanely quickly; things only get interesting when the for loop actually produces those values. On my machine it took about 3 ms to return the iterator that would produce ~31.2B combinations if I looped over it.

# Code to check how long itertools.combinations() takes to return its iterator
import itertools
import math
import time

data = list(range(250000))

for num_items in range(2, 9):
    start_time = time.process_time()
    g = itertools.combinations(data, num_items)
    ncombos = math.comb(len(data), num_items)  # exact count for this combo size
    print(f'combo_sz:{num_items} num_combos:{ncombos} elapsed(sec):{time.process_time() - start_time}')
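Creating the iterator is nearly free; consuming it is the real cost. As a rough illustration (my own addition, using itertools.islice to cap the work), timing the consumption of just the first million pairs looks like this:

# Timing consumption rather than creation of the iterator
import itertools
import time

data = list(range(250000))
start_time = time.process_time()
for combo in itertools.islice(itertools.combinations(data, 2), 1_000_000):
    pass  # producing values is where the time actually goes
print(f'first 1M pairs consumed in {time.process_time() - start_time:.3f}s')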

In my case I couldn't find a way to nicely split a generator into parts, so I decided to use the multiprocessing module (Process, Queue, Lock) to hand off data as it came in, which also saves big on memory; a rough sketch of that pattern is below. In summary, it was really helpful to look at things from the subtask perspective, because each subtask may require something different.
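Here is a minimal sketch of that Process/Queue producer-consumer pattern under toy assumptions: sum(combo) stands in for the real per-combination work, and the worker count of 4 is arbitrary:

# Stream combinations to worker processes instead of materializing them
import itertools
import math
from multiprocessing import Process, Queue

SENTINEL = None  # marks the end of the work stream

def worker(work_q, result_q):
    while True:
        combo = work_q.get()
        if combo is SENTINEL:
            break
        result_q.put(sum(combo))  # placeholder for the real per-combination work

if __name__ == '__main__':
    work_q = Queue(maxsize=10_000)  # bounded queue keeps memory usage flat
    result_q = Queue()
    workers = [Process(target=worker, args=(work_q, result_q)) for _ in range(4)]
    for p in workers:
        p.start()

    n = math.comb(20, 8)  # 125970 combinations in this toy example
    for combo in itertools.combinations(range(20), 8):
        work_q.put(combo)  # stream combinations; never build a full list
    for _ in workers:
        work_q.put(SENTINEL)

    total = sum(result_q.get() for _ in range(n))  # drain results before joining
    for p in workers:
        p.join()
    print(f'processed {n} combinations, checksum={total}')

The bounded work queue is the key design choice: the producer blocks when workers fall behind, so memory stays flat no matter how many combinations the iterator can produce.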

Also, don't be like me and skim the documentation too quickly, haha; many problems can be resolved by reading that stuff. I hope you found something useful in this reply. Good luck!

Jeeshe