tl;dr
You're asking for a miracle. No programming language can help you out of this one. Use better approaches than what you're considering doing!
Your Solution has Exponential Time Complexity
Let's assume a perfect algorithm: one that can give you a new partition in constant O(1) time. In other words, no matter what the input, a new partition can be generated in a guaranteed constant amount of time.
Let's in fact go one step further and assume that your algorithm is only CPU-bound and is operating under ideal conditions. Under ideal circumstances, a high-end CPU can process upwards of 100 billion instructions per second. Since this algorithm takes O(1) time, we'll say, oh, that every new partition is generated in a hundred billionth of a second. So far so good?
Now you want this to perform well. You say you want this to be able to handle an input of size m. You know that that means you need about pow(2,m)
iterations of your algorithm - that's the number of partitions you need to generate, and since generating each algorithm takes a finite amount of time O(1), the total time is just pow(2,m)
times O(1). Let's take a quick look at the numbers here:
m = 20 means your time taken is pow(2,20)
*10^-11 seconds = 0.00001 seconds. Not bad.
m = 40 means your time taken is pow(2,40)
10-11 seconds = 1 trillion/100 billion = 10 seconds. Also not bad, but note how small m = 40 is. In the vast panopticon of numbers, 40 is nothing. And remember we're assuming ideal conditions.
m = 100 means 10^41 seconds! What happened?
You're a victim of algorithmic theory. Simply put, a solution that has exponential time complexity - any solution that takes 2^m time to complete - cannot be sped up by better programming. Generating or producing pow(2,m)
outputs is always going to take you the same proportion of time.
Note further that 100 billion instructions/sec is an ideal for high-end desktop computers - your CPU also has to worry about processes other than this program you're running, in which case kernel interrupts and context switches eat into processing time (especially when you're running a few thousand system processes, which you no doubt are). Your CPU also has to read and write from disk, which is I/O bound and takes a lot longer than you think. Interpreted languages like Python also eat into processing time since each line is dynamically converted to bytecode, forcing additional resources to be devoted to that. You can benchmark your code right now and I can pretty much guarantee your numbers will be way higher than the simplistic calculations I provide above. Even worse: storing 2^40 permutations requires 1000 GBs of memory. Do you have that much to spare? :)
Switching to a lower-level language, using generators, etc. is all a pointless affair: they're not the main bottleneck, which is simply the large and unreasonable time complexity of your brute force approach of generating all partitions.
What You Can Do Instead
Use a better algorithm. Generating pow(2,m)
partitions and investigating all of them is an unrealistic ambition. You want, instead, to consider a dynamic programming approach. Instead of walking through the entire space of possible partitions, you want to only consider walking through a reduced space of optimal solutions only. That is what dynamic programming does for you. An example of it at work in a problem similar to this one: unique integer partitioning.
Dynamic programming problems approaches work best on problems that can be formulated as linearized directed acyclic graphs (Google it if not sure what I mean!).
If a dynamic approach is out, consider investing in parallel processing with a GPU instead. Your computer already has a GPU - it's what your system uses to render graphics - and GPUs are built to be able to perform large numbers of calculations in parallel. A parallel calculation is one in which different workers can do different parts of the same calculation at the same time - the net result can then be joined back together at the end. If you can figure out a way to break this into a series of parallel calculations - and I think there is good reason to suggest you can - there are good tools for GPU interfacing in Python.
Other Tips
Be very explicit on what you mean by best. If you can provide more information on what best means, we folks on Stack Overflow might be of more assistance and write such an algorithm for you.
Using a bare-metal compiled language might help reduce the amount of real time your solution takes in ordinary situations, but the difference in this case is going to be marginal. Compiled languages are useful when you have to do operations like searching through an array efficiently, since there's no instruction-compilation overhead at each iteration. They're not all that more useful when it comes to generating new partitions, because that's not something that removing the dynamic bytecode generation barrier actually affects.