
Each sample is an array of integer features. I need to split my samples into two separate groups by finding the best feature, and the best splitting value for that feature. By "best", I mean the split that gives me the greatest entropy drop between the pre-split set and the weighted average of the entropies of the left and right sides. To find it I need to try all (2^m−2)/2 possible ways of partitioning the distinct values into two nonempty lists, where m is the number of distinct values for the feature (all samples with the same value for that feature are moved together as a group).
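
For concreteness, the quantity I'm maximising for a candidate split is roughly the following (the helper names here are just illustrative; labels are 0/1):

    import math

    def entropy(n_zero, n_one):
        # Shannon entropy of a set, given how many 0- and 1-labelled samples it holds.
        total = n_zero + n_one
        if total == 0:
            return 0.0
        result = 0.0
        for count in (n_zero, n_one):
            if count:
                p = count / total
                result -= p * math.log2(p)
        return result

    def entropy_drop(left_zero, left_one, right_zero, right_one):
        # Pre-split entropy minus the weighted average of the left/right entropies.
        total = left_zero + left_one + right_zero + right_one
        parent = entropy(left_zero + right_zero, left_one + right_one)
        left_w = (left_zero + left_one) / total
        right_w = (right_zero + right_one) / total
        return (parent
                - left_w * entropy(left_zero, left_one)
                - right_w * entropy(right_zero, right_one))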

The following is extremely slow, so I need a more reasonable/faster way of doing this.

sorted_by_feature is a list of (feature_value, 0_or_1) tuples.

    import itertools

    same_vals = {}
    for ele in sorted_by_feature:
        if ele[0] not in same_vals:
            same_vals[ele[0]] = [ele]
        else:
            same_vals[ele[0]].append(ele)

    l = same_vals.keys()
    orderings = list(itertools.permutations(l))
    for ordering in orderings:

        list_tups = []
        for dic_key in ordering:
            list_tups += same_vals[dic_key]

        left_1 = 0
        left_0 = 0
        right_1 = num_one   # num_one: total count of 1-labelled samples (computed earlier)
        right_0 = num_zero  # num_zero: total count of 0-labelled samples (computed earlier)

        for index, tup in enumerate(list_tups):
            # 0's or 1's on the left +/- 1,
            # calculate entropy on left/right, calculate entropy drop, etc.

Trivial details (continuing the code above):

            if index == len(sorted_by_feature) - 1:
                break
            if tup[1] == 1:
                left_1 += 1
                right_1 -= 1

            if tup[1] == 0:
                left_0 += 1
                right_0 -= 1

            # only calculate entropy if values to left and right of split are different
            if list_tups[index][0] != list_tups[index + 1][0]:
Jobs
    @AkshatMahajan I'm using itertools.permutations to get all permutations of the values. – Jobs Apr 08 '16 at 18:35
  • @ScottHunter m, the number of distinct values, ranges from 2 to ~40 – Jobs Apr 08 '16 at 18:36
  • @Jobs Fair enough. Note that getting all O(2^m) partitions is (obviously) exponential in space. It translates to exponential time complexity: even if getting a single partition took O(1) time, it'd still have to run for 2^m iterations, making it O(2^m). Note that this implies (for m = 40) 1 trillion operations. Don't know how fast your computer is, but even with 1 billion instructions/sec, it'll take 16 minutes. – Akshat Mahajan Apr 08 '16 at 18:37
  • @Jobs: Simply put: a bad algorithm can't ever be offset by good programming constructs. Are you absolutely sure you need to iterate through _every_ possible combination? There aren't any optimisations you can take to improve it? – Akshat Mahajan Apr 08 '16 at 18:39
  • How do you determine "best splitting value"? Is there a threshold between "good enough splitting value" and "best splitting value" where you could stop searching when you find one that is good enough, and then focus on trying to optimize which you check first, such as with a greedy-walk or similar where you pick the best next discrete value? – Matt Jordan Apr 08 '16 at 18:45
  • You'll end up with a dynamic programming solution. I mean I have no idea what you're doing, but with such a structure that's pretty much a guarantee. You should find enough under that keyword – Voo Apr 08 '16 at 18:56

2 Answers


tl;dr

You're asking for a miracle. No programming language can help you out of this one. Use a better approach than the one you're considering!

Your Solution has Exponential Time Complexity

Let's assume a perfect algorithm: one that can give you a new partition in constant O(1) time. In other words, no matter what the input, a new partition can be generated in a guaranteed constant amount of time.

Let's in fact go one step further and assume that your algorithm is only CPU-bound and is operating under ideal conditions. Under ideal circumstances, a high-end CPU can process upwards of 100 billion instructions per second. Since this algorithm takes O(1) time, we'll say, oh, that every new partition is generated in a hundred billionth of a second. So far so good?

Now you want this to perform well. You say you want this to be able to handle an input of size m. You know that means you need about pow(2,m) iterations of your algorithm - that's the number of partitions you need to generate - and since generating each partition takes a constant amount of time O(1), the total time is just pow(2,m) times O(1). Let's take a quick look at the numbers here:

  • m = 20 means your time taken is pow(2,20)*10^-11 seconds = 0.00001 seconds. Not bad.

  • m = 40 means your time taken is pow(2,40)*10^-11 seconds = about 1 trillion / 100 billion = roughly 10 seconds. Also not bad, but note how small m = 40 is. In the vast panopticon of numbers, 40 is nothing. And remember we're assuming ideal conditions.

  • m = 100 means pow(2,100)*10^-11 seconds, which is on the order of 10^19 seconds (hundreds of billions of years)! What happened?
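
If you want to reproduce these figures, here is a quick sketch that assumes nothing beyond the 10^-11 seconds-per-partition figure above:

    # Back-of-envelope timing: 2**m partitions at an (optimistic) 1e-11 seconds each.
    for m in (20, 40, 100):
        seconds = (2 ** m) * 1e-11
        print("m = {}: about {:.3g} seconds".format(m, seconds))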

You're a victim of algorithmic theory. Simply put, a solution that has exponential time complexity - any solution that takes 2^m time to complete - cannot be sped up by better programming. Generating pow(2,m) outputs is always going to take time proportional to pow(2,m), no matter how cheaply each individual output is produced.

Note further that 100 billion instructions/sec is an ideal for high-end desktop computers - your CPU also has to worry about processes other than this program you're running, in which case kernel interrupts and context switches eat into processing time (especially when you're running a few thousand system processes, which you no doubt are). Your CPU also has to read and write from disk, which is I/O bound and takes a lot longer than you think. Interpreted languages like Python also eat into processing time since each line is dynamically converted to bytecode, forcing additional resources to be devoted to that. You can benchmark your code right now and I can pretty much guarantee your numbers will be way higher than the simplistic calculations I provide above. Even worse: storing 2^40 permutations requires 1000 GBs of memory. Do you have that much to spare? :)

Switching to a lower-level language, using generators, etc. is all a pointless affair: they're not the main bottleneck, which is simply the large and unreasonable time complexity of your brute force approach of generating all partitions.

What You Can Do Instead

  • Use a better algorithm. Generating pow(2,m) partitions and investigating all of them is an unrealistic ambition. You want, instead, to consider a dynamic programming approach. Instead of walking through the entire space of possible partitions, you want to consider only a reduced space of optimal solutions. That is what dynamic programming does for you. An example of it at work in a problem similar to this one: unique integer partitioning (a small sketch follows this list). Dynamic programming approaches work best on problems that can be formulated as linearized directed acyclic graphs (Google it if not sure what I mean!).

  • If a dynamic approach is out, consider investing in parallel processing with a GPU instead. Your computer already has a GPU - it's what your system uses to render graphics - and GPUs are built to be able to perform large numbers of calculations in parallel. A parallel calculation is one in which different workers can do different parts of the same calculation at the same time - the net result can then be joined back together at the end. If you can figure out a way to break this into a series of parallel calculations - and I think there is good reason to suggest you can - there are good tools for GPU interfacing in Python.
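
To make the dynamic programming idea concrete, here is a minimal sketch of the integer-partitioning example mentioned above - counting the ways to write n as a sum of distinct positive integers. The function name is just illustrative; the point is that subproblem results are stored in a table and reused rather than re-enumerated:

    def count_distinct_partitions(n):
        # ways[s] = number of ways to write s as a sum of the distinct parts
        # considered so far; reusing these counts is the dynamic programming step.
        ways = [0] * (n + 1)
        ways[0] = 1  # the empty sum
        for part in range(1, n + 1):
            # walk s downwards so each part is used at most once
            for s in range(n, part - 1, -1):
                ways[s] += ways[s - part]
        return ways[n]

    print(count_distinct_partitions(10))  # 10 (e.g. 10, 9+1, 8+2, ..., 4+3+2+1)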

Other Tips

  • Be very explicit on what you mean by best. If you can provide more information on what best means, we folks on Stack Overflow might be of more assistance and write such an algorithm for you.

  • Using a bare-metal compiled language might help reduce the amount of real time your solution takes in ordinary situations, but the difference in this case is going to be marginal. Compiled languages are useful when you have to do operations like searching through an array efficiently, since there's no instruction-compilation overhead at each iteration. They're not much more useful when it comes to generating new partitions, because removing the dynamic bytecode generation barrier doesn't change how many partitions you have to generate.

Akshat Mahajan

A couple of minor improvements I can see:

  1. Use try/except instead of if not in to avoid a double lookup of keys

    if ele[0] not in same_vals:
        same_vals[ele[0]] = [ele]
    else:
        same_vals[ele[0]].append(ele)
    # Should be changed to
    try:
        same_vals[ele[0]].append(ele) # Most of the time this will work
    except KeyError:
        same_vals[ele[0]] = [ele]
    
  2. Don't explicitly convert a generator to a list if you don't have to. I don't immediately see any need for the cast to a list, which only slows things down

    orderings = list(itertools.permutations(l))
    for ordering in orderings:
    # Should be changed to 
    for ordering in itertools.permutations(l):
    

But, like I said, these are only minor improvements. If you really needed this to be faster, consider using a different language.
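
Putting both suggestions together - this is just a sketch reusing the question's variable names, and collections.defaultdict is one more standard way to avoid the double lookup:

    import itertools
    from collections import defaultdict

    # Group the (feature_value, label) tuples by feature value; defaultdict removes
    # the need to check whether the key exists first.
    same_vals = defaultdict(list)
    for ele in sorted_by_feature:
        same_vals[ele[0]].append(ele)

    # Iterate over the permutations lazily instead of materialising them all in a list.
    for ordering in itertools.permutations(same_vals.keys()):
        list_tups = []
        for dic_key in ordering:
            list_tups += same_vals[dic_key]
        # ... the entropy bookkeeping from the question goes here ...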

Marc J