
So I'm given a large collection (roughly 200k) of lists. Each contains a subset of the numbers 0 through 27. I want to return two of the lists that have no numbers in common and whose product of lengths is greater than that of any other such disjoint pair.
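
(As a made-up toy example: given [[0, 1, 2], [3, 4], [1, 3], [5]], the answer would be [0, 1, 2] and [3, 4], since they share no numbers and 3 * 2 = 6 beats the product of any other disjoint pair.)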

There's an algorithm I found for this (can't remember the source, apologies for non-specificity of props) which exploits the fact that there are fewer total subsets of the numbers 0 through 27 than there are pairs of words in the dictionary.

The first thing I've done is loop through all the lists, find the unique set of integers each one contains, and index it as a number between 0 and (1<<28)-1. As follows:

def index_lists(lists):
    index_hash = {}
    for raw_list in lists:
        length = len(raw_list)
        index = find_index(raw_list)

        if length > index_hash.get(index, {}).get("length", 0):
            index_hash[index] = {"list": raw_list, "length": length}

    return index_hash
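
(find_index isn't shown above; a minimal sketch of that helper, assuming the one-bit-per-number encoding I'm describing, would be:

def find_index(raw_list):
    # Set bit n for each number n in the list, e.g. [0, 2, 3] -> 0b1101 == 13.
    index = 0
    for n in raw_list:
        index |= 1 << n
    return index
)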

This gives me the longest list and the length of that list for each subset that actually occurs in the given collection of lists. Naturally, not all subsets from 0 to (1<<28)-1 are necessarily included, since there's no guarantee the supplied collection has a list corresponding to each unique subset.

What I then want, for each subset 0 through (1<<28)-1 (all of them this time), is the longest list whose elements are contained within that subset. This is the part that is killing me. At a high level it should, for each subset, first check whether that subset is present in index_hash. It should then compare the length of that entry (if it exists) against the lengths stored previously in the new hash for the current subset with one number removed (this is an inner loop over the 28 bit positions). The greatest of these is stored in the new hash for the current subset of the outer loop. The code right now looks like this:

def at_most_hash(index_hash):
    most_hash = {}
    for i in xrange(1<<28):  # pretty sure this is a bad idea
        max_entry = index_hash.get(i)
        if max_entry:
            max_length = max_entry["length"]
            max_list = max_entry["list"]
        else:
            max_length = 0
            max_list = []
        for j in xrange(28):  # again, probably not great
            subset_index = i & ~(1<<j)  # gets us a pre-computed subset
            at_most_entry = most_hash.get(subset_index, {})
            at_most_length = at_most_entry.get("length", 0)
            if at_most_length > max_length:
                max_length = at_most_length
                max_list = at_most_entry["list"]
        most_hash[i] = {"length": max_length, "list": max_list}
    return most_hash

This loop obviously takes several forevers to complete. I'm new enough to Python that my choice of how to iterate and what data structures to use may have been completely disastrous, not to mention the prospective memory problems from attempting to fill the dictionary. Is there a better data structure or package to use here? A better way to set up the iteration? Or maybe a way to do this more sparsely?

The next part of the algorithm just cycles through all the lists we were given, looks up the max length for each list's subset and for its complementary subset in at_most_hash, takes the product of the two, and keeps the maximum over all lists.
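
A minimal sketch of that last pass, assuming the find_index helper and the hash returned by at_most_hash above, would be something like:

def best_pair(lists, most_hash):
    full_set = (1 << 28) - 1
    best_product, best_pairing = 0, None
    for raw_list in lists:
        index = find_index(raw_list)
        complement = full_set & ~index        # the numbers this list doesn't use
        this_side = most_hash[index]
        other_side = most_hash[complement]    # longest list disjoint from this one
        product = this_side["length"] * other_side["length"]
        if product > best_product:
            best_product = product
            best_pairing = (this_side["list"], other_side["list"])
    return best_pairing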

Any suggestions here? I appreciate the patience for wading through my long-winded question and less than decent attempt at coding this up.

In theory, this is still a better approach than working with the collection of lists alone, since comparing all pairs is roughly O(200k^2), about 4*10^10 operations, while this one is roughly O(28*2^28 + 200k), about 7.5*10^9, yet my implementation is holding me back.

  • I think this post is better suited to [code-review]. – Julien Aug 10 '16 at 04:02
  • I'm very confused by your second sentence: "Each contains the numbers 0 through 27." Do you really mean that they contain a subset of those numbers? Otherwise the whole rest of the question makes no sense since every list has the same numbers in it and the same length. – Blckknght Aug 10 '16 at 04:04
  • @JulienBernu: ah, is that a separate stack site? Apologies. – AsinglePANCAKE Aug 10 '16 at 04:05
  • @Blckknght: yes, i meant subset. I'll edit. – AsinglePANCAKE Aug 10 '16 at 04:05
  • @AsinglePANCAKE no worries. yes: http://codereview.stackexchange.com/ – Julien Aug 10 '16 at 04:07
  • This algorithm doesn't really sound like an improvement over a brute-force consideration of all possible pairs. – user2357112 Aug 10 '16 at 04:10
  • @user2357112: I'm pretty sure it is. If the initial collection is short it's a dumb algorithm, but for large collections it's much better. All possible pairs would be n^2, and for n=200,000, that's a lot worse than 2^28 or so. We can shorten the set of prospective list elements from 0,...,27 to 0,...,15 if that makes it seem like the question is more worth answering. – AsinglePANCAKE Aug 10 '16 at 04:13
  • I would recommend using a profiler to figure out what part is the bottleneck and if optimizing things is worth it. – hugomg Aug 10 '16 at 04:16
  • @AsinglePANCAKE: The work scales a lot worse than just 2 to the size of the alphabet, though. In particular, that `for j in xrange(28)` loop is wrong, and the correct version is much slower. (That, or the correct version is about equally slow, but you're setting up `index_hash` wrong and the right way to do *that* is much slower.) – user2357112 Aug 10 '16 at 04:16
  • @user2357112: care to elaborate? – AsinglePANCAKE Aug 10 '16 at 04:19
  • @AsinglePANCAKE: I think I see what you're doing now; `at_most_hash` is intended to be a preprocessing phase that bubbles up lengths one level at a time so that `most_hash[i]` holds the longest list whose elements are a subset of the set represented by `i`. The description was throwing me off, as well as your earlier question I answered about enumerating subsets of a set represented this way. The bug I was looking at is on the `i & ~j` line, and if I'm interpreting this right, the correction is `i & ~(1<<j)`. – user2357112 Aug 10 '16 at 04:31
  • @user2357112: yeah, that's right. As similar as some of the logic is being used, the questions were actually unrelated! Also, thanks for catching that. – AsinglePANCAKE Aug 10 '16 at 04:33
  • The complexity is still actually not that big an improvement over all-pairs brute force, unfortunately. – user2357112 Aug 10 '16 at 04:49
  • @user2357112: can you ground that in some analysis? I'm pretty sure that this scales well as the size of the collection goes up, but sucks for smaller collections. – AsinglePANCAKE Aug 10 '16 at 04:53
  • @AsinglePANCAKE: It does scale well as the size of the collection goes up, but 200k is actually pretty close to the "smaller collections" borderline. – user2357112 Aug 10 '16 at 04:59

2 Answers


Given that your indexes are just ints, you could save some time and space by using lists instead of dicts. I'd go further and bring in NumPy arrays. They offer compact storage representation and efficient operations that let you implicitly perform repetitive work in C, bypassing a ton of interpreter overhead.

Instead of index_hash, we start by building a NumPy array where index_array[i] is the length of the longest list whose set of elements is represented by i, or 0 if there is no such list:

import numpy

index_array = numpy.zeros(1<<28, dtype=int)  # We could probably get away with dtype=int8.
for raw_list in lists:
    i = find_index(raw_list)
    index_array[i] = max(index_array[i], len(raw_list))

We then use NumPy operations to bubble up the lengths in C instead of interpreted Python. Things might get confusing from here:

for bit_index in xrange(28):
    index_array = index_array.reshape([1<<(28-bit_index), 1<<bit_index])
    numpy.maximum(index_array[::2], index_array[1::2], out=index_array[1::2])

index_array = index_array.reshape([1<<28])

Each reshape call takes a new view of the array where data in even-numbered rows corresponds to sets with the bit at bit_index clear, and data in odd-numbered rows corresponds to sets with the bit at bit_index set. The numpy.maximum call then performs the bubble-up operation for that bit. At the end, each cell index_array[i] of index_array represents the length of the longest list whose elements are a subset of set i.
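
To make the reshape trick concrete, here is a made-up miniature of the same loop on a 3-bit universe (the starting lengths are arbitrary):

import numpy

demo = numpy.array([0, 3, 1, 0, 2, 0, 0, 0])  # lengths for sets {}, {0}, {1}, {0,1}, {2}, {0,2}, {1,2}, {0,1,2}
for bit_index in xrange(3):
    demo = demo.reshape([1 << (3 - bit_index), 1 << bit_index])
    numpy.maximum(demo[::2], demo[1::2], out=demo[1::2])
demo = demo.reshape([1 << 3])
# demo is now [0, 3, 1, 3, 2, 3, 2, 3]: e.g. index 5 = {0, 2} picked up the length-3
# entry from its subset {0} (index 1), and index 6 = {1, 2} picked up the length-2
# entry from {2} (index 4).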

We then compute the products of lengths at complementary indices:

products = index_array * index_array[::-1]  # We'd probably have to adjust this part
                                            # if we picked dtype=int8 earlier.

find where the best product is:

best_product_index = products.argmax()

and the longest lists whose elements are subsets of the set represented by best_product_index and its complement are the lists we want.
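
One way to actually recover those two lists, rather than just their lengths, is a brute-force scan back through the original collection once best_product_index is known; a minimal sketch, assuming lists is the original collection of lists:

def longest_contained(lists, target_set):
    # Longest original list whose elements all lie inside the set represented by target_set.
    best = []
    for raw_list in lists:
        if all(target_set & (1 << n) for n in raw_list) and len(raw_list) > len(best):
            best = raw_list
    return best

winning_pair = (longest_contained(lists, best_product_index),
                longest_contained(lists, (1 << 28) - 1 - best_product_index))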

user2357112
  • Holy...the speed gain is beyond insane. Well played, sir. – AsinglePANCAKE Aug 10 '16 at 05:39
  • "Each reshape call takes a new view of the array where data in even-numbered rows corresponds to sets with the bit at bit_index clear, and data in odd-numbered rows corresponds to sets with the bit at bit_index set. " What do you mean by "clear" and "set" here? – AsinglePANCAKE Aug 10 '16 at 05:53
  • @AsinglePANCAKE: "clear" means there's a 0 in that bit, and "set" means there's a 1 in that bit. – user2357112 Aug 10 '16 at 06:22
  • Is there a good way to carry which list produced the given max. I can construct a word_array alongside the numpy array without too much trouble, but with all the reshaping based on taking maxima, it seems as if it might be hard to pull the max-producing lists along for the ride. – AsinglePANCAKE Aug 10 '16 at 18:06
  • @AsinglePANCAKE: You could probably do it by using `numpy.argmax` instead of `numpy.maximum` to select the indices of the maxima instead of the maxima themselves, and then using the indices to bubble up lists and lengths together, but I think I'd probably just brute-force search through the original lists to find the ones we want once we have `best_product_index`. – user2357112 Aug 10 '16 at 18:21
  • So one thing about scanning through products is that we're scanning through something of size 1<<28. If, for each list in the original collection, we kept its subset index in an array, word_subsets of size len(collection), we could compute the max length as `max(index_array[subset]*index_array[(1<<28) - subset - 1] for subset in word_subsets)`. In this way, we'd be scanning through something only of len(collection) instead of 1<<28. If I do that, argmax is probably the way to go yeah? – AsinglePANCAKE Aug 10 '16 at 18:48
  • @AsinglePANCAKE: You might be able to save some time that way, although you'd want to do something like `numpy.argmax(index_array[word_subsets] * index_array[(1<<28) - 1 - word_subsets])` to avoid the Python loop. I don't know if it'd save much time; I think either the bubble-up phase or the initialization of `index_array` would dominate the runtime. Also, the index you get won't immediately tell you which list you want, because you still could have gotten a bubbled-up length, or two lists with the same set of elements. – user2357112 Aug 10 '16 at 22:24
  • I guess I can only get the max from there and not the index of the max. I have to admit I'm a bit confused how to use argmax to pull out the lists that produce the max length product since np.maximum will give me the max indices pre bubble up. Perhaps you're right that I should just leave the product in there as is. Although if I haven't beaten this dead horse too much yet, I'd appreciate some details with respect to using argmax. – AsinglePANCAKE Aug 10 '16 at 22:42
  • One of the things I tried was storing indices in `index_array`, indices which map to a position in `list_lengths`. list_lengths is literally just the original collection of lists where the list is replaced by its length. Then, instead of numpy.maximum i do `index_array[1::2] = np.where(list_lengths[longest_evens] > list_lengths[longest_odds], longest_evens, longest_odds)` and longest_evens and longest_odds are just the `index_array[::2]` stuff. However, this takes a lot longer to run. On the dictionary of 100k words i'm using, it's 34 seconds to run now instead of 7. Speed tips? – AsinglePANCAKE Aug 12 '16 at 20:48

This is a bit too long for a comment so I will post it as an answer. A more direct way to index your subsets as integers is to use "bitsets", with each bit in the binary representation corresponding to one of the numbers.

For example, the set {0,2,3} would be represented by 2^0 + 2^2 + 2^3 = 13 and {4,5} would be represented by 2^4 + 2^5 = 48

This would allow you to use simple lists instead of dictionaries and Python's generic hashing function.
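
A quick interactive check of those two examples, building the bitset with a generator expression:

>>> sum(1 << n for n in [0, 2, 3])
13
>>> sum(1 << n for n in [4, 5])
48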

hugomg
  • This is how they are indexed. I left off the part where I do it, since the crux of my question was in a post-indexing world. I didn't realize that hashing those was bad... – AsinglePANCAKE Aug 10 '16 at 04:23
  • Reading the explanation of your algorithm there doesn't seem to be anything obviously wrong in a big-O sense. Since we are not trying to optimize constant factors, I would recommend using a profiler to find out what are your bottlenecks. – hugomg Aug 10 '16 at 04:25
  • Appreciate the read and review. Is there a way to profile code that doesn't require the script to actually finish to get a running analysis of what's going on? Aside from putting timestamps everywhere.... – AsinglePANCAKE Aug 10 '16 at 04:29
  • Try http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script. AFAIK, it should still work if you kill your script halfway through. If it doesn't then in your case you can temporarily reduce the size of the `for i in xrange(1<<28):` loop for a similar effect. – hugomg Aug 10 '16 at 04:33