Bottom up set generation and ordering

Question

If you know of any method, numerical or otherwise, that may be relevant, please post it here!

Background

I have an array of values, one per set. The index of each value identifies the set the value belongs to, so I represent a set as an integer in which bit positions represent elements. E.g. the set containing only element one is represented as ...001, where the 1 is the LSB.

So the set is only an index and is never stored; it is generated on the fly and serves as the key into the array that holds the values of the sets.

What I do is, given a set, check whether the summed value of any pair of disjoint subsets that together make up the set is greater than the value of the set itself. E.g. if set 0111 has a value of 3, and its subsets 0100 and 0011 each have a value of 2, then the split is more beneficial (2 + 2 = 4 > 3). I do this for all subsets of the set.

Given three agents, with the sets ordered by their number representation:

val[8] = {0,1,2,4,3,2,4,2} the values are not important, only how they are ordered
          0 0 0 0 1 1 1 1 MSB bit representation of the index
          0 0 1 1 0 0 1 1
          0 1 0 1 0 1 0 1 LSB

Best splitting of 111 is 011 and 100 with a sum of 7. So to get the value of the set which contains only the first element, i.e. 001, you read val[1]; for the set with elements 1 and 3 (101), you read val[5].
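For reference, the top-down evaluation described above can be sketched on the CPU as follows. This is only a minimal sketch, not the actual kernel: `best_split` is a hypothetical name, and it assumes the values are indexed directly by the integer set. It uses the common `(sub - 1) & set` trick to walk the non-empty proper subsets of `set`; the complement `set ^ sub` is disjoint from `sub` by construction.

```c
#include <stdint.h>

/* Hypothetical helper: returns the larger of val[set] and the best sum
 * val[sub] + val[set ^ sub] over all splits of `set` into two non-empty,
 * disjoint parts. val is indexed directly by the integer set. */
int best_split(const int *val, uint32_t set)
{
    int best = val[set];
    /* (sub - 1) & set steps through every proper subset of `set` in
     * decreasing order; the loop stops when sub reaches 0. */
    for (uint32_t sub = (set - 1) & set; sub != 0; sub = (sub - 1) & set) {
        uint32_t rest = set ^ sub;   /* complement of sub within set */
        if (sub < rest)
            continue;                /* visit each unordered pair once */
        int sum = val[sub] + val[rest];
        if (sum > best)
            best = sum;
    }
    return best;
}
```

With the three-agent array above, `best_split(val, 7)` looks at 110 + 001, 101 + 010 and 100 + 011 and returns 7.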

How the val array is ordered when grouped by cardinality

val[8] = {0,1,2,3,4,2,4,2}
          0 0 0 1 0 1 1 1 MSB bit representation of the index
          0 0 1 0 1 0 1 1
          0 1 0 0 1 1 0 1 LSB

Here you have to translate the index to the right bin in the array, so for a set with only the third element in it (100) the lookup would be val[translate(4)]. Think of arrays with more than 2^25 elements. See Improving random memory access when random access is needed for further clarification.

However, this results in a lot of random memory access, even if I group the sets by cardinality. Currently, grouping them by cardinality and generating an index is slower than ordering them by the number the set represents.

The way I generate an index with the sets grouped by cardinality is by using Pascal's triangle in constant memory, as described by the answer in Determin the lexicographic distance between two integers

Where each set's value is located when ordered and grouped by cardinality, with four agents

n index 1  2  4  8     3  5  6  9  10 12    7  11 13 14    15
        -----------------------------------------------------
MSB     0  0  0  1  |  0  0  0  1  1  1  |  0  1  1  1  |  1
        0  0  1  0  |  0  1  1  0  0  1  |  1  0  1  1  |  1
        0  1  0  0  |  1  0  1  0  1  0  |  1  1  0  1  |  1
LSB     1  0  0  0  |  1  1  0  1  0  0  |  1  1  1  0  |  1

n index is the index the set would have if the array were not ordered by cardinality. This is just to show where the value for each set is located.
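To make the layout concrete, here is a rough CPU sketch of one way such a translation can be computed from Pascal's triangle. It is not my actual implementation: `translate_by_cardinality`, `init_binom` and `MAX_AGENTS` are names made up for this example, and it assumes the empty set sits at index 0, as in the three-agent array above. The offset of a cardinality group is the number of smaller sets, and the rank inside the group is the sum of C(p_i, i) over the set bits, where p_i is the position of the i-th lowest set bit.

```c
#include <stdint.h>

#define MAX_AGENTS 32   /* assumption: sets fit in a 32-bit integer */

/* binom[n][k] = C(n, k), filled in once from Pascal's rule */
static uint64_t binom[MAX_AGENTS + 1][MAX_AGENTS + 2];
static int binom_ready = 0;

static void init_binom(void)
{
    for (int n = 0; n <= MAX_AGENTS; n++) {
        binom[n][0] = 1;
        for (int k = 1; k <= MAX_AGENTS + 1; k++)
            binom[n][k] = (n == 0) ? 0 : binom[n - 1][k - 1] + binom[n - 1][k];
    }
    binom_ready = 1;
}

/* Hypothetical translation: position of `set` in an array that is grouped
 * by cardinality and, inside each group, sorted by the integer value. */
uint64_t translate_by_cardinality(uint32_t set, int n_agents)
{
    if (!binom_ready)
        init_binom();

    int card = 0;        /* number of set bits seen so far */
    uint64_t rank = 0;   /* position inside the cardinality group */
    for (int p = 0; p < n_agents; p++) {
        if ((set >> p) & 1u) {
            card++;
            rank += binom[p][card];
        }
    }

    uint64_t offset = 0; /* number of sets with fewer elements */
    for (int j = 0; j < card; j++)
        offset += binom[n_agents][j];

    return offset + rank;
}
```

For three agents this gives set 4 (100) index 3 and set 3 (011) index 4, which matches the ordering shown earlier.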

The integer set represents an index into the value array, either directly (what I am currently doing, which gives random access) or through a translation from the set to an index.

The idea

Instead of splitting a set into subsets, I thought of generating the sets bottom up. E.g. instead of splitting 0111 into all of its pairwise disjoint subsets, I would at some point generate it from the sets {0100,0011},{0010,0101},{0001,0110}.

How and why it should work

Say we want to evaluate all the splittings of the sets with a cardinality of 3, i.e. sets 7, 11, 13, 14. As the only way to split a set of cardinality 3 is into sets of cardinality 1 and 2, we need to check whether the sum of any disjoint pair of subsets of cardinality 1 and 2 is greater than the value of their union.

Notation of what is required (may be a little flawed):

|C| = n, ∀ a, b : a ∪ b = C, a ∩ b = ∅, |a| + |b| = n

So, by reading in the values with coalesced memory access across the threads, for each pair of subsets that forms a set of cardinality n, check whether the summed value is greater than the value of the formed set; if so, update the value.

Simple example: if n = 2 then you should read in all values with cardinality 1, form all combinations of those sets and update accordingly. This example is easy as all the sets are disjoint:

Pseudo code for 4 threads; the input card1 is a pointer to the array of values for the sets with |s| = 1
__shared__ int value[4];
int tid = threadIdx.x;
value[tid] = card1[tid]; // coalesced memory access
__syncthreads(); // make every loaded value visible to the whole block
int thvalue = value[tid]; // holds the thread's own value, to avoid bank conflicts
int rvalue[2] = {0, 0}; // holds the sums, blockDim.x/2 entries
int i = blockDim.x;
int x = 0;
// reduction loop that does not generate duplicate sets
for (; i > 1; i >>= 1) { // two iterations for 4 threads: i = 4, then i = 2
    if (tid < i) {
        x++;
        rvalue[x-1] = value[(tid+x)%blockDim.x] + thvalue;
    }
}
for (i = 0; i < x; i++) {
    int index = getindex(tid,i,1); // gets the index of the set it generated; 1 represents the cardinality
    if (output[index] < rvalue[i])
        output[index] = rvalue[i];
}

Iteration of the reduction loop

Thread set specific for thread  first iteration second iteration 
0      0001                     0001 + 0010     0001 + 0100
1      0010                     0010 + 0100     0010 + 1000
2      0100                     0100 + 1000     none
3      1000                     1000 + 0001     none

As you can see, it has fetched the values for all the subsets that form sets of cardinality 2.
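The schedule in the table can be checked on the CPU with a small simulation. This is just a sketch for the 4-thread case with the two iterations listed in the table; `simulate_pairs` and `N` are names made up for the example. It replays what each thread would combine and counts how often each resulting set is produced.

```c
#include <stdint.h>

#define N 4   /* number of simulated threads / singleton sets */

/* counts[mask] receives the number of times the set `mask` is generated
 * when every thread tid pairs its own singleton 1 << tid with the
 * singleton of thread (tid + x) % N, for x = 1 and x = 2. */
void simulate_pairs(int counts[1 << N])
{
    for (int m = 0; m < (1 << N); m++)
        counts[m] = 0;

    for (int tid = 0; tid < N; tid++) {
        int x = 0;
        for (int i = N; i > 1; i >>= 1) {   /* i = 4, then i = 2 */
            if (tid < i) {
                x++;
                uint32_t own = 1u << tid;
                uint32_t other = 1u << ((tid + x) % N);
                counts[own | other]++;
            }
        }
    }
}
```

Each of the six sets of cardinality 2 (0011, 0101, 0110, 1001, 1010, 1100) comes out with a count of exactly 1, so for this size no set is generated twice and none is missed.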

The problem, however, is that generating sets of cardinality greater than 2 is trickier, because not all sets are disjoint. E.g. 0001 and 0011 are not disjoint.

Keep in mind that I do not store the sets anywhere, only the value for the sets.

Finally

With this in mind, how would you go about creating an algorithm that reads memory in a coalesced manner and generates all sets from disjoint subsets? It should be completely deterministic, without checking whether the subsets are disjoint.

For bounty

The algorithm should be described either in text with distinct steps marked out, or in pseudo code.

It should be proven with examples that it works. Note that this algorithm goes up to 2^32 sets, so it needs to scale well.

The algorithm is allowed to be split into two or more instances, e.g. one for even numbers and one for odd.

I would gladly be referred to sources about the technique you use.

The algorithm should use as few assignments and instructions as possible and should avoid any divergence. But if you think you have one even though it has a lot of these, please post it anyway; I will be happy with any information.

If it is ordered in another way but still works as I have described, I urge you to post it here; any help is really appreciated.

Please ask if there is anything unclear.

TL/DR Simple explanation

I have an array Z of values, where the index i in Z[i] represents an integer set. The values are grouped by cardinality and, within each group, ordered by binary lexicographical permutation, so the sets' values are located in the order 1,2,4,3,5,6,7. I therefore use a function (which I have implemented) to translate a set to the correct index, e.g. set 3 -> index 4.

With the values for the sets grouped by cardinality, what I want is to see whether the summed value of any pair of disjoint sets is greater than the value of the set they form.

E.g. |a| = 3, |b| + |c| = 3, b ∩ c = ∅, |b| = 1. So read in X values of type b and X values of type c, find all the disjoint pairs of b and c that form sets of type a (cardinality 3), and get their sums. Continue until all the sets have been "generated".

For reference

Hamming weight based indexing

Determin the lexicographic distance between two integers

Improving random memory access when random access is needed

I may just have missed it, but how is the value for a set determined? (as in value of `0111` = 3). Is random access, cardinality, etc. all part of [the other question](http://stackoverflow.com/questions/13905645/improving-random-memory-access-when-random-access-is-needed)? (as in, for this question, you only want an algorithm that assumes sufficiently short time lookup for values of bit strings (obviously you ideally want one that solves both problems)). — Bernhard Barker, Jan 04 '13 at 13:11
@Dukeling Have updated some in my text, but the values are predetermined, and all I do is evaluating if any splitting of a set is more beneficial than the set itself. What I currently have is working good, just horrible random access. What I want, is an algorithm that does what the pseudo code does, but for sets of all cardinalities. So an algorithm that pull in the values associated with sets of known cardinality in coalesced manner, and calculates all the sums of two sets that form a set of greater cardinality. So it may read in sets of cardinality 1 and 2 to form sets of cardinality 3. — 1-----1, Jan 04 '13 at 14:29

score 1 · Answer 1 · answered Jan 05 '13 at 21:35

I don't know whether this will help you or not, but I found a branchless count-all-the-1-bits-in-a-word function in Hacker's Delight that seems like it might be useful in helping you determine the cardinality of a set:

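int pop(unsigned int x) {
    x = x - ((x >> 1) & 0x55555555);                // counts per 2-bit field
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333); // counts per 4-bit field
    x = (x + (x >> 4)) & 0x0F0F0F0F;                // counts per byte
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;                          // total is at most 32
}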

In the text, Warren claims that the above sequence can be compiled down to as little as 21 instructions. However, using MSVC 2010 on an i7 dev machine I checked the disassembly for this function and found that it clocked in at around 22 instructions for the actual computation and 33 instructions in total (counting stack ops). On a modern CPU or GPU, it should be pretty fast, since it has no branching.

CUDA got a __popc instruction for population count, what I am looking for is an algorithm, that can do bottom up set generation that evaluates on values from the subsets that generated the set. But thank you for your consideration. — 1-----1, Jan 05 '13 at 21:57

score 1 · Answer 2 · edited May 23 '17 at 10:24

Try exploiting ternary numbering

For terminology I am calling "value" your set valuation function and "target" your target function which is max of sum of values over every binary partition.

Every splitting of a binary number B into two disjoint parts, L and R, can be represented by a ternary number C, where

B = L | R   (bitwise OR)
L & R = 0   (bitwise AND)

C[i] = 0 means B[i] = 0 and L[i] = R[i] = 0
C[i] = 1 means B[i] = 1 and L[i] = 1
C[i] = 2 means B[i] = 1 and R[i] = 1

Then "simply" enumerate the numbers from 0 to 3**n - 1 in ternary, e.g. (n=3): 000, 001, 002, 010, 011, 012, 020, ...

OK, actually, efficiently counting in ternary is not completely trivial when all you have at hand is binary. But bear with me, I will explain that bit after going over the high level algo ...

So you count in ternary in order, and given a ternary number C, you obtain L and R - how? I'll explain that below too, trust me :)

Given L and R you can now look up your valuation at L and R and update the target at B: target[B] = max(target[B], val[L] + val[R]).

OK that's the high-level algo. I can't prove it on such short notice but it does seem like it has very good cache locality properties. In other words value[L] and value[R] will tend to stay in a small number of cache lines at a time. Also I think the best bet for parallelizing is to split C by its value modulo 3, or modulo 9, etc.
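To make the high-level algo concrete, here is a plain CPU sketch. It decodes the ternary digits with ordinary division instead of the faster counting trick, and it takes the target to be the maximum sum over binary partitions, as in the terminology at the top of this answer. The name `best_splits_ternary` is made up for the example, and it assumes `val` and `target` hold 2**n entries indexed directly by the set.

```c
#include <stdint.h>

/* Sketch: for every ternary number c with n digits, digit 1 puts element i
 * into L and digit 2 puts it into R. The target of B = L | R is raised to
 * val[L] + val[R] whenever that sum is larger. */
void best_splits_ternary(const int *val, int *target, int n)
{
    uint32_t n_sets = 1u << n;
    uint64_t n_codes = 1;
    for (int i = 0; i < n; i++)
        n_codes *= 3;                    /* 3**n ternary numbers */

    for (uint32_t b = 0; b < n_sets; b++)
        target[b] = val[b];              /* not splitting is always allowed */

    for (uint64_t c = 0; c < n_codes; c++) {
        uint32_t l = 0, r = 0;
        uint64_t digits = c;
        for (int i = 0; i < n; i++) {    /* decode digit i of c */
            int d = (int)(digits % 3);
            digits /= 3;
            if (d == 1)
                l |= 1u << i;
            else if (d == 2)
                r |= 1u << i;
        }
        if (l == 0 || r == 0)
            continue;                    /* skip splits with an empty part */
        uint32_t b = l | r;
        int sum = val[l] + val[r];
        if (sum > target[b])
            target[b] = sum;
    }
}
```

With the three-agent values from the question, val = {0,1,2,4,3,2,4,2}, this leaves target[7] = 7 (from 011 and 100). Every unordered split is visited twice, once with L and R swapped, which is harmless for a max.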

efficient ternary counting in binary

How can we count in ternary efficiently? Try the following: Count in base 4, and skip some.

In other words a ternary digit will be represented by two bits, and we will disallow the combination 11.

 repr | value
 0 0  | 0
 0 1  | 1
 1 0  | 2
 1 1  | *undefined*

Now, how do we efficiently know when to skip ? Well, the pattern of increments is easy enough to figure out:

1 1 2 1 1 2 1 1 6 1 1 2 1 1 2 1 1 6 1 1 2 1 1 2 1 1 22 1 1 2 ...

My suggestion would be to precalculate a large chunk of size a power of 3 (e.g. 3 ** 7 = 2187) and compute the nth power of 3 on the fly once in a while [hint : it's related to cubes of n ..].

So you start with 00.00.00. You add 1 that's 00.00.01. You add 1 that's 00.00.10. Now you have to add 2 in order to skip the 11 combination, that leaves you with 00.01.00. Etc.
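The skip rule can also be applied directly, without a table of increments. The following is only a sketch of that idea, with `next_ternary_code` as a made-up name: add 1, then, as long as some two-bit digit reads 11, add 1 at that digit so it wraps to 00 and carries into the next one. It assumes the input never contains an 11 digit and that there is room for the carry (the value simply wraps once all 32 digits are at 2).

```c
#include <stdint.h>

/* Successor of c, where c stores one ternary digit per two bits and the
 * bit pattern 11 is not allowed. */
uint64_t next_ternary_code(uint64_t c)
{
    const uint64_t low_bits = 0x5555555555555555ULL; /* low bit of each digit */
    c += 1;
    for (;;) {
        /* low bit of every digit whose two bits are both set */
        uint64_t bad = c & (c >> 1) & low_bits;
        if (bad == 0)
            return c;
        /* adding the lowest such bit turns 11 into 00 with a carry */
        c += bad & (~bad + 1);
    }
}
```

Starting from 0 this visits 0, 1, 2, 4, 5, 6, 8, 9, 10, 16, ..., which reproduces the increments 1 1 2 1 1 2 1 1 6 listed above.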

how to obtain L and R from C

Now C in our quaternary representation of ternary is actually simply L and R interleaved. To get L and R back efficiently, you can check the answer to this S/O question or apply other bit twiddling hacks.
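One of those bit twiddling hacks, sketched here under the assumption that digit value 1 is stored as 01 and digit value 2 as 10 (as in the table above): L is made of the even bits of C and R of the odd bits, so both can be pulled out with the usual "compact every second bit" shuffle. `compact_even_bits` and `split_code` are names made up for this example.

```c
#include <stdint.h>

/* Gathers bits 0, 2, 4, ... of x into bits 0, 1, 2, ... of the result. */
static uint32_t compact_even_bits(uint64_t x)
{
    x &= 0x5555555555555555ULL;
    x = (x | (x >> 1))  & 0x3333333333333333ULL;
    x = (x | (x >> 2))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x >> 4))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x >> 8))  & 0x0000FFFF0000FFFFULL;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFULL;
    return (uint32_t)x;
}

/* Splits the two-bits-per-digit code c into its L and R bit sets. */
void split_code(uint64_t c, uint32_t *l, uint32_t *r)
{
    *l = compact_even_bits(c);       /* low bit of each digit  -> L */
    *r = compact_even_bits(c >> 1);  /* high bit of each digit -> R */
}
```

For example the code 01.10.01 (ternary 121) gives L = 101 and R = 010, hence B = 111.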

afterthought

All in all, I'm not sure whether we've really been using base 3 or base 4. Oh well ...

Have fun, and good luck!

By the way in case you ever need efficient modulo 3 for this you can use the fact that 2 mod 3 == -1 mod 3, so the recipe is: de-interleave your input number into even and odd bits, and subtract the bit population counts — spam_eggs, Jan 10 '13 at 07:50
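A small sketch of that modulo 3 recipe, for what it's worth. The names `popcount32` and `mod3` are made up for the example, and the loop-based bit count is only there to keep the snippet portable (on CUDA, `__popc` would take its place). Since 2**k is congruent to 1 mod 3 for even k and to -1 for odd k, the difference of the two population counts is congruent to x mod 3.

```c
#include <stdint.h>

/* Portable bit count: clears the lowest set bit until none are left. */
static int popcount32(uint32_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* x mod 3 from the bit counts of the even and odd bit positions. */
unsigned mod3(uint32_t x)
{
    int even = popcount32(x & 0x55555555u); /* bits worth +1 mod 3 */
    int odd  = popcount32(x & 0xAAAAAAAAu); /* bits worth -1 mod 3 */
    int d = (even - odd) % 3;               /* in the range -2..2 */
    return (unsigned)(d < 0 ? d + 3 : d);
}
```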