
I am writing a tablebase for a Japanese chess variant. To index the tablebase, I encode each chess position as an integer. In one of the encoding steps, I encode where the pieces are on the board. Since the actual method is a bit complicated, let me explain the problem in a simplified manner.

The Encoding

In the endgame tablebase, I have (let's say) six distinct chess pieces that I want to distribute over a board with 9 squares. I can naïvely represent their positions by a six-tuple (a, b, c, d, e, f) where each of the variables a to f is a number in the range 0 to 8 inclusive indicating where the corresponding chess piece is located.

However, this representation is not optimal: no two chess pieces can occupy the same square but the aforementioned encoding happily allows this. We can encode the same position by a six-tuple (a, b', c', d', e', f') where a is the same a as before and b' is a number from 0 to 7 inclusive indicating the number of the square the second piece is on. This works by assigning a number from 0 to 7 to each square the first piece is not on. For example, if the first piece is on square 3, the square numbers for the second piece are:

1st piece: 0 1 2 3 4 5 6 7 8
2nd piece: 0 1 2 - 3 4 5 6 7

The other pieces are encoded similarly: c' as a number from 0 to 6, d' as a number from 0 to 5, etc. For example, the naïve encoding (5, 2, 3, 0, 7, 4) yields the compact encoding (5, 2, 2, 0, 3, 1):

1st: 0 1 2 3 4 5 6 7 8 --> 5
2nd: 0 1 2 3 4 - 5 6 7 --> 2
3rd: 0 1 - 2 3 - 4 5 6 --> 2
4th: 0 1 - - 2 - 3 4 5 --> 0
5th: - 0 - - 1 - 2 3 4 --> 3
6th: - 0 - - 1 - 2 - 3 --> 1

In my actual encoding, the number of pieces I want to encode is not fixed. The number of squares on the board however is.

The Question

How can I efficiently convert the naïve representation to the compact representation and vice versa? I use standard C99 for the program. In the context of this question, I am not interested in answers that use non-standard constructs, inline assembly or intrinsics.

Question Clarification

As there seems to be some confusion about the question:

  • The question is to find a practically efficient way to implement the conversion between the naïve and the compact position representations.
  • Both representations are n-tuples of integers in certain ranges. The question is not about how to encode these representations into anything else.
  • In one of the cases I have, the number of squares is 25 and the number of pieces is up to 12. I am however interested in an implementation that works for a reasonable parameter space (e.g. up to 64 squares and up to 32 pieces).
  • I am not interested in alternative representations or encodings, especially representations or encodings that are not optimal.
  • Nor am I interested in remarks that the compact representation isn't worth the effort.
fuz
    You're talking about your encoded representation in terms of its efficiency. I guess you mean its space efficiency, but I'm not seeing what actually makes it any smaller in practice than the naïve representation. If it is not smaller, then the ***in***efficiency of encoding and decoding that representation would seem to be pointless. – John Bollinger Sep 21 '16 at 17:50
  • @JohnBollinger The implementation should be *time* efficient, as for my task the encoding function is part of the critical path that determines how many hours it takes to build the tablebase. The encoding itself is *space* efficient in that it reduces a larger encoding space into a smaller one (thus reducing the tablebase size), but its space efficiency isn't dependent on its implementation. – fuz Sep 21 '16 at 17:58
  • Yes, the naïve representation provides a value space that includes invalid values. *In principle*, a representation that affords only valid values could occupy less space, but whether it occupies less space in practice is a question of a lower level of encoding, which you have not discussed. One way to do that would be for the elements of your tuple to be variable-length, but even that doesn't gain much unless the number of pieces is a large fraction of the number of squares. – John Bollinger Sep 21 '16 at 18:16
  • Is the actual game played on an 8x8 board, so six out of 64 squares would be occupied in your example? – user3386109 Sep 21 '16 at 18:18
  • @user3386109 The game I am concerned about has a 5×5 board with up to 12 squares occupied. – fuz Sep 21 '16 at 18:19
  • @JohnBollinger I have computed that using the *compact* representation as opposed to the *naïve* representation can reduce the encoding space by a factor of 24. Since we are talking about a table several terabytes in size, this optimization is worthwhile. Please stay on topic though, whether what I want to do is very useful or only slightly is not an important part of the question. – fuz Sep 21 '16 at 18:23
  • @JohnBollinger Or to say it another way: I'm sick of people always wanting me to justify why I want to compute certain things. That's not part of the question I'm asking so don't steer the discussion towards it. – fuz Sep 21 '16 at 18:28
  • For space efficiency, simply "zip" the data. Yet the goal is both space and time efficiency (without a clear factor rating time vs. space, leading to a lack of clarity in the question.) For a balanced approach on a sparsely populated board with more than 16 spaces, I see little better than `uint8_t piece[12]`, or if you like bit twiddling, for a 5x5 board (25 places --> 5 bits) and 12 pieces, use 60 bits of a `uint64_t`. – chux - Reinstate Monica Sep 21 '16 at 18:31
  • A factor of 24??? Are you sure? How do you compare both representations? – Jean-Baptiste Yunès Sep 21 '16 at 18:31
  • For your 6 pieces in 9 positions the naïve solution requires 6*4 = 24 bits to represent. The next solution requires 4+3+3+3+3+2 = 18 bits to represent. If you pack it even further with no wastage it would need 9*8*7*6*5*4 = 60480 values to represent, now a 16-bit value. – Weather Vane Sep 21 '16 at 18:33
  • 6 positions in a space of size 9: you can encode each position into 4 bits, leading to 24 bits for each set of positions, without any heavy encoding or decoding. Isn't that sufficient? – Jean-Baptiste Yunès Sep 21 '16 at 18:33
  • @FUZxxl, you have a very clever encoding of this representation indeed if it's saving you more than about 5%. But I will not go into further detail, since you do not want to talk about it. – John Bollinger Sep 21 '16 at 18:33
  • What is the real number of pieces and the size of the space? – Jean-Baptiste Yunès Sep 21 '16 at 18:35
  • If my math is correct, up to 12 pieces on a 5x5 board gives more combinations than 2^32, but less than 2^64. If the goal is a fixed-size record, it looks like the fastest 64-bit encoding/decoding is the way to go. `uint64_t state; pos = 31 & (state >> (piece_index*5));` – chux - Reinstate Monica Sep 21 '16 at 18:39
  • I think you've already answered the question in the last snippet. Create an array where the values are initially equal to the indexes. When you use a square, take its value from the array, and decrement all the values to the right. If the board size was huge, you could optimize this with binning. But with only 25 entries in the array, a simple loop is going to be the fastest. – user3386109 Sep 21 '16 at 18:42
  • @chux I have tested a naïve encoding with gzip. This is extremely slow as I have to go through an extremely large number of invalid encodings (which you need to decode to determine that they are invalid) while building the table. Also, gzip streams aren't seekable. While this can be avoided with some tricks, it turns out that rolling out an effective encoding and custom compression can achieve better time and space performance than that. However, all of this is not part of the question. – fuz Sep 21 '16 at 18:42
  • @chux Please stop suggesting alternative approaches. That's not what this question is about. I do not want to hear about your alternative approaches. I have considered them and determined that I want to do is the best choice for the problem at hand. Please stay on topic. – fuz Sep 21 '16 at 18:43
  • @Jean-BaptisteYunès 25 squares and up to 12 pieces, with the number of pieces being known in advance from another part of the encoding. – fuz Sep 21 '16 at 18:44
  • Not sure, but perhaps you should not try to encode each position of the game efficiently, but rather encode the difference between two positions (moves) efficiently. If this is fixed size then a seek is easy; if not, you can *cut* your data into chunks of known size and begin each chunk with the full description of the position... – Jean-Baptiste Yunès Sep 21 '16 at 18:45
  • @Jean-BaptisteYunès I am not interested in suggestions for alternative encodings. Please stay on topic. – fuz Sep 21 '16 at 18:45
  • OP commented that the question does not contain the actual constraints. For your 12 pieces in 25 positions the naïve solution requires 12*5 = 60 bits to represent. The next solution requires (9*5) + (3*4) = 57 bits to represent. If you pack it even further with no wastage it would need 25*24*23*22*21*20*19*18*17*16*15*14 = 2490952020480000 values to represent, which is still a 64-bit value. – Weather Vane Sep 21 '16 at 18:46
  • @WeatherVane That is unimportant. Please stay on topic. – fuz Sep 21 '16 at 18:46
  • @FUZxxl you have confused the question with the new constraints. Please advise the correct topic. I previously commented that a 16-bit value could encode ***the question***. – Weather Vane Sep 21 '16 at 18:47
  • @WeatherVane The correct topic is to find an efficient (as in: actually efficient) way to implement the encoding and decoding functions described in the question in the C99 programming language as indicated in the **The Question** section. – fuz Sep 21 '16 at 18:49
  • @WeatherVane The input and output of the encoding and decoding functions are six-tuples of integers in certain ranges, not 16 bit values. Please do not describe alternative functions, I am not interested in them unless they are then immediately used to implement the function I am interested in. – fuz Sep 21 '16 at 18:50
  • I am sorry, I thought you wanted to encode an efficient look-up table. Encoding each position as a 16-bit integer is more efficient than using 24-bit. – Weather Vane Sep 21 '16 at 18:51
  • "However, this representation is not efficient:" implies an efficiency of _space_. Striking this line would help us all as it is not relevant to the goal of "How can I efficiently convert" (a time efficiency) from one form to another. – chux - Reinstate Monica Sep 21 '16 at 18:56
  • @chux This sentence is part of the motivation that introduces the compact representation. Striking it removes the motivation and makes the question unclear. Let me replace “efficient” with “optimal” instead. – fuz Sep 21 '16 at 18:57
  • @FUZxxl OK I deleted that. But I gave a perfectly clear comment as how an efficient - which you actually hedge at defining - compression can be to only 16 bits, for your example. – Weather Vane Sep 21 '16 at 18:58
  • @WeatherVane That encoding is less efficient as it contains codes that do not correspond to any position. It is important that the number of such codes is as low as possible as I need to iterate over all positions/encodings. However, as I said, other encodings/representations are not the topic of this question. – fuz Sep 21 '16 at 19:00
  • @FUZxxl well you did not digest it. There is no wastage whatsoever in the 9*8*7*6*5*4 = 60480 different positions, which is as low as you can get for the original question (subject to any possible illegal positions). – Weather Vane Sep 21 '16 at 19:04
  • Posting the not-quite-efficient-enough C code that you have that converts from one form into the other would provide an example of the bottlenecks the code is experiencing. I would be interested in then seeing how bit fields, bit-twiddling or other standard C constructs would apply, or is this just an algorithm post? – chux - Reinstate Monica Sep 21 '16 at 19:05
  • @WeatherVane Sorry, I confused your answer. Point is, I don't actually need to encode the representation into bytes as I never store a representation. The encoded representation is used as an index into a (very large) array (with each array entry containing some knowledge about that position), so it's the total number of possible encodings that matters. Invalid encodings mean that space is wasted in the array, so I need to avoid these. – fuz Sep 21 '16 at 19:15
  • @chux I actually haven't implemented it yet. The naïve approach to implement this involves two nested loops and runtime of order O(n*k) where *n* is the number of squares and *k* is the number of pieces, which is clearly not really fast. Since I wasn't able to come up with a better idea to implement this conversion of representations, I asked this question. – fuz Sep 21 '16 at 19:18
  • @FUZxxl that was my (not clearly made) point. I meant that `60480` different positions requires the smallest possible table, whatever the *content* of that table is, and there are no unused index values (below `60481`). – Weather Vane Sep 21 '16 at 19:26
  • @WeatherVane Indeed. My comment concerned your comment about 2490952020480000 still being a 64 bit value. Using a lookup table to compute the function is possible in the simplified example I gave in the question but it is not applicable in the actual case I want to use it in since the number of possible representations is just too large. – fuz Sep 21 '16 at 19:28
  • Now I spent an hour explaining why I want to compute this. Why can't all of you just believe that the question I ask is the question I want to know an answer for without doubting that I know what I am doing? – fuz Sep 21 '16 at 19:33
  • Going from "naïve" to "compact" can be done in `O((k+n)*log(n))` time by organizing the `n` squares into a height-balanced BST. Uncertain if the added complexity is worth the order-of-complexity savings. – chux - Reinstate Monica Sep 21 '16 at 19:38
  • A lot of brain juice was spent on this intricate question, well deserving a round of up-votes. – chqrlie Sep 21 '16 at 22:13
  • @Jean-Baptiste Yunès To your question about the 24 fold improvement: that's the size difference of the encoding space. The naïve encoding has a 24 times larger space than the compact one for 12 pieces on 25 squares. That means that with the naïve encoding, my database would be 24 times larger than with the compact one. – fuz Sep 21 '16 at 22:32

6 Answers


The naive solution to the problem: create an array where the values are initially equal to the indexes. When you use a square, take its value from the array, and decrement all the values to the right. The running time of this solution is O(n*p) where n is the number of squares on the board and p is the number of pieces on the board.

int codes[25];

void initCodes( void )
{
    for ( int i = 0; i < 25; i++ )
        codes[i] = i;
}

int getCodeForLocation( int location )
{
    for ( int i = location + 1; i < 25; i++ )
        codes[i]--;
    return codes[location];
}

You can attempt to improve the performance of this code with binning. Consider the locations on the board as 5 bins of 5 locations each. Each bin has an offset and each location in a bin has a value. When a value is taken from bin y at location x, the offsets for all bins after y are decremented, and all values to the right of x in bin y are decremented.

int codes[5][5];
int offset[5];

void initCodes( void )
{
    int code = 0;
    for ( int row = 0; row < 5; row++ )
    {
        for ( int col = 0; col < 5; col++ )
            codes[row][col] = code++;
        offset[row] = 0;
    }
}

int getCodeForLocation( int location )
{
    int startRow = location / 5;
    int startCol = location % 5;
    for ( int col = startCol+1; col < 5; col++ )
        codes[startRow][col]--;
    for ( int row = startRow+1; row < 5; row++ )
        offset[row]--;
    return codes[startRow][startCol] + offset[startRow];
}

The running time of this solution is O(sqrt(n) * p). However, on a board with 25 squares, you won't see much improvement. To see why, consider the actual operations done by the naive solution versus the binned solution. In the worst case, the naive solution updates 24 locations, whereas the binned solution updates 4 entries in the offset array and 4 locations in the codes array. That seems like a 3:1 speedup. However, the binned code contains a nasty division/modulo instruction and is more complicated overall, so you might get a 2:1 speedup if you're lucky.

If the board size was huge, e.g. 256x256, then binning would be great. The worst case for the naive solution would be 65535 entries, whereas binning would update a maximum of 255+255=510 array entries. So that would definitely make up for the nasty division and increased code complexity.

And therein lies the futility of trying to optimize small problem sets. You don't save much changing O(n) to O(sqrt(n)) or O(log(n)) when you have n=25, sqrt(n)=5, log(n)≈5. You get a theoretical speedup, but that's almost always a false savings when you consider the myriad constant factors that big-O so blithely ignores.


For completeness, here's the driver code that can be used with either snippet above

#include <stdio.h>

int main( void )
{
    int locations[6] = { 5,2,3,0,7,4 };
    initCodes();
    for ( int i = 0; i < 6; i++ )
        printf( "%d ", getCodeForLocation(locations[i]) );
    printf( "\n" );
}

Output: 5 2 2 0 3 1

user3386109
  • Interesting. It seems like your binning technique would be more difficult to apply to *de*coding (not that that would in any way invalidate it). Do you disagree? – John Bollinger Sep 21 '16 at 20:27
  • @JohnBollinger TBH, I haven't given any thought to decoding. I assume that OP already has a satisfactory decoding algorithm and only needs an efficient encoding algorithm, i.e. OP just wants to know how to convert `(5, 2, 3, 0, 7, 4)` to `(5, 2, 2, 0, 3, 1)`. However, your question is interesting, let me think about it. – user3386109 Sep 21 '16 at 20:44
  • @JohnBollinger I believe that binning can be used, but haven't worked out all the details. If you want to start a new question, I'll see what I can do. But I'll expect an upvote and a shiny green thing, feeling under appreciated here :( – user3386109 Sep 21 '16 at 21:36
  • @user3386109, yes, not too much appreciation going around on this Q. The decode binning is not so important to me as to ask you to work out the details, but I gave you an UV on this answer. – John Bollinger Sep 21 '16 at 21:52
  • @JohnBollinger Thanks, I just finished working out the details, and binning definitely works for decoding. – user3386109 Sep 21 '16 at 21:55
  • Interesting approach. I think one could avoid the modulo by using four bins of size eight with the last bin containing only one element. – fuz Sep 21 '16 at 22:28
  • @FUZxxl Good point. The optimal number of bins is `sqrt(n)`, but avoiding the nasty division/modulo might justify using a slightly non-optimal bin count. So the question is, "How bad is a division on a modern processor?" That can only be answered by profiling the code. – user3386109 Sep 21 '16 at 22:35
  • @user3386109 Division is terribly slow though the compiler can do some tricks to get it down to 20 or so cycles. If the optimal number is sqrt(n), then surely seven bins of four each is good, too. – fuz Sep 21 '16 at 22:37
  • @FUZxxl Yup, I agree. – user3386109 Sep 21 '16 at 22:48
  • @user3386109: division by a constant is usually implemented as a multiplication with a minimal adjustment, much less than 20 cycles. You can avoid the modulo by subtracting the multiplied quotient. – chqrlie Sep 21 '16 at 23:44
  • @chqrlie Looking at the optimized assembly code, you are correct. There's a multiplication followed by 8 additional simple instructions. On the other hand, the `for` loops are exactly 4 instructions long. On a modern processor, I'm not sure how that trade-off (multiply/adjust versus a couple extra passes through a short loop) works out. So I agree with FUZxxl that seven bins of four each should be good, too. If I had a large chunk of input data, I would profile the code to see how that trade-off works out in the real world. – user3386109 Sep 22 '16 at 00:31

I have found a more elegant solution for up to 16 positions using 64-bit integers with a single loop for both encoding and decoding:

#include <stdio.h>
#include <stdlib.h>

void encode16(int dest[], int src[], int n) {
    unsigned long long state = 0xfedcba9876543210ULL;
    for (int i = 0; i < n; i++) {
        int p4 = src[i] * 4;
        dest[i] = (state >> p4) & 15;
        state -= 0x1111111111111110ULL << p4; /* ULL: without the suffix the literal is signed and the shift can overflow */
    }
}

void decode16(int dest[], int src[], int n) {
    unsigned long long state = 0xfedcba9876543210ULL;
    for (int i = 0; i < n; i++) {
        int p4 = src[i] * 4;
        dest[i] = (state >> p4) & 15;
        unsigned long long mask = ((unsigned long long)1 << p4) - 1;
        state = (state & mask) | ((state >> 4) & ~mask);
    }
}

int main(int argc, char *argv[]) {
    int naive[argc], compact[argc];
    int n = argc - 1;

    for (int i = 0; i < n; i++) {
        naive[i] = atoi(argv[i + 1]);
    }

    encode16(compact, naive, n);
    for (int i = 0; i < n; i++) {
        printf("%d ", compact[i]);
    }
    printf("\n");

    decode16(naive, compact, n);
    for (int i = 0; i < n; i++) {
        printf("%d ", naive[i]);
    }
    printf("\n");
    return 0;
}

The code uses 64-bit unsigned integers to hold arrays of 16 values in the range 0..15. Such an array can be updated in parallel in a single step; extracting a value is straightforward, and deleting a value is a bit more cumbersome but still takes only a few steps.

You could extend this method to 25 positions using non-portable 128-bit integers (the type `__int128` is supported by both gcc and clang), encoding each position in 5 bits, taking advantage of the fact that 5 * 25 < 128, but the magical constants are more cumbersome to write.

chqrlie
  • Very interesting! – fuz Sep 21 '16 at 22:06
  • You can also extend the method by using multiple 64-bit values. The (quite clever) optimization here is that you're updating multiple counters with a single subtraction. With 4 bit counters, you get 16 subtractions for the price of one. With 5 bit counters, you get 12 subtractions for the price of one, etc. – user3386109 Sep 21 '16 at 22:26
  • @user3386109: true, but unfortunately, you cannot handle 25 positions with two 64-bit words because each word can only hold 12 complete positions... If the grid is square, the next size up from 4x4 is 5x5, requiring 3 words, then 6x6 requiring 4 words, 7x7 requiring 5 words and 8x8 requiring 7 words. Parallel subtraction is still reasonably simple, but extraction and deletion are more cumbersome with multiple words. – chqrlie Sep 21 '16 at 22:45
  • @chqrlie That's true. You could put an 8x8 board into eight 64-bit words, with an 8:1 speed up on the subtractions. But I agree that the 5x5 case is rather annoying because it doesn't quite fit into 2 words. – user3386109 Sep 21 '16 at 22:58
  • I benchmarked this vs. the implementation of the naive approach in my own answer. I generated 100000000 random valid inputs of length 12, and tested the times to encode them all and to decode them all, averaging the results over four trials with different input sets. The naive encoding averaged 5.07 ns on my hardware for encoding each input and 6.28 ns for decoding. But **this** implementation was much faster at 1.10 ns per input to encode, and 1.86 ns per input to decode. (gcc 4.4.7, optimizing at level -O3). – John Bollinger Sep 22 '16 at 14:47
  • @JohnBollinger: my pleasure ;-) – chqrlie Sep 22 '16 at 14:51
  • It will not surprise anyone that the difference is smaller when the input length is shorter. In fact, I find that the naive decoding is actually slightly faster (< 10% difference) for inputs of length 3, though the naive *en*coding is still more than twice as slow at that length. – John Bollinger Sep 22 '16 at 14:55
  • In my own tests, this was the fastest approach that doesn't need non-standard extensions (if your platform has a 128 bit integer that is). I am going to write up an answer with some other ideas I had, but yours is the best according to the constraints I set out. – fuz Sep 26 '16 at 14:33

Your encoding technique has the property that the value of each element of the output tuple depends on the values of the corresponding element and all preceding elements of the input tuple. I don't see a way to accumulate partial results during computation of one encoded element that could be reused in computation of a different one, and without that, no computation of the encoding can scale more (time) efficiently than o(n²) in the number of elements to be encoded. Therefore, for the problem size you describe, I don't think you can do much better than this:

typedef <your choice> element_t;

void encode(element_t in[], element_t out[], int num_elements) {
    for (int p = 0; p < num_elements; p++) {
        element_t temp = in[p];

        for (int i = 0; i < p; i++) {
            temp -= (in[i] < in[p]);
        }

        out[p] = temp;
    }
}

The corresponding decoding could be done like this:

void decode(element_t in[], element_t out[], int num_elements) {
    for (int p = 0; p < num_elements; p++) {
        element_t temp = in[p];

        for (int i = p - 1; i >= 0; i--) {
            temp += (in[i] <= temp);
        }

        out[p] = temp;
    }
}

There are approaches that scale better, some of them discussed in comments and in other answers, but my best guess is that your problem size is not large enough for their improved scaling to overcome their increased overhead.

Obviously, these transformations do not themselves change the size of the representation at all. The encoded representation is easier to validate, however, because each position in a tuple can be validated independently of the others. For that reason, the whole space of valid tuples can also be enumerated much more efficiently in the encoded form than in the decoded form.

I continue to maintain that the decoded form can be stored almost as efficiently as the encoded form, especially if you want to be able to address individual position descriptions. If your objective for the encoded form is to support bulk enumeration, then you could consider enumerating tuples in the "encoded" form, but storing and subsequently using them in the decoded form. The small amount of extra space needed might very well be worth it for the benefit of not needing to perform the decoding after reading, especially if you plan to read a lot of these.


Update:

In response to your comment, the elephant in the room is the question of how you convert the encoded form to a single index such as you describe, such that there are as few unused indices as possible. I think that is the disconnect that spawned so much discussion that you considered off-topic, and I presume that you have some assumptions about that feeding into your assertion of a 24x space savings.

The encoded form is more easily converted to a compact index. For example, you can treat the position as a little-endian number with the board size as its radix:

#define BOARD_SIZE 25
typedef <big enough> index_t;

index_t to_index(element_t in[], int num_elements) {
    // The leading digit must not be zero
    index_t result = in[num_elements - 1] + 1;

    for (int i = num_elements - 1; i--; ) {
        result = result * BOARD_SIZE + in[i];
    }

    return result;
}

There are still gaps in that, to be sure, but I estimate them to constitute a reasonably small proportion of the overall range of index values used (and arranging for that to be so is the reason for taking a little-endian interpretation). I leave the reverse transformation as an exercise :).

John Bollinger
  • This is basically the code that I came up with while thinking about the problem, too (though I didn't write it down yet). I wonder if there is any faster way to compute the conversions... Note that in my use case, I do not actually store the encoded positions. The position code is used as an index into the tablebase (basically a very large, potentially Huffman encoded array) which contains information about each position. Thus the most important property I need is having as few invalid codes as possible, as each invalid code is wasted space. – fuz Sep 21 '16 at 19:31
  • @FUZxxl, I don't think the conversion you asked about can be done significantly more efficiently. The conversion you *did not* ask about may be of interest, however -- I refer to converting the encoded output to an index into your tablebase. I have added some commentary about that. Perhaps you already reached a similar conclusion, but perhaps the update will give you something new to think about. – John Bollinger Sep 21 '16 at 20:11
  • To your update: I have thought about that. Point is, using the compressed representation discussed in the question reduces the table base size by a factor of up to 24, so that's definitely a very useful piece. For the rest, I do have efficient means but they are not part of this question and I do not like to discuss this here. – fuz Sep 21 '16 at 20:27
  • To further reduce the size of the encoded compact form, you could use arithmetic encoding with a decreasing base for each position. If the number of pieces is close to the number of positions, this would greatly reduce the resulting table size: `for (result = 0, i = 0; i < num_elements; i++) { result = result * (BOARD_SIZE - i) + compact[i]; }` – chqrlie Sep 21 '16 at 22:33

In this answer, I want to show some of my own ideas for implementing the conversions as well as some benchmarking results.

You can find the code on Github. These are the results on my main machine:

algorithm   ------ total  time ------  ---------- per  call -----------
            decoding encoding total    decoding   encoding   total
baseline    0.0391s  0.0312s  0.0703s    3.9062ns   3.1250ns   7.0312ns
count       1.5312s  1.4453s  2.9766s  153.1250ns 144.5312ns 297.6562ns
bitcount    1.5078s  0.0703s  1.5781s  150.7812ns   7.0312ns 157.8125ns
decrement   2.1875s  1.7969s  3.9844s  218.7500ns 179.6875ns 398.4375ns
bin4        2.1562s  1.7734s  3.9297s  215.6250ns 177.3438ns 392.9688ns
bin5        2.0703s  1.8281s  3.8984s  207.0312ns 182.8125ns 389.8438ns
bin8        2.0547s  1.8672s  3.9219s  205.4688ns 186.7188ns 392.1875ns
vector      0.3594s  0.2891s  0.6484s   35.9375ns  28.9062ns  64.8438ns
shuffle     0.1328s  0.3438s  0.4766s   13.2812ns  34.3750ns  47.6562ns
tree        2.0781s  1.7734s  3.8516s  207.8125ns 177.3438ns 385.1562ns
treeasm     1.4297s  0.7422s  2.1719s  142.9688ns  74.2188ns 217.1875ns
bmi2        0.0938s  0.0703s  0.1641s    9.3750ns   7.0312ns  16.4062ns

Implementations

  • baseline is an implementation that does nothing except reading the input. Its purpose is to measure function call and memory access overhead.
  • count is a “naïve” implementation that stores an occupancy map indicating which squares have pieces on them already.
  • bitcount is the same thing but with the occupancy map stored as a bitmap. __builtin_popcount is used for encoding, speeding things up considerably. If one uses a hand-written popcount instead, bitcount is still the fastest portable implementation of encoding.
  • decrement is the second naïve implementation. It stores the encoding for each square of the board; after adding a piece, all square numbers to the right are decremented.
  • bin4, bin5, and bin8 use binning with bins sized 4, 5, and 8 entries as suggested by user3386109.
  • shuffle computes a slightly different encoding based on the Fisher-Yates shuffle. It works by reconstructing the random values that would have gone into a shuffle generating the permutation we want to encode. The code is branchless and fast, in particular when decoding.
  • vector uses a vector of five bit numbers as suggested by chqrlie.
  • tree uses a difference tree, a data structure I made up. It's a full binary tree of depth ⌈log₂ n⌉ where the leaves represent the squares and the inner nodes on the path to each leaf sum to the code of that square (only the nodes where you go right are added). The square numbers themselves are not stored, leading to n − 1 words of extra memory.

    With this data structure, we can compute the code for each square in ⌈log₂ n⌉ − 1 steps and mark a square as occupied in the same number of steps. The inner loop is very simple, comprising a branch and two actions depending on whether you descend to the left or to the right. On ARM, this branch compiles to a few conditional instructions, leading to a very fast implementation. On x86, neither gcc nor clang is smart enough to get rid of the branches.

  • treeasm is a variant of tree that uses inline assembly to implement the inner loop of tree without branches by carefully manipulating the carry flag.
  • bmi2 uses the pdep and pext instructions from the BMI2 instruction set to implement the algorithm in a very fast manner.
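One possible sketch of such a difference tree, with heap-numbered inner nodes each storing the number of free leaves in its left subtree and the board padded to 16 leaves (the details of the actual implementation may differ):

```c
#define DEPTH   4                /* 2^4 = 16 leaves cover 9 squares */
#define NLEAVES (1 << DEPTH)

/* node[k] = number of free leaves in the left subtree of inner node k
 * (heap numbering: root is 1, children of node k are 2k and 2k+1) */
static int node[1 << DEPTH];

static void tree_init(void) {
    for (int d = 0; d < DEPTH; d++)           /* inner nodes at depth d */
        for (int k = 1 << d; k < 1 << (d + 1); k++)
            node[k] = NLEAVES >> (d + 1);     /* half the leaves below k */
}

/* code of a square = number of free squares with a lower index */
static int tree_code(int sq) {
    int code = 0, k = 1;
    for (int bit = DEPTH - 1; bit >= 0; bit--) {
        if (sq & (1 << bit)) {
            code += node[k];                  /* going right: add left count */
            k = 2 * k + 1;
        } else {
            k = 2 * k;
        }
    }
    return code;
}

/* mark a square as occupied: one fewer free leaf in every left
 * subtree that contains it */
static void tree_mark(int sq) {
    int k = 1;
    for (int bit = DEPTH - 1; bit >= 0; bit--) {
        if (sq & (1 << bit)) {
            k = 2 * k + 1;
        } else {
            node[k]--;
            k = 2 * k;
        }
    }
}
```

Encoding visits one node per level for each piece; decoding can descend the same tree, going right whenever the remaining code is at least the stored left count.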

For my actual project, I'm probably going to use the shuffle implementation since it is the fastest one that does not depend on any unportable extensions (such as Intel intrinsics) or implementation details (such as the availability of 128 bit integers).
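To illustrate the idea behind the shuffle encoding, here is a minimal Myrvold-Ruskey-style ranking: it undoes a Fisher-Yates shuffle and records the reconstructed draws as mixed-radix digits. This sketch ranks a full permutation only; the actual implementation also handles partial placements of pieces:

```c
/* Rank a permutation p of {0, ..., n-1} into a single integer in
 * 0 .. n!-1 by undoing a Fisher-Yates shuffle: at each step, the
 * value sitting in the last slot is the "random" draw the shuffle
 * must have made.  q holds the inverse permutation; both arrays
 * are destroyed in the process. */
unsigned long long rank_perm(int n, int p[], int q[]) {
    unsigned long long r = 0, mult = 1;
    for (int i = n; i >= 2; i--) {
        int s = p[i - 1];           /* reconstructed shuffle draw   */
        int t = q[i - 1];           /* where value i-1 currently is */
        p[i - 1] = p[t];            /* swap p[i-1] <-> p[q[i-1]]    */
        p[t] = s;
        q[s] = t;                   /* swap q[s] <-> q[i-1]         */
        q[i - 1] = i - 1;
        r += mult * s;              /* digit s has radix i          */
        mult *= i;
    }
    return r;
}
```

Decoding runs the same loop forwards, extracting each draw with a remainder and a division and replaying the swaps; neither direction needs a data-dependent branch.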

fuz
  • @facetus I have not! Though as far as I'm concerned, the Beneš network does not perform a 1:1 mapping of permutations to integers in the range (0, n! - 1) so it's not really useful for my purpose. Feel free to add an answer about Beneš networks though; it's certainly going to be interesting to a lot of people! – fuz Sep 03 '20 at 21:48
  • Yes, it uses (n lg(n) - n + 1) bits, instead of lg(n!). I wonder if it has any speed benefits as a compromise. I am working on it. Will be glad to share the results. – facetus Sep 04 '20 at 04:20
  • @facetus The main problem is that I build a table with one entry per permutation for my use case. So if the encoding uses more bits than needed, I waste considerable amounts of space. – fuz Sep 04 '20 at 07:47
  • Sure, no problem. I won't share anything then. – facetus Sep 04 '20 at 11:48
  • @facetus It might still be interesting to other readers stumbling upon this question. Do as you like. – fuz Sep 04 '20 at 13:57
  • Relax, I was just pulling your lag. I am not doing experiments for you in the first place, and of course I don't need a permission to do whatever I want. – facetus Sep 04 '20 at 22:51

To convert from naive to compact position, you can iterate over the n-tuple and perform these steps for each position p:

  1. optionally check that position p is available
  2. set position p as busy
  3. subtract from p the number of lower positions that are busy
  4. store the result into the destination n-tuple

You can do this by maintaining an array of n bits for the occupancy state:

  • steps 1, 2 and 4 are computed in constant time
  • step 3 can be computed efficiently if the array is small, i.e. at most 64 bits.

Here is an implementation:

#include <stdio.h>
#include <stdlib.h>

/* version for up to 9 positions */
#define BC9(n)  ((((n)>>0)&1) + (((n)>>1)&1) + (((n)>>2)&1) + \
                 (((n)>>3)&1) + (((n)>>4)&1) + (((n)>>5)&1) + \
                 (((n)>>6)&1) + (((n)>>7)&1) + (((n)>>8)&1))
#define x4(m,n)    m(n), m((n)+1), m((n)+2), m((n)+3)
#define x16(m,n)   x4(m,n), x4(m,(n)+4), x4(m,(n)+8), x4(m,(n)+12)
#define x64(m,n)   x16(m,n), x16(m,(n)+16), x16(m,(n)+32), x16(m,(n)+48)
#define x256(m,n)  x64(m,n), x64(m,(n)+64), x64(m,(n)+128), x64(m,(n)+192)

static int const bc512[1 << 9] = {
    x256(BC9, 0),
    x256(BC9, 256),
};

int encode9(int dest[], int src[], int n) {
    unsigned int busy = 0;
    for (int i = 0; i < n; i++) {
        int p = src[i];
        unsigned int bit = 1 << p;
        //if (busy & bit) return 1;  // optional validity check
        busy |= bit;
        dest[i] = p - bc512[busy & (bit - 1)];
    }
    return 0;
}

/* version for up to 64 positions */
static inline int bitcount64(unsigned long long m) {
    m = m - ((m >> 1) & 0x5555555555555555);
    m = (m & 0x3333333333333333) + ((m >> 2) & 0x3333333333333333);
    m = (m + (m >> 4)) & 0x0f0f0f0f0f0f0f0f;
    m = m + (m >> 8);
    m = m + (m >> 16);
    m = m + (m >> 16 >> 16);
    return m & 0x3f;
}

int encode64(int dest[], int src[], int n) {
    unsigned long long busy = 0;
    for (int i = 0; i < n; i++) {
        int p = src[i];
        unsigned long long bit = 1ULL << p;
        //if (busy & bit) return 1;  // optional validity check
        busy |= bit;
        dest[i] = p - bitcount64(busy & (bit - 1));
    }
    return 0;
}

int main(int argc, char *argv[]) {
    int src[argc], dest[argc];
    int cur, max = 0, n = argc - 1;

    for (int i = 0; i < n; i++) {
        src[i] = cur = atoi(argv[i + 1]);
        if (max < cur)
            max = cur;
    }
    if (max < 9) {
        encode9(dest, src, n);
    } else if (max < 64) {
        encode64(dest, src, n);
    } else {
        return 1;  /* position out of range for encode64 */
    }
    for (int i = 0; i < n; i++) {
        printf("%d ", dest[i]);
    }
    printf("\n");
    return 0;
}

The core optimisation is in the implementation of bitcount(), which you can tailor to your needs by specializing it for the actual number of positions. The code above gives efficient solutions for small boards with up to 9 positions and large ones with up to 64, but you can craft an even more efficient version for, say, 12 or 32 positions.

In terms of time complexity, the general case is still O(n²), but for small values of n it effectively runs in O(n log n) or better, since the parallel implementation of bitcount() reduces to log(n) steps or fewer for n up to 64.

You can look at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive for inspiration and amazement.

Unfortunately, I'm still looking for ways to use this or a similar trick for decoding...
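For reference, the straightforward decoding simply walks the free squares; this is a naïve sketch of the inverse mapping, not an optimized counterpart:

```c
/* Decode a compact tuple back to square numbers: src[i] selects
 * the (src[i]+1)-th square that is still free, scanning upwards
 * from square 0. */
void decode_naive(int dest[], const int src[], int n) {
    unsigned int busy = 0;               /* bitmap of occupied squares */
    for (int i = 0; i < n; i++) {
        int skip = src[i], sq = 0;
        for (;; sq++)
            if (!(busy & (1u << sq)) && skip-- == 0)
                break;                   /* found the wanted free square */
        busy |= 1u << sq;
        dest[i] = sq;
    }
}
```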

chqrlie
  • Having benchmarked various approaches over the weekend I can confirm that this is the fastest approach for encoding. Decoding is indeed tricky and I haven't found a faster approach than the naïve one that doesn't require inline assembly (needed for an approach using a difference tree since gcc isn't smart enough to make use of the carry flag for branch-less code) or fancy intrinsics (`pdep` and `pext` blow this problem away). – fuz Sep 26 '16 at 13:18
  • Interestingly, a difference-tree based approach is very fast for decoding on ARM since it has conditional execution and can thus avoid branches in the critical path. – fuz Sep 26 '16 at 13:22
  • After further testing, the vector-approach works fine, too. – fuz Sep 26 '16 at 14:32

To go from (5, 2, 3, 0, 7, 4) to (5, 2, 2, 0, 3, 1) you just have to:

  • start with (5, 2, 3, 0, 7, 4) and push 5 into the result: (5)
  • take 2, count the preceding values less than 2 (there are 0), then push 2 − 0: (5, 2)
  • take 3, count the preceding values less than 3 (1), then push 3 − 1: (5, 2, 2)
  • take 0, count the preceding values less than 0 (0), then push 0 − 0: (5, 2, 2, 0)
  • take 7, count the preceding values less than 7 (4), then push 7 − 4: (5, 2, 2, 0, 3)
  • take 4, count the preceding values less than 4 (3), then push 4 − 3: (5, 2, 2, 0, 3, 1)
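The steps above can be sketched directly as a minimal O(n²) version of this procedure:

```c
/* Naïve O(n^2) encoding: subtract from each square number the count
 * of earlier pieces standing on lower-numbered squares. */
void encode_naive(int dest[], const int src[], int n) {
    for (int i = 0; i < n; i++) {
        int smaller = 0;
        for (int j = 0; j < i; j++)
            if (src[j] < src[i])
                smaller++;
        dest[i] = src[i] - smaller;
    }
}
```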
Jean-Baptiste Yunès
  • This approach is the obvious one, but it is pretty slow as you have to count values at every step. Do you have an idea how to implement this conversion any faster? – fuz Sep 21 '16 at 19:21
  • You may use some obvious shortcuts by retaining the last greatest value and the last smallest one... Maybe some tree structure, but is it worth the effort? – Jean-Baptiste Yunès Sep 21 '16 at 19:43
  • 1
    That's why I'm asking. – fuz Sep 21 '16 at 19:45
  • Yes, a binary tree. At each node you know the value, the number of values less than it, and the number of values greater than it, but constructing and descending the tree may not be very interesting compared to the naive algorithm... – Jean-Baptiste Yunès Sep 21 '16 at 19:48
  • Such a tree is a solution, but I am actually not convinced that it is worth the effort for 12 pieces... There is too much overhead in building a tree. – Jean-Baptiste Yunès Sep 21 '16 at 19:53
  • 1
    I have implemented a solution using a difference tree; it is quite fast but there are even faster solutions. – fuz Sep 26 '16 at 14:28