6

I'm trying to implement in C a data structure that would allow me to efficiently manipulate a **binary** matrix (containing only 1s and 0s). I'll explain what operations I have to apply to this matrix, and I'd like to know what the best possible data structure to use is.

The operations are done in the field F_2 (which means 1+1 = 0; the other operations remain unchanged). I have one k*n matrix (k < n) called H. At most, k = 2325 and n = 3009.

The operations that I will have to do over this matrix are:

I will partially diagonalize it using only row swaps and row additions. Once that is done, I will not use row operations anymore, and will perform a lot (!) of column additions over this matrix (by "a lot" I mean about ((n-k)/2)³ column additions).

The data structure I was considering for the matrix:

For the matrix coefficients, I was thinking about storing sequences of multiple bits at once in one single unsigned integer. For instance, I could store the sequence (11001011) as the uint8_t 203 (converting from binary to decimal).

  • Is this a good idea?

If I do so, I have two options:

I could use uint16_t or uint64_t coefficients to split my matrix H into many 4×4 or 8×8 submatrices.

  • Is this a good option (in terms of time efficiency), and if it is, is it better to use uint16_t or uint64_t?

Otherwise, I was thinking about storing every row in multiple uint32_t or uint64_t values, then performing my partial diagonalization. Next, I would switch to a structure that encodes the matrix as n column vectors to handle the remaining operations.

  • Do you think this is more efficient?

Whatever method I use, I will have to efficiently access the n-th bit of an unsigned integer (uint16_t, uint32_t, or uint64_t). How do I do that?

Celerio
  • Could you please explain the role of your binomial coefficients? Is the final result an F_2 matrix? – pablo1977 Aug 02 '14 at 20:07
  • I've edited to add some clarity. The binomial coefficients I was talking about were an indication of time complexity and had nothing to do with the data structure. There's no "final result" in those terms; I'm just looking for the best way to store a matrix containing only `1` and `0`, given the operations that I will need to apply to it later. – Celerio Aug 03 '14 at 10:54

3 Answers

6

For best performance, use an array of row pointers for the row swaps and row additions. Use <stdint.h>, and a fast unsigned integer type of minimum supported word size -- I recommend uint_fast32_t, unless you intend to run this on 16- or 8-bit processors.

When all the row swaps and row additions are done, transpose the array. Although this operation is "slow", the following column operations will be so fast that they more than offset the transpose cost.

Consider the following:

#include <stdint.h>
#include <limits.h>

typedef uint_fast32_t  word;
#define WORD_BITS      (CHAR_BIT * sizeof (word))

typedef struct {
    int    rows;  /* Number of row vectors */
    int    cols;  /* Number of defined bits in each row */
    int    words; /* Number of words per row vector */
    word **row;   /* Array of pointers */
} row_matrix;

typedef struct {
    int    rows;  /* Number of defined bits in each column */
    int    cols;  /* Number of col vectors */
    int    words; /* Number of words per col vector */
    word **col;
} col_matrix;

Although you could use a single type to describe the two matrix forms, using separate types makes the code and functions easier to maintain. You'll end up having some duplicated code, but that's a minor issue compared to having clear, intuitive types.

On 32-bit systems, uint_fast32_t is typically a 32-bit type. On 64-bit systems, it is typically 64-bit. The WORD_BITS macro expands to the number of bits in a word -- it is NOT always 32!
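
As a starting point, allocating a row_matrix might look like the following minimal sketch (row_matrix_new is a hypothetical helper name, not part of this answer; per-row allocation error handling is omitted for brevity):

#include <stdlib.h>

row_matrix *row_matrix_new(int rows, int cols)
{
    row_matrix *rm = malloc(sizeof *rm);
    if (!rm)
        return NULL;
    rm->rows  = rows;
    rm->cols  = cols;
    rm->words = (int)((cols + WORD_BITS - 1) / WORD_BITS); /* round up to full words */
    rm->row   = malloc((size_t)rows * sizeof rm->row[0]);
    if (!rm->row) {
        free(rm);
        return NULL;
    }
    for (int r = 0; r < rows; r++)
        rm->row[r] = calloc((size_t)rm->words, sizeof (word)); /* all bits cleared */
    return rm;
}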

The easiest way to number the bits is to designate the leftmost bit in the matrix as bit 0, and store the bits in the least significant bits in each word. If you have row_matrix *rm, then the bit at row row, column col is

!!(rm->row[row][col / WORD_BITS] & ((word)1U << (col % WORD_BITS)))

The !! is the not-not operator: if the argument is nonzero, it yields 1, otherwise it yields 0. Because we mask a single bit from the word, the "bit is set" value would otherwise be some power of two (1, 2, 4, 8, 16, 32, 64, and so on).

To set the bit, use

rm->row[row][col / WORD_BITS] |= (word)1U << (col % WORD_BITS);

To clear a bit, you binary-AND with a mask that has all bits except the target bit set. This is easy to construct using the bitwise NOT operator ~:

rm->row[row][col / WORD_BITS] &= ~((word)1U << (col % WORD_BITS));

The corresponding operations for the col_matrix *cm are

!!(cm->col[col][row / WORD_BITS] & ((word)1U << (row % WORD_BITS)))
cm->col[col][row / WORD_BITS] |= (word)1U << (row % WORD_BITS);
cm->col[col][row / WORD_BITS] &= ~((word)1U << (row % WORD_BITS));

Although the division / and modulus (or remainder) % are generally slow-ish (compared to addition, subtraction, and even multiplication), here WORD_BITS will be a power-of-two compile-time constant on all widely used architectures. All compilers I know of will turn the above into fast bit shifts and binary-AND operations.
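
If you'd rather not repeat that index arithmetic everywhere, you can wrap it in a pair of convenience macros (hypothetical names, my addition):

#define BIT_WORD(bit)  ((bit) / WORD_BITS)
#define BIT_MASK(bit)  ((word)1U << ((bit) % WORD_BITS))

so that, for example, setting a bit becomes rm->row[row][BIT_WORD(col)] |= BIT_MASK(col);.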

To add row srcrow to row dstrow, you simply do a binary exclusive-or on all the words:

{
    const word *const src = rm->row[srcrow];
    word *const       dst = rm->row[dstrow];
    int               w = rm->words;
    while (w-->0)
        dst[w] ^= src[w];
}

and analogously, for the column matrix,

{
    const word *const src = cm->col[srccol];
    word *const       dst = cm->col[dstcol];
    int               w = cm->words;
    while (w-->0)
        dst[w] ^= src[w];
}
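
A row swap, by the way, is where the array-of-pointers layout pays off: you exchange two pointers instead of moving any data. A minimal sketch:

{
    word *const tmp = rm->row[row1];
    rm->row[row1] = rm->row[row2];
    rm->row[row2] = tmp;
}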

Note that if you combine more than two rows at a time, you can do so very efficiently; it'll be MUCH faster than doing the additions consecutively. Intel and AMD CPUs are very good at predicting this XOR pattern, so you can just use more than one source row/column, as sketched below. Also, the destination does not have to participate in the result, although if I've guessed correctly what algorithm you're implementing, you'll want it to.
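
For example, a sketch combining two source rows into a destination in a single pass over memory, following the same pattern as above:

{
    const word *const src1 = rm->row[srcrow1];
    const word *const src2 = rm->row[srcrow2];
    word *const       dst  = rm->row[dstrow];
    int               w = rm->words;
    while (w-->0)
        dst[w] ^= src1[w] ^ src2[w];
}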

If you know the target architectures have SSE2 or better, or even AVX, you can use the emmintrin.h or immintrin.h header files, respectively, for compiler built-in types and operators that allow you to XOR 128 bits and 256 bits, respectively, at once; sometimes giving you quite a bit of a boost.

Since the vector types require what the C standards call "excess alignment", you'll then also need to include mm_malloc.h, and use _mm_malloc() and _mm_free() to allocate the row/column vectors for the word data -- and obviously round words up so you can access the row/column as a suitable integer word type (__m128i for SSE*, __m256i for AVX).
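
An SSE2 variant of the row addition might then look like this sketch (assuming the row data was allocated with _mm_malloc(..., 16) and padded to a multiple of 128 bits; the chunk count vecs is a hypothetical field, not part of the structs above):

#include <emmintrin.h>  /* SSE2 intrinsics */

{
    __m128i       *dst = (__m128i *)rm->row[dstrow];
    const __m128i *src = (const __m128i *)rm->row[srcrow];
    int            v = vecs;  /* hypothetical: number of 128-bit chunks per row */
    while (v-->0)
        dst[v] = _mm_xor_si128(dst[v], src[v]);
}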

Personally, I always implement the unvectorized version first, then some "nasty" test cases for unit testing, and only then see about vectorizing it. This has the benefit that you can give the unvectorized version as a preliminary version for those who will be using it, and you can compare the test case results between the vectorized and unvectorized cases, to see if one or the other has a bug in it.

The transpose operation is quite straightforward, although I do recommend using a triple loop: innermost looping over the bits in a single word (see the sketch below). Also, you'll probably want to check which order -- row or column major -- works best for the outer loop; depending on the matrix size, you may see a huge difference. (This is due to cache behaviour: you want the CPU to be able to predict the access pattern, and not have to reload the same cachelines. In the best case, on at-most-few-years-old AMD and Intel x86-64 processors, you can get near cache speeds if both matrices fit in the cache.)
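
A minimal unvectorized sketch of that triple loop, assuming cm has already been allocated with all bits cleared:

{
    int row, w, b;
    for (row = 0; row < rm->rows; row++)
        for (w = 0; w < rm->words; w++) {
            word bits = rm->row[row][w]; /* WORD_BITS columns at a time */
            for (b = 0; bits != 0; b++, bits >>= 1)
                if (bits & 1U) {
                    const int col = w * (int)WORD_BITS + b;
                    cm->col[col][row / WORD_BITS] |= (word)1U << (row % WORD_BITS);
                }
        }
}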

All of the above can be implemented in a single header file -- even including the vectorized versions if the target architecture supports SSE2/AVX --, so it should not be too difficult to implement.

Questions?

Nominal Animal
  • Now that's what I call an answer! Thanks a lot! Most of what you show is pretty close to what I did yesterday. I'll use your post to correct some mistakes I made. I'll switch to 2 data structures and will use `uint_fast32_t`. About the SSE2/AVX part, you lost me there, but it doesn't matter much; I'll implement unvectorized first. Plus I don't know if the target computer supports SSE2/AVX. For the questions, I have two simple ones: Is `uint_fast32_t` always gonna be the best type, independent of processor architecture? Why use 1U << a instead of just 1 << a? What's the difference? – Celerio Aug 05 '14 at 08:37
  • 1
    @Celerio: `uint_fast32_t` is defined as "usually fastest 32-bit or larger unsigned integer type", so yes, it should be the fastest type to use for this. `1U` is unsigned int 1, whereas `1` is int 1. `(word)1` and `(word)1U` refer to the exact same value -- `1`, of `word` type --, so the only difference is style: I like to keep the signedness/size suffix (`U`, `L`, `UL`, `LL`, `ULL`) in the literal number whenever the literal is anything other than `int` type. – Nominal Animal Aug 05 '14 at 12:41
  • Very nice answer indeed! But I got curious, what algorithm are you implementing, @Celerio. – Tarc Aug 05 '14 at 14:33
  • 1
    @Tarc One year later I realize that I wasn't even polite enough to answer you. It was the BJMM algorithm for attacking the McEliece cryptosystem. I had to do this for my master thesis in mathematics. – Celerio Aug 20 '15 at 15:30
  • Thank you for the answer, @Celerio. – Tarc Aug 20 '15 at 16:00
  • For fast bit-matrix transpose, use SSE2 / AVX2 `pmovmskb` to grab bits from each byte. https://stackoverflow.com/questions/31742483/how-would-you-transpose-a-binary-matrix – Peter Cordes Jul 30 '17 at 23:29
3

When you refer to the types uint16_t, uint64_t, ... I guess this is for the matrix coefficients. Therefore you should know better than us which values you are manipulating: if you can potentially generate huge numbers, then you want a large type to avoid overflow. About efficiency, I doubt you will feel a difference in terms of speed, but you can save some space by choosing a smaller type.

Anyway, this is all about optimization: you should not bother with special types. To begin with, using char (or uint8_t) should be just fine because you are only dealing with 1 and 0.

I don't see much benefit in switching from char matrix[][] to a typedef struct matrix_columns; I think you can perform your operations by using the row and column indexes wisely.

Finally, to get the bit at position i in an unsigned int coef:

unsigned int bit = (coef>>i) & 1;
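
Setting or flipping that bit is the straightforward counterpart (my addition, not part of the original answer):

coef |= 1u << i;   /* set bit i  */
coef ^= 1u << i;   /* flip bit i */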

[EDIT] Also, here is a related question about binary matrix operations (multiplication, addition, XOR).

n0p
  • Yes, I'm talking about the matrix coefficients. These coefficients are `1` or `0`, so they are not huge numbers. But I was not clear in my question; I edited it to make this clearer. And doesn't the choice of `uint8_t`, `uint32_t`, or `uint64_t` change the speed at which the processor applies XOR? – Celerio Jul 31 '14 at 10:10
  • I think it depends on your processor architecture: how many bits it can process in one tick. If speed is a crucial issue for you, you can perform a benchmark: try several runs with different types for coefficients. – n0p Jul 31 '14 at 11:52
  • 2
    +1 for making your question clearer: what you want to do (storing several bits in an `unsigned int`) is actually a `bitfield`. It may be a nice optimization in your case but it is tricky and hardly readable/maintainable. You want to manage a matrix -> implement a matrix: `char matrix[][]`. Optimization comes later, once you have finished and profiled your program ;) – n0p Jul 31 '14 at 12:10
  • I've just unchecked your answer because I would like to have some other opinions. I've done some research on bitfields and tried a few pieces of code to compare the ideas exposed, and I still feel like some of the ideas in my post could work better than a char matrix[][]. By the way, I already wrote the program where I need this data structure. :) – Celerio Aug 02 '14 at 10:10
2

I guess storing every row in multiple uintX_t's is a good idea, but I would choose X to match the word size of your processor. In this way you can sum multiple matrix entries at once:

uint8_t a = 3; // 00000011
uint8_t b = 2; // 00000010
b ^= a;        // b is now 00000001:
               // XOR is just addition in F_2

I think the bottleneck of your problem is the column operations, right? So after you perform the partial diagonalization, why not operate the column additions on its transpose, just the same way you did the partial diagonalization? You would spend some time making this transposition, for it has to be done bit by bit, but after that, column additions would be much simpler.

Suppose you have a 64x64 matrix:

uint64_t matrix[64]; // This means that matrix[i] is the i-th row

How would you sum (mod 2) the j-th column into the l-th, for j and l between 1 and 64? You would have to iterate over all the rows (assuming higher positions in a matrix row are at less significant bits):

int i;
for (i = 0; i < 64; i++) {
    uint64_t bit = (matrix[i] >> (64 - j)) & (uint64_t)1; /* extract row i's bit in column j */
    matrix[i] ^= bit << (64 - l);                         /* XOR (add mod 2) it into column l */
}

But if you had the matrix transposed into a matrixT, the above for loop would be equivalent to:

matrixT[l] ^= matrixT[j];

which would be done in one single step by the processor (I guess, but I'm not sure). So if you spend the time doing the transposition, after that you benefit from the processor's ability to do arithmetic operations over data of its word size in one step.

Tarc
  • Thank you for confirming my intuition. I'm pretty sure the processor uses one single step to apply XOR on two `uint64_t` values (mine is a 64-bit processor). I will try to do so and compare with the execution time needed over a matrix of char as suggested by coconop. Rather than transposing, I was thinking about inverting the way the coefficients describe the matrix (column-wise or row-wise), but I guess both methods are equivalent. – Celerio Aug 04 '14 at 09:27
  • Well, maybe the initial partial diagonalization could be done as in the for loop I described in my answer, and you just use the C matrix to store columns instead of rows. I guess such quasi-diagonalization is not worth the trouble of getting the matrix transposed, right? – Tarc Aug 04 '14 at 14:19
  • I mean, if you invert the way the matrix coefficients are stored and choose to store them column-wise, then the initial diagonalization is the one which will be penalized. – Tarc Aug 04 '14 at 15:55