I found this question answered very well for a rectangular matrix (n=2) here:
What is the fastest way to transpose a matrix in C++?
But my question is how to do the same in the more general case. Let B be a transpose of A such that
B[i1][i2][J][i4][K][i6][i7] = A[i1][i2][K][i4][J][i6][i7]
So in this particular case of n=7, the transpose swaps the 3rd and 5th indices, marked J and K. I assume that the whole data structure lives in one contiguous memory block (float*). The brackets above are just symbolic notation for the transform operation.
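To make the intended operation concrete, here is a naive reference sketch of it, assuming row-major storage. Because the indices before, between, and after the two swapped axes keep their order, the n=7 array can be viewed as five-dimensional: `outer` (i1 and i2 combined), the 3rd axis, `mid` (i4), the 5th axis, and `inner` (i6 and i7 combined). The function name `swap_axes` and the parameter names `outer`, `d3`, `mid`, `d5`, `inner` are made up for this illustration. This is only the baseline I would like to beat, not a tuned version.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from any library): out-of-place swap of two axes.
// A is viewed as a row-major array A[outer][d3][mid][d5][inner],
// B is written as a row-major array B[outer][d5][mid][d3][inner], with
//   B[o][k][m][j][x] = A[o][j][m][k][x].
// For the n=7 example: outer = n1*n2, d3 = n3, mid = n4, d5 = n5, inner = n6*n7
// (n1..n7 are the extents of A's seven dimensions).
void swap_axes(const float* A, float* B,
               std::size_t outer, std::size_t d3, std::size_t mid,
               std::size_t d5, std::size_t inner)
{
    assert(A != B && "this sketch does not work in place");
    for (std::size_t o = 0; o < outer; ++o)
        for (std::size_t k = 0; k < d5; ++k)
            for (std::size_t m = 0; m < mid; ++m)
                for (std::size_t j = 0; j < d3; ++j) {
                    // Source block A[o][j][m][k][0..inner)
                    const float* src =
                        A + (((o * d3 + j) * mid + m) * d5 + k) * inner;
                    // Destination block B[o][k][m][j][0..inner)
                    float* dst =
                        B + (((o * d5 + k) * mid + m) * d3 + j) * inner;
                    // The trailing indices are contiguous in both arrays,
                    // so they can be copied as one block.
                    std::copy(src, src + inner, dst);
                }
}
```

With this loop order the writes to B are sequential while the reads from A are strided. When `inner` is 1 (that is, when the last index is one of the swapped ones) the copy degenerates to single elements, which is the case where I expect the cache misses to hurt most.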
I am going to deal with higher dimensionality (possibly n=7) and a lot of data: some dimensions have a small extent (about 3-5) and some are relatively large (about 1000).
Is there a way to make a really fast algorithm that avoids cache misses or, even better, takes advantage of SSE (or AVX) intrinsics, just like in the question mentioned above?