[Disclaimer -- this is only a partial answer and a work in progress and answers a related problem, while only hinting at a solution to the actual question]
Before thinking about algorithms and code it is useful to understand the mathematical character of your problem. If we look at the output of your pseudocode in Python (and note that this includes the diagonal entries where the original question does not), we see this for the 5x5x5 case:
N = 5
x0 = np.zeros((N,N,N), dtype=np.int)
idx = 1
for i in range(0,N):
for j in range(i,N):
for k in range(j,N):
x0[i,j,k] = idx
idx += 1
print(x0)
we get:
[[[ 1 2 3 4 5]
[ 0 6 7 8 9]
[ 0 0 10 11 12]
[ 0 0 0 13 14]
[ 0 0 0 0 15]]
[[ 0 0 0 0 0]
[ 0 16 17 18 19]
[ 0 0 20 21 22]
[ 0 0 0 23 24]
[ 0 0 0 0 25]]
[[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 26 27 28]
[ 0 0 0 29 30]
[ 0 0 0 0 31]]
[[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 32 33]
[ 0 0 0 0 34]]
[[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 35]]]
i.e. the unique entries form a series of stacked upper triangular matrices of decreasing sizes. As identified in comments, the number of non-zero entries is a tetrahedral number, in this case for n = 5, the tetrahedral number Tr[5] = 5*(5+1)*(5+2)/6 = 35 entries, and the non zero entries fill a tetrahedral shaped region of the hypermatrix in three dimensions (best illustration here) And as noted in the original question, all the permutations of indices are functionally identical in the problem, meaning that there are six (3P3) functionally identical symmetric tetrahedral regions in the cubic hypermatrix. You can confirm this yourself:
x1 = np.zeros((N,N,N), dtype=np.int)
idx = 1
for i in range(0,N):
for j in range(0,N):
for k in range(0,N):
if (i <= j) and (j <= k):
x1[i,j,k] = idx
x1[i,k,j] = idx
x1[j,i,k] = idx
x1[j,k,i] = idx
x1[k,i,j] = idx
x1[k,j,i] = idx
idx += 1
print(x1)
which gives:
[[[ 1 2 3 4 5]
[ 2 6 7 8 9]
[ 3 7 10 11 12]
[ 4 8 11 13 14]
[ 5 9 12 14 15]]
[[ 2 6 7 8 9]
[ 6 16 17 18 19]
[ 7 17 20 21 22]
[ 8 18 21 23 24]
[ 9 19 22 24 25]]
[[ 3 7 10 11 12]
[ 7 17 20 21 22]
[10 20 26 27 28]
[11 21 27 29 30]
[12 22 28 30 31]]
[[ 4 8 11 13 14]
[ 8 18 21 23 24]
[11 21 27 29 30]
[13 23 29 32 33]
[14 24 30 33 34]]
[[ 5 9 12 14 15]
[ 9 19 22 24 25]
[12 22 28 30 31]
[14 24 30 33 34]
[15 25 31 34 35]]]
Here it should be obvious that you can slice the hypermatrix along any plane and get a symmetric matrix, and that it can be constructed by a set of reflections from any of the six permutations of the same basic tetrahedral hypermatrix.
That last part is important because I am now going to focus on another permutation from the one in your question. It is functionally the same (as shown above) but mathematically and graphically easier to visualize compared to the upper tetrahedron calculated by the original pseudocode in the question. Again some Python:
N = 5
nmax = N * (N+1) * (N+2) // 6
x= np.empty(nmax, dtype=object)
x2 = np.zeros((N,N,N), dtype=np.int)
idx = 1
for i in range(0,N):
for j in range(0,i+1):
for k in range(0,j+1):
x2[i,j,k] = idx
x[idx-1] = (i,j,k)
idx +=1
print(x)
print(x2)
which produces
[(0, 0, 0) (1, 0, 0) (1, 1, 0) (1, 1, 1) (2, 0, 0) (2, 1, 0) (2, 1, 1)
(2, 2, 0) (2, 2, 1) (2, 2, 2) (3, 0, 0) (3, 1, 0) (3, 1, 1) (3, 2, 0)
(3, 2, 1) (3, 2, 2) (3, 3, 0) (3, 3, 1) (3, 3, 2) (3, 3, 3) (4, 0, 0)
(4, 1, 0) (4, 1, 1) (4, 2, 0) (4, 2, 1) (4, 2, 2) (4, 3, 0) (4, 3, 1)
(4, 3, 2) (4, 3, 3) (4, 4, 0) (4, 4, 1) (4, 4, 2) (4, 4, 3) (4, 4, 4)]
[[[ 1 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]]
[[ 2 0 0 0 0]
[ 3 4 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]]
[[ 5 0 0 0 0]
[ 6 7 0 0 0]
[ 8 9 10 0 0]
[ 0 0 0 0 0]
[ 0 0 0 0 0]]
[[11 0 0 0 0]
[12 13 0 0 0]
[14 15 16 0 0]
[17 18 19 20 0]
[ 0 0 0 0 0]]
[[21 0 0 0 0]
[22 23 0 0 0]
[24 25 26 0 0]
[27 28 29 30 0]
[31 32 33 34 35]]]
You can see it is a transformation of the original code, with each "layer" of the tetrahedron built from a lower triangular matrix of increasing size, rather than upper triangular matrices of successively smaller size.
When you look at tetrahedron produced by this permutation, it should be obvious that each lower triangular slice starts at a tetrahedral number within the linear array of indices and each row within the lower triangular matrix starts at a triangular number offset relative to the start of the matrix. The indexing scheme is, therefore:
idx(i,j,k) = (i*(i+1)*(i+2)/6) + (j*(j+1)/2) + k
when data is arranged so that the kth dimension is the fastest varying in memory, and ith the slowest.
Now to the actual question. To calculate (i,j,k) from a given idx value would require calculating the integer cube root for i and the integer square root for j, which isn't particularly easy or performant and I would not imagine that it would offer any advantage over what you have now. However, if your implementation has a finite and known dimension a priori, you can use precalculated tetrahedral and triangular numbers and perform a lookup to replace the need to calculate roots.
A toy example:
#include <cstdio>
__constant__ unsigned int tetdata[100] =
{ 0, 1, 4, 10, 20, 35, 56, 84, 120, 165, 220, 286, 364, 455, 560, 680, 816, 969, 1140,
1330, 1540, 1771, 2024, 2300, 2600, 2925, 3276, 3654, 4060, 4495, 4960, 5456, 5984,
6545, 7140, 7770, 8436, 9139, 9880, 10660, 11480, 12341, 13244, 14190, 15180, 16215,
17296, 18424, 19600, 20825, 22100, 23426, 24804, 26235, 27720, 29260, 30856, 32509,
34220, 35990, 37820, 39711, 41664, 43680, 45760, 47905, 50116, 52394, 54740, 57155,
59640, 62196, 64824, 67525, 70300, 73150, 76076, 79079, 82160, 85320, 88560, 91881,
95284, 98770, 102340, 105995, 109736, 113564, 117480, 121485, 125580, 129766, 134044,
138415, 142880, 147440, 152096, 156849, 161700, 166650 };
__constant__ unsigned int tridata[100] =
{ 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120,
136, 153, 171, 190, 210, 231, 253, 276, 300, 325, 351, 378, 406,
435, 465, 496, 528, 561, 595, 630, 666, 703, 741, 780, 820, 861,
903, 946, 990, 1035, 1081, 1128, 1176, 1225, 1275, 1326, 1378, 1431,
1485, 1540, 1596, 1653, 1711, 1770, 1830, 1891, 1953, 2016, 2080, 2145,
2211, 2278, 2346, 2415, 2485, 2556, 2628, 2701, 2775, 2850, 2926, 3003,
3081, 3160, 3240, 3321, 3403, 3486, 3570, 3655, 3741, 3828, 3916, 4005,
4095, 4186, 4278, 4371, 4465, 4560, 4656, 4753, 4851, 4950 };
__device__ unsigned int lookup(unsigned int&x, unsigned int n, const unsigned int* data)
{
int i=0;
while (n >= data[i]) i++;
x = data[i-1];
return i-1;
}
__device__ unsigned int tetnumber(unsigned int& x, unsigned int n) { return lookup(x, n, tetdata); }
__device__ unsigned int trinumber(unsigned int& x, unsigned int n) { return lookup(x, n, tridata); }
__global__ void kernel()
{
unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int x;
unsigned int k = idx;
unsigned int i = tetnumber(x, k); k -= x;
unsigned int j = trinumber(x, k); k -= x;
printf("idx = %d, i=%d j=%d k=%d\n", idx, i, j, k);
}
int main(void)
{
cudaSetDevice(0);
kernel<<<1,35>>>();
cudaDeviceSynchronize();
cudaDeviceReset();
return 0;
}
which does the same thing as the python (note the out-of-order print output):
$ nvcc -o tetrahedral tetrahedral.cu
avidday@marteil2:~/SO$ cuda-memcheck ./tetrahedral
========= CUDA-MEMCHECK
idx = 32, i=4 j=4 k=2
idx = 33, i=4 j=4 k=3
idx = 34, i=4 j=4 k=4
idx = 0, i=0 j=0 k=0
idx = 1, i=1 j=0 k=0
idx = 2, i=1 j=1 k=0
idx = 3, i=1 j=1 k=1
idx = 4, i=2 j=0 k=0
idx = 5, i=2 j=1 k=0
idx = 6, i=2 j=1 k=1
idx = 7, i=2 j=2 k=0
idx = 8, i=2 j=2 k=1
idx = 9, i=2 j=2 k=2
idx = 10, i=3 j=0 k=0
idx = 11, i=3 j=1 k=0
idx = 12, i=3 j=1 k=1
idx = 13, i=3 j=2 k=0
idx = 14, i=3 j=2 k=1
idx = 15, i=3 j=2 k=2
idx = 16, i=3 j=3 k=0
idx = 17, i=3 j=3 k=1
idx = 18, i=3 j=3 k=2
idx = 19, i=3 j=3 k=3
idx = 20, i=4 j=0 k=0
idx = 21, i=4 j=1 k=0
idx = 22, i=4 j=1 k=1
idx = 23, i=4 j=2 k=0
idx = 24, i=4 j=2 k=1
idx = 25, i=4 j=2 k=2
idx = 26, i=4 j=3 k=0
idx = 27, i=4 j=3 k=1
idx = 28, i=4 j=3 k=2
idx = 29, i=4 j=3 k=3
idx = 30, i=4 j=4 k=0
idx = 31, i=4 j=4 k=1
========= ERROR SUMMARY: 0 errors
Obviously the lookup function is only for demonstration purposes. At large sizes either a binary array or hash based look-up would be much faster. But this at least demonstrates that it seems possible to do what you envisaged, even if the problem solved and approach are subtly different from what you probably had in mind.
Note I have no formal mathemtical proofs for anything in this answer and don't claim that any of the code or propositions here are correct. Buyer beware.
After some more thought, it is trivial to extend this approach via a hybrid search/calculation routine which is reasonably efficient:
#include <iostream>
#include <vector>
#include <cstdio>
typedef unsigned int uint;
__device__ __host__ ulong tetnum(uint n) { ulong n1(n); return n1 * (n1 + 1ull) * (n1 + 2ull) / 6ull; }
__device__ __host__ ulong trinum(uint n) { ulong n1(n); return n1 * (n1 + 1ull) / 2ull; }
typedef ulong (*Functor)(uint);
template<Functor F>
__device__ __host__ uint bounded(ulong& y, ulong x, uint n1=0, ulong y1=0)
{
uint n = n1;
y = y1;
while (x >= y1) {
y = y1;
n = n1++;
y1 = F(n1);
}
return n;
}
__constant__ uint idxvals[19] = {
0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384,
32768, 65536, 131072 };
__constant__ ulong tetvals[19] = {
0, 1, 4, 20, 120, 816, 5984, 45760, 357760, 2829056, 22500864, 179481600, 1433753600,
11461636096, 91659526144, 733141975040, 5864598896640, 46914643623936, 375308558925824 };
__constant__ ulong trivals[19] = {
0, 1, 3, 10, 36, 136, 528, 2080, 8256, 32896, 131328, 524800, 2098176, 8390656, 33558528,
134225920, 536887296, 2147516416, 8590000128 };
__device__ __host__ uint lookup(ulong& x, uint n, const uint* abscissa, const ulong* data)
{
uint i=0;
while (n >= data[i]) i++;
x = data[i-1];
return abscissa[i-1];
}
__device__ uint tetnumber(ulong& x, uint n)
{
ulong x0;
uint n0 = lookup(x0, n, idxvals, tetvals);
return bounded<tetnum>(x, n, n0, x0);
}
__device__ uint trinumber(ulong& x, uint n)
{
ulong x0;
uint n0 = lookup(x0, n, idxvals, trivals);
return bounded<trinum>(x, n, n0, x0);
}
__global__ void kernel(uint3 *results, ulong Nmax)
{
ulong idx = threadIdx.x + blockIdx.x * blockDim.x;
ulong gridStride = blockDim.x * gridDim.x;
for(; idx < Nmax; idx += gridStride) {
ulong x, k1 = idx;
uint3 tuple;
tuple.x = tetnumber(x, k1); k1 -= x;
tuple.y = trinumber(x, k1); k1 -= x;
tuple.z = (uint)k1;
results[idx] = tuple;
}
}
int main(void)
{
cudaSetDevice(0);
uint N = 500;
ulong Nmax = tetnum(N);
uint3* results_d; cudaMalloc(&results_d, Nmax * sizeof(uint3));
int gridsize, blocksize;
cudaOccupancyMaxPotentialBlockSize(&gridsize, &blocksize, kernel);
kernel<<<gridsize, blocksize>>>(results_d, Nmax);
cudaDeviceSynchronize();
std::vector<uint3> results(Nmax);
cudaMemcpy(&results[0], results_d, Nmax * sizeof(uint3), cudaMemcpyDeviceToHost);
cudaDeviceReset();
// Only uncomment this if you want to see 22 million lines of output
//for(auto const& idx : results) {
// std::cout << idx.x << " " << idx.y << " " << idx.z << std::endl;
//}
return 0;
}
which does this (be aware it will emit 21 million lines of output if you uncomment the last loop):
$ module load use.own cuda9.2
$ nvcc -std=c++11 -arch=sm_52 -o tetrahedral tetrahedral.cu
$ nvprof ./tetrahedral
==20673== NVPROF is profiling process 20673, command: ./tetrahedral
==20673== Profiling application: ./tetrahedral
==20673== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 78.85% 154.23ms 1 154.23ms 154.23ms 154.23ms kernel(uint3*, unsigned long)
21.15% 41.361ms 1 41.361ms 41.361ms 41.361ms [CUDA memcpy DtoH]
API calls: 41.73% 154.24ms 1 154.24ms 154.24ms 154.24ms cudaDeviceSynchronize
30.90% 114.22ms 1 114.22ms 114.22ms 114.22ms cudaMalloc
15.94% 58.903ms 1 58.903ms 58.903ms 58.903ms cudaDeviceReset
11.26% 41.604ms 1 41.604ms 41.604ms 41.604ms cudaMemcpy
0.11% 412.75us 96 4.2990us 275ns 177.45us cuDeviceGetAttribute
0.04% 129.46us 1 129.46us 129.46us 129.46us cuDeviceTotalMem
0.02% 55.616us 1 55.616us 55.616us 55.616us cuDeviceGetName
0.01% 32.919us 1 32.919us 32.919us 32.919us cudaLaunchKernel
0.00% 10.211us 1 10.211us 10.211us 10.211us cudaSetDevice
0.00% 5.7640us 1 5.7640us 5.7640us 5.7640us cudaFuncGetAttributes
0.00% 4.6690us 1 4.6690us 4.6690us 4.6690us cuDeviceGetPCIBusId
0.00% 2.8580us 4 714ns 393ns 1.3680us cudaDeviceGetAttribute
0.00% 2.8050us 3 935ns 371ns 2.0030us cuDeviceGetCount
0.00% 2.2780us 1 2.2780us 2.2780us 2.2780us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.00% 1.6720us 1 1.6720us 1.6720us 1.6720us cudaGetDevice
0.00% 1.5450us 2 772ns 322ns 1.2230us cuDeviceGet
That code calculates and stores the unique (i,j,k) pairs for a 500 x 500 x 500 search space (about 21 million values) in 150 milliseconds on my GTX970. Perhaps that is some use to you.