60

From some comments that I have read here, it seems that a Structure of Arrays (SoA) is preferable to an Array of Structures (AoS) for parallel implementations like CUDA. If that is true, can anyone explain why?

BugShotGG
  • I have posted a comparison of the performance between Structures of Arrays and Arrays of Structures as an answer to this post: [Sorting arrays of structures in CUDA](http://stackoverflow.com/questions/23541503/sorting-arrays-of-structures-in-cuda/23645954#23645954). – Vitality May 25 '14 at 20:20

3 Answers

76

Choice of AoS versus SoA for optimum performance usually depends on access pattern. This is not just limited to CUDA however - similar considerations apply for any architecture where performance can be significantly affected by memory access pattern, e.g. where you have caches or where performance is better with contiguous memory access (e.g. coalesced memory accesses in CUDA).

E.g. for RGB pixels versus separate RGB planes:

struct {
    uint8_t r, g, b;
} AoS[N];

struct {
    uint8_t r[N];
    uint8_t g[N];
    uint8_t b[N];
} SoA;

If you are going to be accessing the R/G/B components of each pixel concurrently then AoS usually makes sense, since the successive reads of R, G, B components will be contiguous and usually contained within the same cache line. For CUDA this also means memory read/write coalescing.
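For example, a hypothetical per-pixel kernel (the names here are illustrative only, not from the question) that needs all three components reads adjacent bytes from the AoS layout:

struct Pixel { uint8_t r, g, b; };   // AoS element, as above

// Each thread converts one pixel to grayscale, touching r, g and b together;
// the three bytes are adjacent, so a warp's reads fall in a compact region.
__global__ void toGray_AoS(const Pixel *in, uint8_t *gray, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Pixel p = in[i];
        gray[i] = (uint8_t)((p.r + p.g + p.b) / 3);
    }
}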

However if you are going to process color planes separately then SoA might be preferred, e.g. if you want to scale all R values by some scale factor, then SoA means that all R components will be contiguous.
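Conversely, a sketch of the plane-wise case (again, the kernel name is just illustrative): scaling the R plane of the SoA layout touches one contiguous array, so consecutive threads access consecutive bytes and the G and B planes are not read at all:

// Each thread scales one element of the R plane; accesses are contiguous
// across the warp, and the g[] and b[] planes are left untouched.
__global__ void scaleR_SoA(uint8_t *r, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        r[i] = (uint8_t)(r[i] * scale);
}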

One further consideration is padding/alignment. In the RGB example above each AoS element is 3 bytes, so successive elements start at offsets that are multiples of 3 bytes, which may not be convenient for CUDA, SIMD, et al - in some cases perhaps even requiring padding within the struct to make alignment more convenient (e.g. add a dummy uint8_t element to ensure 4 byte alignment). In the SoA case however the planes are byte aligned, which can be more convenient for certain algorithms/architectures.
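For instance, a common workaround (just a sketch; the struct and member names here are arbitrary) is to pad each AoS element out to 4 bytes:

struct PaddedPixel {
    uint8_t r, g, b;
    uint8_t pad;   // dummy byte: sizeof(PaddedPixel) == 4, so each element starts on a 4-byte boundary
};

// CUDA's built-in uchar4 vector type gives the same 4-byte size and alignment.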

For most image-processing applications the AoS scenario is much more common, but for other applications, or for specific image-processing tasks, this may not always be the case. When there is no obvious choice I would recommend AoS as the default.

See also this answer for a more general discussion of AoS vs SoA.

Paul R
  • Thanks for the reply! Hmm, so it's up to the implementation's access pattern, as I imagined. So, if we have 1 thread accessing 10 contiguous elements from an AoS, and let's say we have 100 total elements, is SoA (each with those 10 elements) a better choice than letting every thread access 10 contiguous elements from an AoS, from a performance point of view? – BugShotGG Jul 29 '13 at 13:14
  • For the RGB case, if each thread is accessing R, G, B components together then AoS is probably better, whereas if each thread is only accessing one color plane, then SoA is probably better. However you need to look at the total access pattern for all threads, since external memory accesses are usually the bottleneck. – Paul R Jul 29 '13 at 13:16
10

I just want to provide a simple example showing how a Struct of Arrays (SoA) performs better than an Array of Structs (AoS).

In the example, I'm considering three different versions of the same code:

  1. SoA (v1)
  2. Straight arrays (v2)
  3. AoS (v3)

In particular, version 2 uses straight arrays. For this example, the timings of versions 2 and 3 are the same and turn out to be better than version 1. I suspect that, in general, straight arrays could be preferable, although at the expense of readability, since, for example, loading through the uniform (read-only) cache could be enabled via const __restrict__ in that case.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#include <thrust/device_vector.h>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

#define BLOCKSIZE   1024

/******************************************/
/* CELL STRUCT LEADING TO ARRAY OF STRUCT */
/******************************************/
struct cellAoS {

    unsigned int    x1;
    unsigned int    x2;
    unsigned int    code;
    bool            done;

};

/*******************************************/
/* CELL STRUCT LEADING TO STRUCT OF ARRAYS */
/*******************************************/
struct cellSoA {

    unsigned int    *x1;
    unsigned int    *x2;
    unsigned int    *code;
    bool            *done;

};


/*******************************************/
/* KERNEL MANIPULATING THE ARRAY OF STRUCT */
/*******************************************/
__global__ void AoSvsSoA_v1(cellAoS *d_cells, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {
        cellAoS tempCell = d_cells[tid];

        tempCell.x1 = tempCell.x1 + 10;
        tempCell.x2 = tempCell.x2 + 10;

        d_cells[tid] = tempCell;
    }

}

/******************************/
/* KERNEL MANIPULATING ARRAYS */
/******************************/
__global__ void AoSvsSoA_v2(unsigned int * __restrict__ d_x1, unsigned int * __restrict__ d_x2, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        d_x1[tid] = d_x1[tid] + 10;
        d_x2[tid] = d_x2[tid] + 10;

    }

}

/********************************************/
/* KERNEL MANIPULATING THE STRUCT OF ARRAYS */
/********************************************/
__global__ void AoSvsSoA_v3(cellSoA cell, const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        cell.x1[tid] = cell.x1[tid] + 10;
        cell.x2[tid] = cell.x2[tid] + 10;

    }

}

/********/
/* MAIN */
/********/
int main() {

    const int N = 2048 * 2048 * 4;

    TimingGPU timerGPU;

    thrust::host_vector<cellAoS>    h_cells(N);
    thrust::device_vector<cellAoS>  d_cells(N);

    thrust::host_vector<unsigned int>   h_x1(N);
    thrust::host_vector<unsigned int>   h_x2(N);

    thrust::device_vector<unsigned int> d_x1(N);
    thrust::device_vector<unsigned int> d_x2(N);

    for (int k = 0; k < N; k++) {

        h_cells[k].x1 = k + 1;
        h_cells[k].x2 = k + 2;
        h_cells[k].code = k + 3;
        h_cells[k].done = true;

        h_x1[k] = k + 1;
        h_x2[k] = k + 2;

    }

    d_cells = h_cells;

    d_x1 = h_x1;
    d_x2 = h_x2;

    cellSoA cell;
    cell.x1 = thrust::raw_pointer_cast(d_x1.data());
    cell.x2 = thrust::raw_pointer_cast(d_x2.data());
    cell.code = NULL;
    cell.done = NULL;

    timerGPU.StartCounter();
    AoSvsSoA_v1<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_cells.data()), N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing AoSvsSoA_v1 = %f\n", timerGPU.GetCounter());

    //timerGPU.StartCounter();
    //AoSvsSoA_v2<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_x1.data()), thrust::raw_pointer_cast(d_x2.data()), N);
    //gpuErrchk(cudaPeekAtLastError());
    //gpuErrchk(cudaDeviceSynchronize());
    //printf("Timing AoSvsSoA_v2 = %f\n", timerGPU.GetCounter());

    timerGPU.StartCounter();
    AoSvsSoA_v3<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(cell, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing AoSvsSoA_v3 = %f\n", timerGPU.GetCounter());

    h_cells = d_cells;

    h_x1 = d_x1;
    h_x2 = d_x2;

    // --- Check results
    for (int k = 0; k < N; k++) {
        if (h_x1[k] != k + 11) {
            printf("h_x1[%i] not equal to %i\n", k, k + 11);
            break;
        }
        if (h_x2[k] != k + 12) {
            printf("h_x2[%i] not equal to %i\n", k, k + 12);
            break;
        }
        if (h_cells[k].x1 != k + 11) {
            printf("h_cells[%i].x1 not equal to %i\n", k, k + 11);
            break;
        }
        if (h_cells[k].x2 != k + 12) {
            printf("h_cells[%i].x2 not equal to %i\n", k, k + 12);
            break;
        }
    }

}

The following are the timings (runs performed on a GTX960):

Array of struct        9.1ms (v1 kernel)
Struct of arrays       3.3ms (v3 kernel)
Straight arrays        3.2ms (v2 kernel)
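
Regarding the const __restrict__ remark above, a possible variant of the v2 kernel (a sketch only - it writes to separate output arrays so that the input pointers can be declared read-only; the kernel and parameter names are mine) would look like this:

// Sketch: with separate, read-only inputs marked const __restrict__, the
// compiler is allowed to route the loads through the read-only data cache.
__global__ void AoSvsSoA_v2_readonly(const unsigned int * __restrict__ d_x1_in,
                                     const unsigned int * __restrict__ d_x2_in,
                                     unsigned int * __restrict__ d_x1_out,
                                     unsigned int * __restrict__ d_x2_out,
                                     const int N) {

    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {
        d_x1_out[tid] = d_x1_in[tid] + 10;
        d_x2_out[tid] = d_x2_in[tid] + 10;
    }
}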
Vitality
1

SoA works well for SIMD processing, for several reasons, but basically because it's more efficient to load 4 consecutive floats into a register. With something like:

 alignas(16) float v[4] = {0};   // 16-byte alignment is required by _mm_load_ps
 __m128 reg = _mm_load_ps( v );

than using:

 struct vec { float x, y, z, w; };
 vec v = {0, 0, 0, 0};

and creating an __m128 by reading each member individually:

 __m128 reg = _mm_set_ps(v.w, v.z, v.y, v.x);   // _mm_set_ps takes elements from highest to lowest lane

If your arrays are 16-byte aligned, loads/stores are faster, and some operations can be performed directly on memory operands.
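As a self-contained illustration (a sketch only - the function and array names are mine, and n is assumed to be a multiple of 4), scaling a 16-byte-aligned SoA plane four floats at a time looks like:

#include <xmmintrin.h>

// Scale an SoA plane of floats in place, 4 elements per iteration.
// x is assumed to be 16-byte aligned and n a multiple of 4.
void scale_plane(float *x, float s, int n)
{
    __m128 vs = _mm_set1_ps(s);                  // broadcast the scale factor
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);          // aligned load of 4 consecutive floats
        _mm_store_ps(x + i, _mm_mul_ps(vx, vs)); // multiply and store back
    }
}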

alexbuisson
  • You could perhaps cast it, though, and "turn it into an array": __m128 reg = _mm_load_ps( &v.x ); This would assume that v is 16-byte aligned with 0 bytes of padding between the members. – pilkch Jan 23 '15 at 06:06