3

I am currently working on a project where I would like to optimize some numerical computation in Python by calling C.

In short, I need to compute the value of y[i] = f(x[i]) for each element in an huge array x (typically has 10^9 entries or more). Here, x[i] is an integer between -10 and 10 and f is function that takes x[i] and returns a double. My issue is that f but it takes a very long time to evaluate in a way that is numerically stable.

To speed things up, I would like to just hard code all 2*10 + 1 possible values of f(x[i]) into constant array such as:

double table_of_values[] = {f(-10), ...., f(10)};

And then just evaluate f using a "lookup table" approach as follows:

for (i = 0; i < N; i++) {
    y[i] = table_of_values[x[i] + 11]; //instead of y[i] = f(x[i])
}

Since I am not really well-versed at writing optimized code in C, I am wondering:

  1. Specifically - since x is really large - I'm wondering if it's worth doing second-degree optimization when evaluating the loop (e.g. by sorting x beforehand, or by finding a smart way to deal with the negative indices (aside from just doing [x[i] + 10 + 1])?

  2. Say x[i] were not between -10 and 10, but between -20 and 20. In this case, I could still use the same approach, but would need to hard code the lookup table manually. Is there a way to generate the look-up table dynamically in the code so that I make use of the same approach and allow for x[i] to belong to a variable range?

chqrlie
  • 131,814
  • 10
  • 121
  • 189
Berk U.
  • 7,018
  • 6
  • 44
  • 69
  • arrays can't have negative indexes. just normalize to `arr[maybe_negative_index+20]` – Marc B Feb 10 '16 at 21:33
  • Ad 1) What does the size of `x` have to do with the size of the lookup table? x will use gigabytes of memory whereas the lookup table only a 100 bytes or so. Ad 2) Yes - you could write a python script to generate C source code (;-)), or, `malloc` the lookup table and loop over -20..20, calling `f`. A small suggestion: why not implement the lookup table in python first, and see if it makes a difference in speed. It's more likely that the size of the dataset (a billion doubles, using 4 or 8Gb of ram) is the issue. How much ram do you have? – Kenney Feb 10 '16 at 21:44
  • LUT method would finally come to the tradeoff between accuracy and table size. It's common practice to use a small table a a rough guess and run a few iterations of approximations to get the final result. – user3528438 Feb 10 '16 at 21:57
  • If you *can* sort `x` and deal with consequences, it can be *very* efficiently compressed by a simple run-length algorithm, as it has a very limited number of possible values. In your case it will turn into array of size of order 2*21=42 elements... – Eugene Sh. Feb 10 '16 at 22:25
  • Agree with @Marc, but would state it as `table_of_values[max_index - min_index+ 1]; y[i] = table_of_values[i - min_index];` – chux - Reinstate Monica Feb 10 '16 at 23:03
  • @marcb it's well defined to create a pointer to an element at a given index in the array, and access elements before it using a negative subscript, for the record. – Richard J. Ross III Feb 10 '16 at 23:04

3 Answers3

5

It's fairly easy to generate such a table with dynamic range values.

Here's a simple, single table method:

#include <malloc.h>

#define VARIABLE_USED(_sym) \
    do { \
        if (1) \
            break; \
        if (!! _sym) \
            break; \
    } while (0)

double *table_of_values;
int table_bias;

// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef char xval_t;
#endif

#define XLEN        (1 << 9)
xval_t *x;

// fslow -- your original function
double
fslow(int i)
{
    return 1;  // whatever
}

// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
    int len;

    table_bias = -lo;

    len = hi - lo;
    len += 1;

    // NOTE: you can do free(table_of_values) when no longer needed
    table_of_values = malloc(sizeof(double) * len);

    for (int i = lo;  i <= hi;  ++i)
        table_of_values[i + table_bias] = f(i);
}

// fcached -- retrieve cached table data
double
fcached(int i)
{
    return table_of_values[i + table_bias];
}

// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
    double *tptr;
    int bias;
    double val;

    // ensure these go into registers to prevent needless extra memory fetches
    tptr = table_of_values;
    bias = table_bias;

    for (int i = 0;  i < XLEN;  ++i) {
        val = tptr[x[i] + bias];

        // do stuff with val
        VARIABLE_USED(val);
    }
}

int
main(void)
{

    ftablegen(fslow,-10,10);
    x = malloc(sizeof(xval_t) * XLEN);
    fripper(x);

    return 0;
}

Here's a slightly more complex way that allows many similar tables to be generated:

#include <malloc.h>

#define VARIABLE_USED(_sym) \
    do { \
        if (1) \
            break; \
        if (!! _sym) \
            break; \
    } while (0)

// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 1
typedef short xval_t;
#endif
#if 0
typedef char xval_t;
#endif

#define XLEN        (1 << 9)
xval_t *x;

struct table {
    int tbl_lo;                         // lowest index
    int tbl_hi;                         // highest index
    int tbl_bias;                       // bias for index
    double *tbl_data;                   // cached data
};

struct table ftable1;
struct table ftable2;

double
fslow(int i)
{
    return 1;  // whatever
}

double
f2(int i)
{
    return 2;  // whatever
}

// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi,struct table *tbl)
{
    int len;

    tbl->tbl_bias = -lo;

    len = hi - lo;
    len += 1;

    // NOTE: you can do free tbl_data when no longer needed
    tbl->tbl_data = malloc(sizeof(double) * len);

    for (int i = lo;  i <= hi;  ++i)
        tbl->tbl_data[i + tbl->tbl_bias] = fslow(i);
}

// fcached -- retrieve cached table data
double
fcached(struct table *tbl,int i)
{
    return tbl->tbl_data[i + tbl->tbl_bias];
}

// fripper -- access x and table arrays
void
fripper(xval_t *x,struct table *tbl)
{
    double *tptr;
    int bias;
    double val;

    // ensure these go into registers to prevent needless extra memory fetches
    tptr = tbl->tbl_data;
    bias = tbl->tbl_bias;

    for (int i = 0;  i < XLEN;  ++i) {
        val = tptr[x[i] + bias];

        // do stuff with val
        VARIABLE_USED(val);
    }
}

int
main(void)
{

    x = malloc(sizeof(xval_t) * XLEN);

    // NOTE: we could use 'char' for xval_t ...
    ftablegen(fslow,-37,62,&ftable1);
    fripper(x,&ftable1);

    // ... but, this forces us to use a 'short' for xval_t
    ftablegen(f2,-99,307,&ftable2);

    return 0;
}

Notes:

fcached could/should be an inline function for speed. Notice that once the table is calculated once, fcached(x[i]) is quite fast. The index offset issue you mentioned [solved by the "bias"] is trivially small in calculation time.

While x may be a large array, the cached array for f() values is fairly small (e.g. -10 to 10). Even if it were (e.g.) -100 to 100, this is still about 200 elements. This small cached array will [probably] stay in the hardware memory cache, so access will remain quite fast.

Thus, sorting x to optimize H/W cache performance of the lookup table will have little to no [measurable] effect.

The access pattern to x is independent. You'll get best performance if you access x in a linear manner (e.g. for (i = 0; i < 999999999; ++i) x[i]). If you access it in a semi-random fashion, it will put a strain on the H/W cache logic and its ability to keep the needed/wanted x values "cache hot"

Even with linear access, because x is so large, by the time you get to the end, the first elements will have been evicted from the H/W cache (e.g. most CPU caches are on the order of a few megabytes)

However, if x only has values in a limited range, changing the type from int x[...] to short x[...] or even char x[...] cuts the size by a factor of 2x [or 4x]. And, that can have a measurable improvement on the performance.

Update: I've added an fripper function to show the fastest way [that I know of] to access the table and x arrays in a loop. I've also added a typedef named xval_t to allow the x array to consume less space (i.e. will have better H/W cache performance).


UPDATE #2:

Per your comments ...

fcached was coded [mostly] to illustrate simple/single access. But, it was not used in the final example.

The exact requirements for inline has varied over the years (e.g. was extern inline). Best use now: static inline. However, if using c++, it may be, yet again different. There are entire pages devoted to this. The reason is because of compilation in different .c files, what happens when optimization is on or off. Also, consider using a gcc extension. So, to force inline all the time:

__attribute__((__always_inline__)) static inline

fripper is the fastest because it avoids refetching globals table_of_values and table_bias on each loop iteration. In fripper, compiler optimizer will ensure they remain in registers. See my answer: Is accessing statically or dynamically allocated memory faster? as to why.

However, I coded an fripper variant that uses fcached and the disassembled code was the same [and optimal]. So, we can disregard that ... Or, can we? Sometimes, disassembling the code is a good cross check and the only way to know for sure. Just an extra item when creating fully optimized C code. There are many options one can give to the compiler regarding code generation, so sometimes it's just trial and error.

Because benchmarking is important, I threw in my routines for timestamping (FYI, [AFAIK] the underlying clock_gettime call is the basis for python's time.clock()).

So, here's the updated version:

#include <malloc.h>
#include <time.h>

typedef long long s64;

#define SUPER_INLINE \
    __attribute__((__always_inline__)) static inline

#define VARIABLE_USED(_sym) \
    do { \
        if (1) \
            break; \
        if (!! _sym) \
            break; \
    } while (0)

#define TVSEC           1000000000LL    // nanoseconds in a second
#define TVSECF          1e9             // nanoseconds in a second

// tvget -- get high resolution time of day
// RETURNS: absolute nanoseconds
s64
tvget(void)
{
    struct timespec ts;
    s64 nsec;

    clock_gettime(CLOCK_REALTIME,&ts);

    nsec = ts.tv_sec;
    nsec *= TVSEC;
    nsec += ts.tv_nsec;

    return nsec;
)

// tvgetf -- get high resolution time of day
// RETURNS: fractional seconds
double
tvgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_REALTIME,&ts);

    sec = ts.tv_nsec;
    sec /= TVSECF;
    sec += ts.tv_sec;

    return sec;
)

double *table_of_values;
int table_bias;

double *dummyptr;

// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef char xval_t;
#endif

#define XLEN        (1 << 9)
xval_t *x;

// fslow -- your original function
double
fslow(int i)
{
    return 1;  // whatever
}

// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
    int len;

    table_bias = -lo;

    len = hi - lo;
    len += 1;

    // NOTE: you can do free(table_of_values) when no longer needed
    table_of_values = malloc(sizeof(double) * len);

    for (int i = lo;  i <= hi;  ++i)
        table_of_values[i + table_bias] = f(i);
}

// fcached -- retrieve cached table data
SUPER_INLINE double
fcached(int i)
{
    return table_of_values[i + table_bias];
}

// fripper_fcached -- access x and table arrays
void
fripper_fcached(xval_t *x)
{
    double val;
    double *dptr;

    dptr = dummyptr;

    for (int i = 0;  i < XLEN;  ++i) {
        val = fcached(x[i]);

        // do stuff with val
        dptr[i] = val;
    }
}

// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
    double *tptr;
    int bias;
    double val;
    double *dptr;

    // ensure these go into registers to prevent needless extra memory fetches
    tptr = table_of_values;
    bias = table_bias;
    dptr = dummyptr;

    for (int i = 0;  i < XLEN;  ++i) {
        val = tptr[x[i] + bias];

        // do stuff with val
        dptr[i] = val;
    }
}

int
main(void)
{

    ftablegen(fslow,-10,10);
    x = malloc(sizeof(xval_t) * XLEN);
    dummyptr = malloc(sizeof(double) * XLEN);
    fripper(x);
    fripper_fcached(x);

    return 0;
}
Craig Estey
  • 30,627
  • 4
  • 24
  • 48
  • Could you please elaborate on the usefulness of your macro `VARIABLE_USED`? Is its sole purpose to avoid the preceding assignment being optimized away by the compiler? If so, why not just use `volatile`? – vsz Feb 11 '16 at 05:10
  • 1
    @vsz It's just a boilerplate macro I have to prevent the compiler [with `-Wall`] from complaining about this example code [which doesn't actually _do_ anything with `val`]. Without it, the code example will produce: `warning: variable ‘val’ set but not used [-Wunused-but-set-variable]`. In real code, `VARIABLE_USED` would be replaced with code that does something useful with `val`. An alternate way would have been to create `void dummy(double val) {}` and replace `VARIABLE_USED(val)` with `dummy(val)` – Craig Estey Feb 11 '16 at 05:18
  • @CraigEstey Thank you for this **incredible** answer. I have some silly questions about `fcached` since that seems like it would be useful. It looks like `fcached` is declared in the code but not really used anywhere. Is that because you replaced `fcached` with `fripper`? If not, where would you plug it in? And would adding `inline` to the declaration to make sure it's inline? – Berk U. Feb 11 '16 at 15:27
1

You can have negative indices in your arrays. (I am not sure if this is in the specifications.) If you have the following code:

int arr[] = {1, 2 ,3, 4, 5};
int* lookupTable = arr + 3;
printf("%i", lookupTable[-2]);

it will print out 2.


This works because arrays in c are defined as pointers. And if the pointer does not point to the begin of the array, you can access the item before the pointer.

Keep in mind though that if you have to malloc() the memory for arr you probably cannot use free(lookupTable) to free it.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
David van rijn
  • 2,108
  • 9
  • 21
  • *arrays in c are defined as pointers*, please clarify this nonsense... and you **definitely** cannot use `free(lookupTable)` to free `arr`. – chqrlie Feb 10 '16 at 21:50
  • "have negative indices in your arrays" is not quite right. The _array_ `arr` cannot be accessed via a negative index, but the _pointer_ `lookupTable` can add a negative to it. – chux - Reinstate Monica Feb 10 '16 at 22:59
1

I really think Craig Estey is on the right track for building your table in an automatic way. I just want to add a note for looking up the table.

If you know that you will run the code on a Haswell machine (with AVX2) you should make sure your code utilise VGATHERDPD which you can utilize with the _mm256_i32gather_pd intrinsic. If you do that, your table lookups will fly! (You can even detect avx2 on the fly with cpuid(), but that's another story)

EDIT:
Let me elaborate with some code:

#include <stdint.h> 
#include <stdio.h> 
#include <immintrin.h> 
/* I'm not sure if you need the alignment */
double table[8]  __attribute__((aligned(16)))= { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };

int main()
{
    int32_t i[4] = { 0,2,4,6 };
    __m128i index = _mm_load_si128( (__m128i*) i );
    __m256d result = _mm256_i32gather_pd( table, index, 8 );

    double* f = (double*)&result;
    printf("%f %f %f %f\n", f[0], f[1], f[2], f[3]);
    return 0;
}

Compile and run:

$ gcc --std=gnu99 -mavx2 gathertest.c -o gathertest && ./gathertest
0.100000 0.300000 0.500000 0.700000

This is fast!