4

OK, here's my situation:

  • I have a function - let's say U64 calc (U64 x) - which takes a 64-bit integer parameter, performs some CPU-intensive operation, and returns a 64-bit value
  • Now, given that I know ALL possible inputs (the xs) of that function beforehand (there are some 16,000 of them), I thought it might be better to pre-calculate them and then fetch them on demand (from an array-like structure).
  • The ideal situation would be to store them all in some array U64 CALC[] and retrieve them by index (the x again)
  • And here's the issue: I may know what the possible inputs for my calc function are, but they are most definitely NOT consecutive (e.g. not from 1 to 16000, but values that may go as low as 0 and as high as some trillions - always within a 64-bit range)

E.G.

  X        CALC[X]
-----------------------
  123123   123123123
  12312    12312312
  897523   986123

  etc.

And here comes my question:

  • How would you store them?
  • What workaround would you prefer?
  • Now, given that these values (from CALC) will have to be accessed some thousands to millions of times per second, which would be the best solution performance-wise?

Note: I'm not mentioning anything I've thought of or tried, so as not to turn the answers into an A-vs-B debate, and mostly so as not to influence anyone...

Dr.Kameleon
  • Profile using map, set, and unordered_map, then make an informed decision. – Retired Ninja Dec 16 '12 at 04:48
  • Showing what you've tried might help. There are also some answers [here](http://stackoverflow.com/questions/4306/what-is-the-best-way-to-create-a-sparse-array-in-c) that might help, including one that implies using a map has pretty good performance. Also consider a hashmap. – Rob I Dec 16 '12 at 04:48
  • you could also try to use a [trie](http://stackoverflow.com/questions/1036504/trie-implementation). – didierc Dec 16 '12 at 04:50
  • if you know the access patterns of your data structures, you could also implement some caching techniques (LRU, MRU come to mind). – didierc Dec 16 '12 at 05:13
  • You could consider using a minimal perfect hash. Most of the literature for this concentrates on strings, but I see no reason the technique could not be used on integers. – md5i Dec 16 '12 at 05:24
  • qsort(), bsearch() of an array of 2-member structs. Every last time, in 2 seconds flat. –  Dec 16 '12 at 07:24
  • I have to say, looking at your profile, it's hard to believe you're asking this question. Something's not right. This is CS 101 stuff. –  Dec 16 '12 at 07:46
  • @DrKameleon There seems to be some confusion about the number of elements. Is it 16,000 or 16,000 thousand? IE: 16 million? –  Dec 16 '12 at 09:41
  • @RocketRoy It's around 16,000 (2^14 to be precise). Just edited the original post too. – Dr.Kameleon Dec 16 '12 at 10:37
  • What's the calc function like? – didierc Dec 16 '12 at 16:27
  • @didierc It's a valid-move occupancy-aware generator function for all possible bitboards per piece type/position... (What am I talking about? A Chess Engine... :-)) – Dr.Kameleon Dec 16 '12 at 20:53
  • @Dr.Kameleon ok. Does it map one bitboard to another? – didierc Dec 16 '12 at 21:30
  • @didierc it maps all possible occupancy bitboards (per piece position) to their equivalent "attackable" bitboard (which squares can be attacked by piece X, on square Y, given an occupancy mask for either Rank/File or Diagonals) – Dr.Kameleon Dec 17 '12 at 05:16
  • so you have more than just a bitboard on input, but I suppose that the other parameters are used as a first step index. – didierc Dec 17 '12 at 08:59
  • @didierc I don't know how I'm finally going to do it; what I'm currently thinking is to perhaps set a primary index (e.g. position) and use an array of maps. – Dr.Kameleon Dec 17 '12 at 12:51
  • Based on a LOT of benchmarking for my post below, Hash and Bsearch2() are neck & neck - and that's assuming you spend a lot of time creating a good custom hash function. STL unordered_map won't even come close to Bsearch2(), which holds up much better under resource stress. Anything larger than 16k Bsearch2() will win. –  Oct 31 '14 at 03:40

6 Answers

4

I would use some sort of hash function that creates an index to a u64 pair, where one member is the value the key was created from and the other is the replacement value. Technically the index could be three bytes long (assuming 16 million - "16000 thousand" - pairs) if you need to conserve space, but I'd use u32s. If the stored value does not match the value the hash was computed from (a hash collision), you'd enter an overflow handler.

  • You need to determine a custom hashing algorithm to fit your data
  • Since you know the size of the data you don't need algorithms that allow the data to grow.
  • I'd be wary of using some standard algorithm because they seldom fit specific data
  • I'd be wary of using a C++ method unless you are sure the code is WYSIWYG (doesn't generate a lot of invisible calls)
  • Your index should be 25% larger than the number of pairs

Run through all possible inputs and determine min, max, average and standard deviation for the number of collisions and use these to determine the acceptable performance level. Then profile to achieve the best possible code.

The required memory space (using a u32 index) comes out to (4*1.25)+8+8 = 21 bytes per pair, or 336 MB for 16 million pairs - no problem on a typical PC.

________ EDIT________

I have been challenged by "RocketRoy" to put my money where my mouth is. Here goes:

The problem has to do with collision handling in a (fixed size) hash index. To set the stage:

  • I have a list of n entries where a field in the entry contains the value v that the hash is computed from
  • I have a vector of approximately n*1.25 indices such that the number of indices x is a prime number
  • A prime number y is computed which is a fraction of x
  • The vector is initialized to, say, -1 to denote unoccupied slots

Pretty standard stuff I think you'll agree.

The entries in the list are processed: for each entry the hash value h is computed, taken modulo x and used as an index into the vector, and the index of the entry is placed there.

Eventually I encounter the situation where the vector entry pointed to by the index is occupied (doesn't contain -1) - voilà, a collision.

So what do I do? I keep the original h as ho, add y to h, take it modulo x and get a new index into the vector. If that entry is unoccupied I use it; otherwise I keep adding y modulo x until I get back to ho. In theory this will eventually happen because x and y are coprime (x is prime and y < x), so the probe sequence visits every slot. In practice x is larger than n, so it won't.

So the "re-hash" that RocketRoy claims is very costly is no such thing.
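
For concreteness, here is a minimal sketch of the lookup side of this scheme, with illustrative names of my own (the complete benchmark code appears further down):

#include <stdint.h>

/* Sketch of open addressing with a prime table size x and a prime step y.  */
/* idx[] holds positions into a pairs[] array, or -1 for "unoccupied".      */
struct pair64 { uint64_t key, val; };

int lookup (const int32_t idx[], const struct pair64 pairs[],
            uint64_t x,     /* prime table size         */
            uint64_t y,     /* prime rehash step, y < x */
            uint64_t key, uint64_t *out)
{
  uint64_t ho = key % x;                    /* initial hash                        */
  uint64_t h  = ho;
  do
  {
    int32_t i = idx[h];
    if (i < 0) return 0;                    /* unoccupied slot: key not present    */
    if (pairs[i].key == key) { *out = pairs[i].val; return 1; }
    h += y;                                 /* collision: step by the second prime */
    if (h >= x) h -= x;                     /* cheap modulo, since y < x           */
  } while (h != ho);                        /* full cycle only if the table is full */
  return 0;
}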

The tricky part with this method - as with all hashing methods - is to:

  • Determine a suitable value for x (becomes less tricky the larger x finally used)
  • Determine the algorithm a for h=a(v)%x such that the resulting h's index reasonably evenly ("randomly") into the index vector, with as few collisions as possible
  • Determine a suitable value for y such that collisions index reasonably evenly ("randomly") into the index vector

________ EDIT________

I'm sorry I've taken so long to produce this code ... other things have had higher priorities.

Anyway, here is the code which proves that hashing has better prospects for quick lookups than a binary tree. It runs through a bunch of hashing index sizes and algorithms to aid in finding the most suitable combo for the specific data. For every algorithm the code will print the first index size such that no lookup takes longer than fourteen searches (worst case for binary searching) and an average lookup takes less than 1.5 searches.

I have a fondness for prime numbers in these types of applications, in case you haven't noticed.

There are many ways of creating a hashing algorithm with its mandatory overflow handling. I opted for simplicity, assuming it will translate into speed ... and it does. On my laptop with an i5 M 480 @ 2.67 GHz an average lookup requires between 55 and 60 clock cycles (which comes out to around 45 million lookups per second). I also implemented a special get operation with a constant number of indices and ditto rehash value, and the cycle count dropped to 40 (65 million lookups per second). If you look at the line calling getOp, the index i is xor'ed with 0x444 to exercise the caches and achieve more "real world"-like results.

I must again point out that the program suggests suitable combinations for the specific data. Other data may require a different combo.

The source code contains both the code for generating the 16000 unsigned long long pairs and for testing different constants (index sizes and rehash values):

#include <windows.h>

#define i8    signed char
#define i16          short
#define i32          long
#define i64          long long
#define id  i64
#define u8           char
#define u16 unsigned short
#define u32 unsigned long
#define u64 unsigned long long
#define ud  u64

#include <string.h>
#include <stdio.h>

u64 prime_find_next     (const u64 value);
u64 prime_find_previous (const u64 value);

static inline volatile unsigned long long rdtsc_to_rax (void)
{
  unsigned long long lower,upper;

  asm volatile( "rdtsc\n"
                : "=a"(lower), "=d"(upper));
  return lower|(upper<<32);
}

static u16 index[65536];

static u64 nindeces,rehshFactor;

static struct PAIRS {u64 oval,rval;} pairs[16000] = {
#include "pairs.h"
};

struct HASH_STATS
{
  u64 ninvocs,nrhshs,nworst;
} getOpStats,putOpStats;

i8 putOp (u16 index[], const struct PAIRS data[], const u64 oval, const u64 ci)
{
  u64 nworst=1,ho,h,i;
  i8 success=1;

  ++putOpStats.ninvocs;
  ho=oval%nindeces;
  h=ho;
  do
  {
    i=index[h];
    if (i==0xffff)    /* unused position */
    {
      index[h]=(u16)ci;
      goto added;
    }
    if (oval==data[i].oval) goto duplicate;

    ++putOpStats.nrhshs;
    ++nworst;

    h+=rehshFactor;
    if (h>=nindeces) h-=nindeces;
  } while (h!=ho);

  exhausted:    /* should not happen */
  duplicate:
    success=0;

  added:
  if (nworst>putOpStats.nworst) putOpStats.nworst=nworst;

  return success;
}

i8 getOp (u16 index[], const struct PAIRS data[], const u64 oval, u64 *rval)
{
  u64 ho,h,i;
  i8 success=1;

  ho=oval%nindeces;
  h=ho;
  do
  {
    i=index[h];
    if (i==0xffffu) goto not_found;    /* unused position */

    if (oval==data[i].oval)
    {
      *rval=data[i].rval;    /* fetch the replacement value */
      goto found;
    }

    h+=rehshFactor;
    if (h>=nindeces) h-=nindeces;
  } while (h!=ho);

  exhausted:
  not_found:    /* should not happen */
    success=0;

  found:

  return success;
}

volatile i8 stop = 0;

int main (int argc, char *argv[])
{
  u64 i,rval,mulup,divdown,start;
  double ave;

  SetThreadAffinityMask (GetCurrentThread(), 0x00000004ull);

  divdown=5;   //5
  while (divdown<=100)
  {
    mulup=3;  // 3
    while (mulup<divdown)
    {
      nindeces=16000;
      while (nindeces<65500)
      {
        nindeces=   prime_find_next     (nindeces);
        rehshFactor=nindeces*mulup/divdown;
        rehshFactor=prime_find_previous (rehshFactor);

        memset (index, 0xff, sizeof(index));
        memset (&putOpStats, 0, sizeof(struct HASH_STATS));

        i=0;
        while (i<16000)
        {
          if (!putOp (index, pairs, pairs[i].oval, (u16) i)) stop=1;

          ++i;
        }

        ave=(double)(putOpStats.ninvocs+putOpStats.nrhshs)/(double)putOpStats.ninvocs;
        if (ave<1.5 && putOpStats.nworst<15)
        {
          start=rdtsc_to_rax ();
          i=0;
          while (i<16000)
          {
            if (!getOp (index, pairs, pairs[i^0x0444].oval, &rval)) stop=1;
            ++i;
          }
          start=rdtsc_to_rax ()-start+8000;   /* 8000 is half of 16000 (pairs), for rounding */

          printf ("%u;%u;%u;%u;%1.3f;%u;%u\n", (u32)mulup, (u32)divdown, (u32)nindeces, (u32)rehshFactor, ave, (u32) putOpStats.nworst, (u32) (start/16000ull));

          goto found;
        }

        nindeces+=2;
      }
      printf ("%u;%u\n", (u32)mulup, (u32)divdown);

      found:
      mulup=prime_find_next (mulup);
    }
    divdown=prime_find_next (divdown);
  }

  SetThreadAffinityMask (GetCurrentThread(), 0x0000000fu);

  return 0;
}

It was not possible to include the generated pairs file (an answer is apparently limited to 30000 characters). But send a message to my inbox and I'll mail it.
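
The prime helpers prime_find_next() and prime_find_previous() are declared but not defined in the listing. A plain trial-division version, consistent with how they are used in main() and reusing the u64 macro from above (my assumption about their semantics, not the author's code):

static u64 is_prime (const u64 v)
{
  u64 d;
  if (v < 2)        return 0;
  if (v < 4)        return 1;               /* 2 and 3 */
  if ((v & 1) == 0) return 0;
  for (d = 3; d*d <= v; d += 2)
    if (v % d == 0) return 0;
  return 1;
}

u64 prime_find_next (const u64 value)       /* smallest prime strictly greater than value,  */
{                                           /* so the mulup/divdown loops in main() advance */
  u64 v = value+1;
  while (!is_prime (v)) ++v;
  return v;
}

u64 prime_find_previous (const u64 value)   /* largest prime not greater than value */
{
  u64 v = value;
  while (v > 2 && !is_prime (v)) --v;
  return v;
}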

And these are the results:

3;5;35569;21323;1.390;14;73
3;7;33577;14389;1.435;14;60
5;7;32069;22901;1.474;14;61
3;11;35107;9551;1.412;14;59
5;11;33967;15427;1.446;14;61
7;11;34583;22003;1.422;14;59
3;13;34253;7901;1.439;14;61
5;13;34039;13063;1.443;14;60
7;13;32801;17659;1.456;14;60
11;13;33791;28591;1.436;14;59
3;17;34337;6053;1.413;14;59
5;17;32341;9511;1.470;14;61
7;17;32507;13381;1.474;14;62
11;17;33301;21529;1.454;14;60
13;17;34981;26737;1.403;13;59
3;19;33791;5333;1.437;14;60
5;19;35149;9241;1.403;14;59
7;19;33377;12289;1.439;14;97
11;19;34337;19867;1.417;14;59
13;19;34403;23537;1.430;14;61
17;19;33923;30347;1.467;14;61
3;23;33857;4409;1.425;14;60
5;23;34729;7547;1.429;14;60
7;23;32801;9973;1.456;14;61
11;23;33911;16127;1.445;14;60
13;23;33637;19009;1.435;13;60
17;23;34439;25453;1.426;13;60
19;23;33329;27529;1.468;14;62
3;29;32939;3391;1.474;14;62
5;29;34543;5953;1.437;13;60
7;29;34259;8263;1.414;13;59
11;29;34367;13033;1.409;14;60
13;29;33049;14813;1.444;14;60
17;29;34511;20219;1.422;14;60
19;29;33893;22193;1.445;13;61
23;29;34693;27509;1.412;13;92
3;31;34019;3271;1.441;14;60
5;31;33923;5449;1.460;14;61
7;31;33049;7459;1.442;14;60
11;31;35897;12721;1.389;14;59
13;31;35393;14831;1.397;14;59
17;31;33773;18517;1.425;14;60
19;31;33997;20809;1.442;14;60
23;31;34841;25847;1.417;14;59
29;31;33857;31667;1.426;14;60
3;37;32569;2633;1.476;14;61
5;37;34729;4691;1.419;14;59
7;37;34141;6451;1.439;14;60
11;37;34549;10267;1.410;13;60
13;37;35117;12329;1.423;14;60
17;37;34631;15907;1.429;14;63
19;37;34253;17581;1.435;14;60
23;37;32909;20443;1.453;14;61
29;37;33403;26177;1.445;14;60
31;37;34361;28771;1.413;14;59
3;41;34297;2503;1.424;14;60
5;41;33587;4093;1.430;14;60
7;41;34583;5903;1.404;13;59
11;41;32687;8761;1.440;14;60
13;41;34457;10909;1.439;14;60
17;41;34337;14221;1.425;14;59
19;41;32843;15217;1.476;14;62
23;41;35339;19819;1.423;14;59
29;41;34273;24239;1.436;14;60
31;41;34703;26237;1.414;14;60
37;41;33343;30089;1.456;14;61
3;43;34807;2423;1.417;14;59
5;43;35527;4129;1.413;14;60
7;43;33287;5417;1.467;14;61
11;43;33863;8647;1.436;14;60
13;43;34499;10427;1.418;14;78
17;43;34549;13649;1.431;14;60
19;43;33749;14897;1.429;13;60
23;43;34361;18371;1.409;14;59
29;43;33149;22349;1.452;14;61
31;43;34457;24821;1.428;14;60
37;43;32377;27851;1.482;14;81
41;43;33623;32057;1.424;13;59
3;47;33757;2153;1.459;14;61
5;47;33353;3547;1.445;14;61
7;47;34687;5153;1.414;13;59
11;47;34519;8069;1.417;14;60
13;47;34549;9551;1.412;13;59
17;47;33613;12149;1.461;14;61
19;47;33863;13687;1.443;14;60
23;47;35393;17317;1.402;14;59
29;47;34747;21433;1.432;13;60
31;47;34871;22993;1.409;14;59
37;47;34729;27337;1.425;14;59
41;47;33773;29453;1.438;14;60
43;47;31253;28591;1.487;14;62
3;53;33623;1901;1.430;14;59
5;53;34469;3229;1.430;13;60
7;53;34883;4603;1.408;14;59
11;53;34511;7159;1.412;13;59
13;53;32587;7963;1.453;14;60
17;53;34297;10993;1.432;13;80
19;53;33599;12043;1.443;14;64
23;53;34337;14897;1.415;14;59
29;53;34877;19081;1.424;14;61
31;53;34913;20411;1.406;13;59
37;53;34429;24029;1.417;13;60
41;53;34499;26683;1.418;14;59
43;53;32261;26171;1.488;14;62
47;53;34253;30367;1.437;14;79
3;59;33503;1699;1.432;14;61
5;59;34781;2939;1.424;14;60
7;59;35531;4211;1.403;14;59
11;59;34487;6427;1.420;14;59
13;59;33563;7393;1.453;14;61
17;59;34019;9791;1.440;14;60
19;59;33967;10937;1.447;14;60
23;59;33637;13109;1.438;14;60
29;59;34487;16943;1.424;14;59
31;59;32687;17167;1.480;14;61
37;59;35353;22159;1.404;14;59
41;59;34499;23971;1.431;14;60
43;59;34039;24799;1.445;14;60
47;59;32027;25471;1.499;14;62
53;59;34019;30557;1.449;14;61
3;61;35059;1723;1.418;14;60
5;61;34351;2803;1.416;13;60
7;61;35099;4021;1.412;14;59
11;61;34019;6133;1.442;14;60
13;61;35023;7459;1.406;14;88
17;61;35201;9803;1.414;14;61
19;61;34679;10799;1.425;14;101
23;61;34039;12829;1.441;13;60
29;61;33871;16097;1.446;14;60
31;61;34147;17351;1.427;14;61
37;61;34583;20963;1.412;14;59
41;61;32999;22171;1.452;14;62
43;61;33857;23857;1.431;14;98
47;61;34897;26881;1.431;14;60
53;61;33647;29231;1.434;14;60
59;61;32999;31907;1.454;14;60
3;67;32999;1471;1.455;14;61
5;67;35171;2621;1.403;14;59
7;67;33851;3533;1.463;14;61
11;67;34607;5669;1.437;14;60
13;67;35081;6803;1.416;14;61
17;67;33941;8609;1.417;14;60
19;67;34673;9829;1.427;14;60
23;67;35099;12043;1.415;14;60
29;67;33679;14563;1.452;14;61
31;67;34283;15859;1.437;14;60
37;67;32917;18169;1.460;13;61
41;67;33461;20443;1.441;14;61
43;67;34313;22013;1.426;14;60
47;67;33347;23371;1.452;14;61
53;67;33773;26713;1.434;14;60
59;67;35911;31607;1.395;14;58
61;67;34157;31091;1.431;14;63
3;71;34483;1453;1.423;14;59
5;71;34537;2423;1.428;14;59
7;71;33637;3313;1.428;13;60
11;71;32507;5023;1.465;14;79
13;71;35753;6529;1.403;14;59
17;71;33347;7963;1.444;14;61
19;71;35141;9397;1.410;14;59
23;71;32621;10559;1.475;14;61
29;71;33637;13729;1.429;14;60
31;71;33599;14657;1.443;14;60
37;71;34361;17903;1.396;14;59
41;71;33757;19489;1.435;14;61
43;71;34583;20939;1.413;14;59
47;71;34589;22877;1.441;14;60
53;71;35353;26387;1.418;14;59
59;71;35323;29347;1.406;14;59
61;71;35597;30577;1.401;14;59
67;71;34537;32587;1.425;14;59
3;73;34613;1409;1.418;14;59
5;73;32969;2251;1.453;14;62
7;73;33049;3167;1.448;14;61
11;73;33863;5101;1.435;14;60
13;73;34439;6131;1.456;14;60
17;73;33629;7829;1.455;14;61
19;73;34739;9029;1.421;14;60
23;73;33071;10399;1.469;14;61
29;73;33359;13249;1.460;14;61
31;73;33767;14327;1.422;14;59
37;73;32939;16693;1.490;14;62
41;73;33739;18947;1.438;14;60
43;73;33937;19979;1.432;14;61
47;73;33767;21739;1.422;14;59
53;73;33359;24203;1.435;14;60
59;73;34361;27767;1.401;13;59
61;73;33827;28229;1.443;14;60
67;73;34421;31583;1.423;14;71
71;73;33053;32143;1.447;14;60
3;79;35027;1327;1.410;14;60
5;79;34283;2161;1.432;14;60
7;79;34439;3049;1.432;14;60
11;79;34679;4817;1.416;14;59
13;79;34667;5701;1.405;14;59
17;79;33637;7237;1.428;14;60
19;79;34469;8287;1.417;14;60
23;79;34439;10009;1.433;14;60
29;79;33427;12269;1.448;13;61
31;79;33893;13297;1.445;14;61
37;79;33863;15823;1.439;14;60
41;79;32983;17107;1.450;14;60
43;79;34613;18803;1.431;14;60
47;79;33457;19891;1.457;14;61
53;79;33961;22777;1.435;14;61
59;79;32983;24631;1.465;14;60
61;79;34337;26501;1.428;14;60
67;79;33547;28447;1.458;14;61
71;79;32653;29339;1.473;14;61
73;79;34679;32029;1.429;14;64
3;83;35407;1277;1.405;14;59
5;83;32797;1973;1.451;14;60
7;83;33049;2777;1.443;14;61
11;83;33889;4483;1.431;14;60
13;83;35159;5503;1.409;14;59
17;83;34949;7151;1.412;14;59
19;83;32957;7541;1.467;14;61
23;83;32569;9013;1.470;14;61
29;83;33287;11621;1.474;14;61
31;83;33911;12659;1.448;13;60
37;83;33487;14923;1.456;14;62
41;83;33587;16573;1.438;13;60
43;83;34019;17623;1.435;14;60
47;83;31769;17987;1.483;14;62
53;83;33049;21101;1.451;14;61
59;83;32369;23003;1.465;14;61
61;83;32653;23993;1.469;14;61
67;83;33599;27109;1.437;14;61
71;83;33713;28837;1.452;14;61
73;83;33703;29641;1.454;14;61
79;83;34583;32911;1.417;14;59
3;89;34147;1129;1.415;13;60
5;89;32797;1831;1.461;14;61
7;89;33679;2647;1.443;14;73
11;89;34543;4261;1.427;13;60
13;89;34603;5051;1.419;14;60
17;89;34061;6491;1.444;14;60
19;89;34457;7351;1.422;14;79
23;89;33529;8663;1.450;14;61
29;89;34283;11161;1.431;14;60
31;89;35027;12197;1.411;13;59
37;89;34259;14221;1.403;14;59
41;89;33997;15649;1.434;14;60
43;89;33911;16127;1.445;14;60
47;89;34949;18451;1.419;14;59
53;89;34367;20443;1.434;14;60
59;89;33791;22397;1.430;14;59
61;89;34961;23957;1.404;14;59
67;89;33863;25471;1.433;13;60
71;89;35149;28031;1.414;14;79
73;89;33113;27143;1.447;14;60
79;89;32909;29209;1.458;14;61
83;89;33617;31337;1.400;14;59
3;97;34211;1051;1.448;14;60
5;97;34807;1789;1.430;14;60
7;97;33547;2417;1.446;14;60
11;97;35171;3967;1.407;14;89
13;97;32479;4349;1.474;14;61
17;97;34319;6011;1.444;14;60
19;97;32381;6337;1.491;14;64
23;97;33617;7963;1.421;14;59
29;97;33767;10093;1.423;14;59
31;97;33641;10739;1.447;14;60
37;97;34589;13187;1.425;13;60
41;97;34171;14437;1.451;14;60
43;97;31973;14159;1.484;14;62
47;97;33911;16127;1.445;14;61
53;97;34031;18593;1.448;14;80
59;97;32579;19813;1.457;14;61
61;97;34421;21617;1.417;13;60
67;97;33739;23297;1.448;14;60
71;97;33739;24691;1.435;14;60
73;97;33863;25471;1.433;13;60
79;97;34381;27997;1.419;14;59
83;97;33967;29063;1.446;14;60
89;97;33521;30727;1.441;14;60

Cols 1 and 2 are used to calculate a rough relationship between the rehash value and the index size. The next two are the first index size/rehash factor combination which averages less than 1.5 searches for a lookup with a worst case of 14 searches. Then average and worst case. Finally, the last column is the average number of clock cycles per lookup. It does not take into account the time required to read the time stamp register.

The actual memory space for the smallest index size found (number of indices = 31253, rehash factor = 28591) comes out to more than I initially indicated (16000*2*8 + 1.25*16000*2 => 296000 bytes). The actual size is 16000*2*8 + 31253*2 => 318506 bytes.

The fastest combination is an approximate ratio of 11/31 with an index size of 35897 and rehash value of 12721. This will average 1.389 (1 initial hash + 0.389 rehashes) with a maximum of 14 (1+13).

________ EDIT________

I removed the "goto found;" in main () to show all combinations, and it shows that much better performance is possible, of course at the expense of a larger index size. For example the combination 57667 and 33797 yields an average of 1.192 and a maximum rehash of 6. The combination 44543 and 23399 yields a 1.249 average and 10 maximum rehashes (it saves (57667-44543)*2 = 26248 bytes of index table compared to 57667/33797).

Specialized functions with hard-coded hash index size and rehash factor will execute in 60-70% of the time compared to using variables. This is probably due to the compiler (gcc 64-bit) replacing the modulo with multiplications and not having to fetch the values from memory, as they will be coded as immediate values.
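
To illustrate that point, here is what such a specialized lookup might look like with the fastest constants from the table (index size 35897, rehash value 12721) hard-coded. This is my sketch of the idea, reusing the types and arrays from the listing above and assuming the index was built with the same constants; it is not the author's actual getOpSpec:

enum { NINDECES_FIXED = 35897, REHSH_FIXED = 12721 };   /* the 11/31 combination above */

i8 getOpFixed (u16 index[], const struct PAIRS data[], const u64 oval, u64 *rval)
{
  u64 ho=oval%NINDECES_FIXED;   /* modulo by a compile-time constant: the compiler can  */
  u64 h=ho,i;                   /* turn it into multiply/shift and never has to fetch   */
                                /* the constants from memory                            */
  do
  {
    i=index[h];
    if (i==0xffffu) return 0;                  /* unused position: not found */
    if (oval==data[i].oval)
    {
      *rval=data[i].rval;
      return 1;
    }
    h+=REHSH_FIXED;
    if (h>=NINDECES_FIXED) h-=NINDECES_FIXED;
  } while (h!=ho);

  return 0;                                    /* cycled all the way round: not found */
}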

________ EDIT________

On the subject of caches I see two issues.

The first is data caching, which I don't think will be possible because the lookup will just be a small step in some larger process, and you run the risk of the table data's cache lines being invalidated to a lesser or (probably) greater degree - if not entirely - by other data accesses in other steps of the larger process. I.e. the more code executed and data accessed in the process as a whole, the less likely it is that any pertinent lookup data will remain in the caches (this may or may not be pertinent to the OP's situation). To find an entry using (my) hashing you will encounter two cache misses (one to load the correct part of the index, and the other to load the area containing the entry itself) for every comparison that needs to be performed. Finding an entry on the first try will have cost two misses, the second try four, etc. In my example the 60 clock cycle average cost per lookup implies that the table probably resided entirely in the L2 cache, with L1 serving the majority of cases. My x86-64 CPU has L1-L3 and RAM wait states of approximately 4, 10, 40 and 100 cycles, which to me shows that RAM was kept out completely and L3 mostly.

The second is code caching, which will have a more significant impact if the lookup code is small, tight, in-lined and has few control transfers (jumps and calls). My hash routine probably resides entirely in the L1 code cache. For more normal cases, the fewer code cache line loads, the faster it will be.

Olof Forshell
  • ...and still no code. I'll let you in on a secret. The reason I've been pushing you for code, is because while bsearch() qsort() are simple, easy, and in the lib, you would have to make a substantial investment of time to create a hash function to satisfy the requirements above. IE: being so reluctant to do the work of creating and performance testing the proposed hash function, you're making my point for me. Also, my claim is not that the rehash is computationally expensive, it's that other things will hash to the same rehash location, causing cascading failures. –  Dec 21 '12 at 08:23
  • Then I'll let you in on a secret. The OP wrote "fastest way to store and retrieve" which you and I obviously read quite differently. Though I can't look inside your head I assume you have focused on "fastest way to get something up and running" whereas I read it as "the most processing-efficient way (i.e. fastest execution time)." Nobody with any significant knowledge on hashing - oh yes, I am one of them - will claim that efficient hash algorithms are easy to implement. But that was not an issue the OP appeared interested in so I didn't assume he was. – Olof Forshell Dec 21 '12 at 12:29
  • My RX to the OP was based on balancing the potential minor improvement in performance (a bad hash would perform worse) against a very large investment of time. Since others here had already benchmarked bsearch() and found it nearly an order of magnitude beyond the OP's requirements, that was the correct call. Now if money/time is no object, and you want to learn a lot about writing hash routines, then you might want to use a hash, but the OP was looking for off-the-shelf solutions, so that wasn't his situation. –  Dec 21 '12 at 23:24
  • Like I said: our respective understandings of the word "fastest" are quite different. I read it literally and try to answer the question accordingly. You read in lots of other things which may or may not be pertinent. I guess if someone were to ask us "what's the fastest way to lap at Indianapolis" I'd give a backgrounder on racing car construction (if that were my field of expertise) and you'd say buy a standard car, drop in the biggest engine you can find and drive. So my approach is to explain from a theoretical/practical standpoint and yours is time to market. – Olof Forshell Dec 22 '12 at 07:37
  • I'll let you grind your axe in peace. You've proven nothing, except you're willing to make unsubstantiated claims, being too busy arguing your point in a vacuum here to write some actual code. I've yet to meet a hash bigot that didn't take it as axiomatic that if the hash didn't outperform the bsearch() it wasn't a good enough hash function - no matter how much time was invested in it. –  Dec 22 '12 at 22:55
  • I've seen some good benchmarks comparing the STL's map and unordered set containers, and for those the unordered set (aka, a hashtable) was about 3X as fast. Not sure about search(). The Achilles Heel of hash tables is an unknown # of values, and in some cases, like large, variable-length strings, the poor performance of the hash function. Search would also require that the number of values be known up front. – user2548100 Dec 18 '13 at 01:21
  • @RocketRoy: not even a comment? – Olof Forshell Oct 08 '14 at 07:36
  • @RocketRoy: you'll find them towards the end. Average number of comparisons, max number of rehashes and average number of cpu cycles – Olof Forshell Oct 08 '14 at 09:09
  • Thanks for all your work Olof. It deserves a serious read, so let me do that before commenting much further. Cheers! PS: arrays are a full 1000x faster than any of these associative data structures - on the rare occasions when you have completely dense, contiguous keys. –  Oct 09 '14 at 07:45
  • @RocketRoy: I'm looking forward to it! – Olof Forshell Oct 09 '14 at 07:52
  • @OlofForshell, I think you edged me out by just a hair with your custom hash function, depending on the specific pattern of keys being searched for, while I easily best the STL's unordered_map. If the OP had 32k values, Bsearch2() would win, and by increasingly wider margins as N (the number of values) increases. –  Oct 18 '14 at 19:29
  • @OlofForshell, I'm looking for the pairs.h file. Is that the file you tried to post and it got truncated? If so, please send it via email. TVMIA. –  Oct 22 '14 at 06:09
  • @RocketRoy: sure, where do I send it (750KiB)? Is there an email service within stackoverflow? Looked but didn't find one. – Olof Forshell Oct 22 '14 at 10:38
1

Perform memoization, or in simple terms, cache the values you've computed already and calculate only the new ones. You should hash the input and check the cache for that result. You can even start off with a set of cache values that you think the function will be called for more often. Besides that, I don't think you need to go to any extreme as the other answer suggests. Keep things simple, and when you are done with your application you can use a profiling tool to find bottlenecks.

EDIT: Some code

#include <iostream>
#include <ctime>
using namespace std;

const int MAX_SIZE = 16000;

int preCalcData[MAX_SIZE] = {};

int getPrecalculatedResult(int x){
 return preCalcData[x];
}

void setupPreCalcDataCache(){
  for(int i = 0; i < MAX_SIZE; ++i){
    preCalcData[i] = i*i; //or whatever calculation
  }
}

int main(){
  setupPreCalcDataCache();

  cout << getPrecalculatedResult(0) << endl;
  cout << getPrecalculatedResult(15999) << endl;

  return 0;
}    
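
The array above assumes the inputs are the dense indices 0..15999. For the OP's sparse 64-bit inputs, the same caching idea would more likely be keyed on the input value itself - a minimal sketch, assuming C++11 and std::unordered_map (my addition, not part of the original answer):

#include <cstdint>
#include <unordered_map>

uint64_t calc(uint64_t x);                     // the expensive function, defined elsewhere

std::unordered_map<uint64_t, uint64_t> cache;  // maps input x -> calc(x)

uint64_t getResult(uint64_t x) {
    auto it = cache.find(x);                   // already computed?
    if (it != cache.end())
        return it->second;
    uint64_t r = calc(x);                      // compute once and remember
    cache.emplace(x, r);
    return r;
}

Calling cache.reserve(16000) up front avoids rehashing while the cache fills.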
dchhetri
  • Well OP doesn't necessarily need to convert the numerical value to string in order to get a hash? I think this should be faster than the binary search method – dchhetri Dec 17 '12 at 20:37
  • Asking a Hash function "have I already done this" 16,000 times cannot possibly be faster than already knowing you have. Read the OP's spec. It's a completely redundant question. For the Nth time, Hashing CAN be faster than searching, but unless you know a lot about the data, and know it's size, it seldom is because most data has hot-spots that create cascading collisions. It's nearly impossible to create a perfect Hash function. Conjecture is useless. FASTER can only be answered by benchmarking. Provide some code and we'll know. –  Dec 21 '12 at 08:11
  • To already know you have hashed you have to ask first right? Anyways, if we can spare memory then it will be a constant lookup and code will be very trivial. I say it is feasible in terms of memory because it is only 16000 ints. So as a result, we can simply do CALC[x] and not even hash x. Maybe I am missing something here. – dchhetri Dec 21 '12 at 19:05
  • Made edit to post. It should be simple as that, since the space needed for this problem is relatively small. – dchhetri Dec 22 '12 at 03:12
  • From the OP's list of assumptions... "Now, given that I know ALL possible inputs (the xs) of that function beforehand". Given this, there is no need to compute values on the fly. Just compute them all up front and retrieve them as fast as possible at run-time. –  Oct 18 '14 at 01:13
1

Make an array of structs of key/value pairs.

Sort the array by key and put it in your program as a static array - it would only be about 256 KB (16,000 entries of two u64s).

Then in your program a simple binary lookup by key will need at most about 14 key comparisons (log2 of 16,000) to find the right value. You should be able to approach speeds of 300 million lookups per second on a modern PC.

You can sort with qsort() and search with bsearch(), both standard library functions.
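
A minimal sketch of that approach (illustrative names, not code from the answer):

#include <stdint.h>
#include <stdlib.h>

struct Pair { uint64_t key, val; };

static struct Pair table[16000];             /* filled with the precomputed key/value pairs */

static int cmp_pair (const void *a, const void *b)
{
    uint64_t ka = ((const struct Pair *)a)->key;
    uint64_t kb = ((const struct Pair *)b)->key;
    return (ka > kb) - (ka < kb);            /* no subtraction: a u64 difference can overflow int */
}

/* once at startup, after filling table[]: qsort (table, 16000, sizeof table[0], cmp_pair); */

static int lookup (uint64_t key, uint64_t *out)
{
    struct Pair probe = { key, 0 };
    struct Pair *p = (struct Pair *) bsearch (&probe, table, 16000, sizeof table[0], cmp_pair);
    if (!p) return 0;                        /* key not in the table */
    *out = p->val;
    return 1;
}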

Myforwik
  • Simple, straight-forward, and robust. I would use this every time for the problem as stated. Hashes are fast, but only if you know the data well, and know it won't change. Typical hash use is for key-words in a parser. You know the words, their character content & distribution, and number, so you can optimize the hash. Not your problem AFAICT. –  Dec 16 '12 at 07:23
  • I would only allocate the storage on the stack if the number (~16,000) never changes. Otherwise you'll either have to constantly change the array size, allocate one arbitrarily large, or have your program blow up. IE: a maintenance nightmare. Use malloc() and allocate the exact storage you need. Foolproof. –  Dec 16 '12 at 07:39
  • 300 million lookups per second. On a 3GHz machine this comes out to one lookup every 10 clock cycles, which means about 0.7 clock cycles per comparison. How do you propose to handle the (at least) three instructions (fetch, compare and conditional jump) for each comparison, plus several more to find the new comparison position, not to mention cache misses? Multiple cores are of no help in case you didn't know. Back to the drawing board for you. – Olof Forshell Dec 16 '12 at 08:22
  • @RocketRoy: if you have a need for speed you use the tools that can be brought to bear. Granted a binary search is faster to implement but will be up to an order of magnitude slower than a properly functioning hash routine. Foolproof? Not if you want the fastest possible solution. – Olof Forshell Dec 16 '12 at 08:31
  • @OlofForshell ..and what do you do about collisions? Call malloc() and resort to linear searches of the extensions? You can't avoid collisions unless you know a lot about the nature of the data, AND, are willing to allocate at least 125% of the space that will be used. I've used and tested both extensively. Hashes never live up to their theoretical performance potential with this kind of data, and they take a lot of time to get right if you do know your data. Rehashes kill performance. Total wipeout. –  Dec 16 '12 at 09:08
  • @OlofForshell Your numerical analysis is correct. 300 million per second is not a reasonable estimate, but 3 million per second is the upper limit of performance requested, and that is definitely in range. Only way to know is to benchmark it. Keep in mind that all the center splits are going to be in the data cache, since they are always the same elements. With rehashes you are off the processor, waiting for memory. –  Dec 16 '12 at 09:20
  • @RocketRoy: there are many ways to handle collisions. I don't see how mallocs would apply to a fixed-size list of pairs though. Re-hashes need be no more complicated than pointing to the next possible index relative to the current one. Total wipeout? What? Where? Who? You? – Olof Forshell Dec 16 '12 at 09:21
  • Once you run out of primary slots, you either rob other hash locations in the table, or extend the table by allocating more storage with malloc(). The former causes cascading collisions with follow-on hashes, the latter leads to waiting for malloc() building the table, and doing a linear search of memory NOT on the cache on retrieval. Your alleged speed advantage has just been wiped out. (and apparently, your manners) You're making an assertion here. Write some code, post it, and we'll benchmark it - or concede the point. –  Dec 16 '12 at 09:25
  • @RocketRoy: the upper limit was "some thousands-to-millions of times, per second." I don't see an upper limit of three million there. Besides I'm assuming the computer in question will be doing other things than just computing the hash key. So if we take three million searches taking up 10% of the CPU capacity we need a capacity of thirty million at 100%. The problem won't go away. – Olof Forshell Dec 16 '12 at 09:26
  • @RocketRoy: sixteen million pairs pointed to by an index table of twenty million entries. "Rob other hash locations?" What, from themselves? Is there only one way to handle collisions - i e your way? – Olof Forshell Dec 16 '12 at 09:28
  • Looks like you need to reread the spec. " NOT consecutive (e.g. not from 1 to 16000," You're still just talk. Prove your claim, or concede the point. –  Dec 16 '12 at 09:41
  • @RocketRoy: my method uses the existing index to handle collisions. It is 25% larger than the number of data items it points to for this very reason. With a good algorithm - not all that easy to achieve for sure - the number of collisions can be kept at bay. On a different note: your usage of expressions such as "foolproof" and "total wipeout" give the impression of you sitting as some sort of almighty "CS grader" - what a joke. As are my manners. – Olof Forshell Dec 16 '12 at 09:45
  • @RocketRoy: "reread" you say. An index vector pointing to list of entries (in this case u64 pairs). Ooops I see I wrote "index table pointing to ..." In any case it's a common method used in hash tables, sort situations etc to reduce processing. The OP states "16000 thousand" which I assume means sixteen million. A somewhat larger task than your 16000. – Olof Forshell Dec 16 '12 at 09:52
  • There was conflicting info, but the OP has responded to my request for clarification. 16,000 it is. You are correct that 3 orders of magnitude would change the nature of the problem. –  Dec 16 '12 at 22:43
  • I've written and benchmarked everything on this page, so with respect to Visual C and Intel processors, I have a good handle on relative performance of the suggested solutions. The OP has a question. I have an answer. Nothing more, nothing less. –  Dec 16 '12 at 22:44
  • 16000 * (2 u64 + 2 u32) = 384000. That's not possible for an L1 cache and most L2s. Especially if the application does more than just do look-ups. So your data - if you're lucky - will reside mostly in L3. My data requires less space - 16000 * (2 u64) + 20000 * u16 = 296000 bytes - and the separation of index from list should help more of the data to stay at a higher average cache level. Why do I use u32s for your data and not for mine? Efficient (struct) alignment inside the object. If you use u16s you'll get a 4 byte "hole" inside the object. – Olof Forshell Dec 21 '12 at 12:52
  • @OlofForshell, you probably realize by now that a binary search, no matter what key is searched for, will always probe the exact same elements in the top of the "tree", so those do indeed get cached. I've been bsearch()-ing a struct{ulong;ulong;} so 16k * 8 = 131,072. No hole compiling with 1-byte struct alignment. My sorted array needs NO additional padding to avoid collisions. Everything is resolved precisely to an array index. –  Oct 16 '14 at 22:29
  • @RocketRoy: cacheing must certainly be taken into account although in this particular application an average of 60 clock cycles per generic search or 36-42 in a hard-coded one is not bettered without effort. My example xors the index to lessen the impact of caches, a form of randomizing access to the pairs. – Olof Forshell Oct 17 '14 at 15:39
0

I wouldn't worry about performance too much. This simple example, using an array and binary search lower_bound

#include <stdint.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <memory>

const int N = 16000;
typedef std::pair<uint64_t, uint64_t> CALC;
CALC calc[N];

static inline bool cmp_calcs(const CALC &c1, const CALC &c2)
{
    return c1.first < c2.first;
}

int main(int argc, char **argv)
{
    std::iostream::sync_with_stdio(false);
    for (int i = 0; i < N; ++i)
        calc[i] = std::make_pair(i, i);

    std::sort(&calc[0], &calc[N], cmp_calcs);

    for (long i = 0; i < 10000000; ++i) {
        int r = rand() % 16000;
        CALC *p = std::lower_bound(&calc[0], &calc[N], std::make_pair(r, 0), cmp_calcs);
        if (p->first == r)
            std::cout << "found\n";
    }

    return 0;
}

and compiled with

g++ -O2 example.cpp

does, including setup, 10,000,000 searches in about 2 seconds on my 5 year old PC.

Olaf Dietsche
  • YOU wouldn't worry but it appears the OP may well be. – Olof Forshell Dec 16 '12 at 08:47
  • @OlofForshell Within the constraints of about 16000 entries with some thousands to millions lookups per second, this is well achieved with my old PC. A current machine should do a lot better. So, when the goal is met, *I* wouldn't worry. ;-) – Olaf Dietsche Dec 16 '12 at 13:19
  • @OlafDietsche, how about 40 million per second? –  Oct 17 '14 at 22:45
  • @RocketRoy I guess, this depends on the hardware you have available. When you consider the search alone (no rand(), no output), this should be easily doable. My current PC (2013, AMD A10) does 40 million in 1.5 seconds. – Olaf Dietsche Oct 18 '14 at 11:31
  • Very true. Did you use my Bsearch2() code below for your benchmark? How many clocks/search? I realized about a year ago that at least 2 generations of programmers have come up with cheap, ubiquitous databases like MySQL, and so have no clue how to manage structured data any other way. This was in part my motivation for engaging on this question. –  Oct 18 '14 at 18:51
0

You need to store 16 thousand values efficiently, preferably in memory. We are assuming that the computation of these values is more time consuming than accessing them from storage.

You have at your disposal many different data structures to get the job done, including databases. If you access these values in queryable chunks, then the DB overhead may very well be absorbed and spread across your processing.

You mentioned map and hashmap (or hashtable) already in your question tags, but these are probably not the best possible answers for your problem, although they could do a fair job, provided that the hashing function isn't more expensive than the direct computation of the target UINT64 value, which has to be your reference benchmark.

Tree structures are probably much better suited. Having some experience with them, I would probably go for a B-tree: they support serialization fairly well. That should let you prepare your dataset in advance in a different program. vEB trees have very good access time (O(log log n)), but I don't know how easily they may be serialized.

Later on, if you need even more performance, it would also be interesting to know usage patterns of your "database" to figure out what caching techniques you could implement on top of the store.

didierc
  • This answer is clearly incorrect. Trees are wonderful things, but completely inappropriate where you know how many entries you need to store in advance. Their advantage is they are self-extending. They are expensive though, and take up a lot of memory. Hash or bsearch() are the correct answers. Databases are too slow by many thousands of times. –  Dec 16 '12 at 07:15
  • @RocketRoy databases are slow because they need to parse and compile high-level language queries (SQL-like), but when it comes to indexing, the algorithms they use are fast enough. The problem is to find a good tradeoff between speed and ease of use, and b-trees are good candidates for that. – didierc Dec 16 '12 at 16:13
  • @RocketRoy thank you for your kind advice. I'd never used bsearch before, and it seems a very handy function to do dichotomic search on a sorted array. Now may I kindly suggest that you do the same and educate yourself on what a B-tree exactly is? And I am not talking about binary trees. – didierc Dec 17 '12 at 08:42
  • You have no way to access the internals of the database, and even if you did, B-trees are used for retrieving blocks of data off of disk, 10 yrs ago 16k blocks, and now 64k blocks, or even larger. The base of the block is navigated to just like a binary tree, but inside the block data is found by doing a linear search. IE: an element by element comparison. This slows them down to a crawl relative to entirely memory-based data structures, but is faster for disk-based as it reduces the number of disk searches for any given number of entries. –  Oct 09 '14 at 07:57
  • A better compromise, given that the OP knows the number of entries up front, is to put the sorted array on disk and just do the probes against the disk. As the probes in the upper part of the search are always the same, they get cached, so this is surprisingly fast and extremely memory efficient. Using memory-mapped files increases performance substantially as the upper probes in many simultaneous searches are able to use the CPU's MMU to intelligently cache probes. I wrote a rather extensive OLAP DB using this approach and was very happy with performance and maintenance. –  Oct 09 '14 at 08:03
  • A variation of this is to leave a sorted array that serves as an index in memory, and bsearch() that, where what's associated with the sorted key is the record number on disk holding the corresponding data. –  Oct 10 '14 at 00:17
-2

Using std::pair is better than any kind of map for speed.

But if I were you, I would first use a std::list to collect the data; after I had them all, I would move them into a simple vector, and then retrieval goes very fast if you implement a simple binary search yourself.

tomriddle_1234
  • There is absolutely no reason to use a list, which has the worst performance of any data structure for this problem. The OP made it VERY clear he knows all the keys up front. Just allocate a vector, or use push_back() if you must, and be done with it. DS&A 101. Also, pair is a tuple/struct{}, it has no implication for speed. It is irrelevant whether you search a pair or a struct{}, as internally they're treated the same and take up the exact same storage. –  Oct 20 '14 at 02:37
  • I saw some really good answers here, I guess I didn't see the OP's supplement when he opened this question at the first place, there are lots to learn for me here. – tomriddle_1234 Oct 20 '14 at 06:52
  • Yes, I agree. A lot of thinking went into the answers on this question. IE: it occurred to me that bsearch() could handle keys like SSN or phone number by creating a struct{int_64 SSN_Key:34; int_64 Value:20;} where 34 bits will cover 0-9,999,999,999 and 20 bits is enough room for the user's value/s - like a short and a char, or truncated long int. Finally, imagine you have big records on disk, and just want to bsearch() the key column. If you subtract the returned ptr from the base of the index array you get the corresponding record number on disk with just a key, not a key|value pair. –  Oct 20 '14 at 16:34
  • I'll leave my faulty answer here, so that your effort isn't wasted. – tomriddle_1234 Oct 22 '14 at 01:45