
I have been profiling our application and found something obvious: the memory allocator is called a lot and consumes significant time (a few percent). Last year I made the memory allocator many times faster, but I still think I can speed it up some more. So as part of that optimization, I want to speed up the part of the code that quantizes allocation sizes.

The memory allocator keeps lists of free chunks of memory. There is an array of 832 lists, one for each allocation size from 0..128k: every allocation request from 0..128k is converted to one of 832 quanta (is quanta the right word?). The 832 is arbitrary and fell out of the scheme I came up with below. I was balancing the desire not to waste memory against the desire for a high amount of reuse, and I also wanted to use as few bits as possible to store the size of an allocation. In our application, small sizes are requested much more often than large sizes, i.e. reuse is higher for smaller sizes.

Everything is aligned to 8 bytes, so the smallest quantum is 8 bytes. I chose to quantize all allocations below 2k to 8 bytes so as not to waste any more RAM than alignment already requires. Also, to save space, when a chunk is added to a free list I use its first 8 bytes as a next pointer, so I can't go below 8 bytes for that reason either. From 2k..8k the quantum is 32 bytes, from 8k..32k it is 128 bytes, and from 32k..128k it is 512 bytes. As the request size goes up you can use larger quanta and still keep the percentage of wasted memory low. Because I have only 832 sizes, reuse is high, even for the larger/rarer allocations.
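
For reference, this is where the 832 comes from:

    0k..2k    in   8 byte steps :   2048 / 8              = 256 lists
    2k..8k    in  32 byte steps :  (8192 - 2048) / 32     = 192 lists
    8k..32k   in 128 byte steps :  (32768 - 8192) / 128   = 192 lists
    32k..128k in 512 byte steps :  (131072 - 32768) / 512 = 192 lists
                                                            ---------
                                                             832 lists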

Here is the function that quantizes allocation requests. iRecycle is the index into the array of lists; it goes from 0..831.

void GetAlignedSize(QWORD cb, QWORD& cbPack8, WORD& iRecycle) {
  // we assume cb is small, so the first 'if' will be hit the most.
  if (cb < 0x000800 - 0x0007) {        //  0k..2k   =   8 byte chunks
    cb += 0x0007; cbPack8 = cb & (~0x0007); // pad to 8
    iRecycle = 000 + WORD(cb >>  3);  
  } else if (cb < 0x002000 - 0x001f) { //  2k..8k   =  32 byte chunks
    cb += 0x001f; cbPack8 = cb & (~0x001f); // pad to 32
    iRecycle = 192 + WORD(cb >>  5);  
  } else if (cb < 0x008000 - 0x007f) { //  8k..32k  = 128 byte chunks
    cb += 0x007f; cbPack8 = cb & (~0x007f); // pad to 128
    iRecycle = 384 + WORD(cb >>  7);  
  } else if (cb < 0x020000 - 0x01ff) { // 32k..128k = 512 byte chunks 
    cb += 0x01ff; cbPack8 = cb & (~0x01ff); // pad to 512
    iRecycle = 576 + WORD(cb >>  9);  
  } else { 
    cbPack8 = Pack8(cb);
    iRecycle = 0;
  }
}
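
For concreteness, here is how a few boundary requests come out of this function:

    cb =      1  ->  cbPack8 =      8, iRecycle =   1
    cb =   2040  ->  cbPack8 =   2040, iRecycle = 255   (last 8 byte quantum)
    cb =   2041  ->  cbPack8 =   2048, iRecycle = 256   (first 32 byte quantum)
    cb =   8161  ->  cbPack8 =   8192, iRecycle = 448   (first 128 byte quantum)
    cb =  32641  ->  cbPack8 =  32768, iRecycle = 640   (first 512 byte quantum)
    cb = 130560  ->  cbPack8 = 130560, iRecycle = 831   (largest quantized request)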

Here is the question! How can I do something similar to this with bit manipulation only? I want to get rid of the compare statements because I think they break down CPU pipelining. As long as the quantization increases with size, and the number of sizes below 128k is small, any scheme is viable. I expect this will eliminate the last case, and iRecycle will then increase without bound, so we can change iRecycle to a different sized integer.

Thanks for your help!

johnnycrash
    How often does the first case get hit, as a percentage of all calls? Also, this appears to be C++, not C, just going by the reference arguments. – user2357112 Jul 16 '14 at 19:57
  • What compiler and optimization flags are you using? Did you try with a recent [GCC](http://gcc.gnu.org/) 4.9 (using `gcc-4.9 -mtune=native -flto -O3` both for compilation & for linking)? – Basile Starynkevitch Jul 16 '14 at 19:59
  • cb of 249 hits the first if. cb of 250-256 hits the second if, but still quantizes to 256. rewrite the first if as if(cb < 0x800) and do the same for all the other ifs too. Marginal improvement, but easier to read. – Airsource Ltd Jul 16 '14 at 20:03
  • Oops sorry. msvc compiler with full optimizations. It is c++, but c is fine. – johnnycrash Jul 16 '14 at 20:03
  • @Air: I originally had these lines laid out one line per case so everything lined up and you could see the bit patterns being applied. Thanks for the idea. – johnnycrash Jul 16 '14 at 20:05
  • Sorry, I have no time for a full answer, but I can give you a basic idea: since a boolean evaluates to either 0 or 1, you can sum them to get a multiplier like this: (cb > 0x800) + (cb > 0x2000) + (cb > 0x8000) + etc; then you do some arithmetic, for example raise 2 to the power of (2*multiplier + 1) to get the pad value (see the sketch after these comments). – Dmitry Jul 16 '14 at 22:05
  • A few percent? That's not a lot. Suppose you make it infinitely fast, what do you gain? A few percent. There's almost certainly bigger stuff elsewhere. [*Here's an example of 97.3%, for a 43x speedup.*](http://stackoverflow.com/a/927773/23771) – Mike Dunlavey Jul 17 '14 at 00:48
  • @DmitryVasilyanov is correct. It is the `if` statement that is messing up pipelining, not the comparison itself. – jxh Jul 17 '14 at 07:57
  • @Mike: A few percent does not seem like a lot. However no one who came before me considered performance. Every function in our 800,000 lines of code uses every slow c++ technique you could think of. Then, since our software has so many different possible pathways of execution, through hundreds of large functions, there is not a single "slow function" that could be optimized to make the code even 10% faster. Each % = 6 fewer blades, at about $15k per blade. In the past 1.5 years I have gotten it from 900ms per txn to 300ms per txn. – johnnycrash Jul 17 '14 at 18:03
  • @Mike ... The new memory allocator, already 100x faster than the original, is still at the top of the list since it is one of the top 5 most accessed functions. One angle I am taking is to track down the big users of the heap (with the profiler) and convert them over to stack. At the same time, I think I can remove a few more lines and branches from the critical allocation pathway. If I make it infinitely fast, I would save 5%, which isn't possible. I think there is a chance I can make it 20% faster and get my 1 percent. – johnnycrash Jul 17 '14 at 18:13
  • @johnnycrash: The things you're saying I hear a lot, and when people run profilers on their systems, it seems to reinforce it. But in my experience, systems like that are positively dripping with opportunities for *big* speedup, but *profilers don't find them*. I guess I'm a big-mouth on this subject, but lots of programmers substitute excuses for action. I made a crude 8-minute video of [*how I do it, and plenty of other people use the same technique.*](https://www.youtube.com/watch?v=xPg3sRpdW1U) – Mike Dunlavey Jul 17 '14 at 19:03
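
A minimal sketch of the boolean-sum idea from Dmitry's comment (my own rendering in C, with a made-up name; a request of zero or above 128k is not handled, and the bucket boundaries sit at exactly 2k/8k/32k rather than at the original function's cut points):

#include <stddef.h>

/* Hypothetical branch-free rendering of the comment's idea: the comparisons
   still execute, but their results are consumed as 0/1 data rather than as
   branches, so there is nothing for the branch predictor to get wrong.     */
static unsigned
GetAlignedSizeSum(size_t cb, size_t* cbPack8)
{
  size_t sx = cb - 1 ;                                 /* assumes cb != 0   */

  /* tier 0 for 1..2k, 1 for ..8k, 2 for ..32k, 3 for ..128k                */
  unsigned tier  = (cb > 0x800) + (cb > 0x2000) + (cb > 0x8000) ;
  unsigned shift = 3 + 2 * tier ;                      /* 3, 5, 7, 9        */
  size_t   mask  = ((size_t)1 << shift) - 1 ;

  *cbPack8 = (sx | mask) + 1 ;                         /* round up to the quantum */
  return (unsigned)(sx >> shift) + 192 * tier ;        /* index 0..831      */
}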

1 Answer

The obvious thing to do is to use a table. I was sceptical that such a scheme would be quicker, but curious...

...so, to create a baseline I tweaked your function (rendering it in C):

static uint
GetAlignedSize(size_t size, size_t* rsize)
{
  size_t sx = size - 1 ;

  if (size <= 0x00800)
    {
      *rsize = (sx | 0x007) + 1 ;
      return (sx >> 3) + ((256 - 64) * 0) ;
    } ;

  if (size <= 0x02000)
    {
      *rsize = (sx | 0x01F) + 1 ;
      return (sx >> 5) + ((256 - 64) * 1) ;
    } ;

  if (size <= 0x08000)
    {
      *rsize = (sx | 0x07F) + 1 ;
      return (sx >> 7) + ((256 - 64) * 2) ;
    } ;

  if (size <= 0x20000)
    {
      *rsize = (sx | 0x1FF) + 1 ;
      return (sx >> 9) + ((256 - 64) * 3) ;
    } ;

  *rsize = 0 ;
  return 64 + ((256 - 64) * 4) ;
} ;

Noting that this does sizes up to and including 0x800 with 8 byte units, etc. and returns "index" 0..831 for all the known sizes, and 832 for the out-size (not 1..832 and 0). In passing, I wonder if it is actually useful to know the rounded size ? To free the block you need the "index", and if you need the rounded size that could be looked up ? [Full disclosure: this also assumes that the incoming request size is not zero... which makes a tiny improvement in the timing !]
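
To illustrate that last thought, recovering the rounded size from the "index" is only a little arithmetic, so the free path need not carry the size around at all. A sketch (my own, hypothetical helper name, using the 0..831 indexing above; this would run on the free path, so the extra compare does not matter):

#include <stddef.h>

/* Hypothetical inverse of the mapping above: indexes 0..255 are 8 byte
   quanta, then each further run of 192 indexes multiplies the quantum by 4. */
static size_t
size_from_index(unsigned ix)
{
  unsigned tier  = (ix < 256) ? 0 : 1 + (ix - 256) / 192 ;  /* 0..3               */
  unsigned base  = 192 * tier ;                             /* 0, 192, 384, 576   */
  unsigned shift = 3 + 2 * tier ;                           /* 8..512 byte quanta */

  return ((size_t)(ix - base) + 1) << shift ;
}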

Anyway, the most general approach is to table drive the whole thing:

static uint
get_aligned_size_1(size_t size, size_t* rsize)
{
  static const uint tb[0x40] =
  {
    /* 0x00 */  (0x007 << 16) + ((256 - 64) * 0)  + 3,

    /* 0x01 */  (0x01F << 16) + ((256 - 64) * 1)  + 5,
    /* 0x02 */  (0x01F << 16) + ((256 - 64) * 1)  + 5,
    /* 0x03 */  (0x01F << 16) + ((256 - 64) * 1)  + 5,

    /* 0x04 */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x05 */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x06 */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x07 */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x08 */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x09 */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x0A */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x0B */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x0C */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x0D */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x0E */  (0x07F << 16) + ((256 - 64) * 2)  + 7,
    /* 0x0F */  (0x07F << 16) + ((256 - 64) * 2)  + 7,

    /* 0x10 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x11 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x12 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x13 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x14 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x15 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x16 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x17 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x18 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x19 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x1A */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x1B */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x1C */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x1D */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x1E */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x1F */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,

    /* 0x20 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x21 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x22 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x23 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x24 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x25 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x26 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x27 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x28 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x29 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x2A */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x2B */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x2C */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x2D */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x2E */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x2F */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,

    /* 0x30 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x31 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x32 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x33 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x34 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x35 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x36 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x37 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x38 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x39 */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x3A */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x3B */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x3C */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x3D */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x3E */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
    /* 0x3F */  (0x1FF << 16) + ((256 - 64) * 3)  + 9,
  } ;

  size_t sx = size - 1 ;

  if (size <= 0x20000)
    {
      uint tx ;

      tx = tb[sx >> 11] ;

      *rsize = (sx | (tx >> 16)) + 1 ;
      return (sx >> (tx & 0xF)) + (tx & 0xFFF0) ;
    } ;

  *rsize = 0 ;
  return 64 + ((256 - 64) * 4) ;
} ;

where the mask to round the size up, the "index" base and the shift required to create the "index" are all in the table... packed manually into a uint.
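
For example, the entry for the 2k..8k tier unpacks like this (the index bases 192, 384 and 576 are all multiples of 16, which is why the shift can hide in the bottom nibble):

    tx          = (0x01F << 16) + ((256 - 64) * 1) + 5 = 0x001F00C5
    tx >> 16    = 0x01F   -- mask to round the size up (32 byte quanta)
    tx & 0x000F = 5       -- shift used to form the "index"
    tx & 0xFFF0 = 192     -- "index" base for the tier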

I timed the two functions over a random selection of request sizes: 80% 1..0x800, 15% 0x801..0x2000, 4% 0x2001..0x8000, 0.9% 0x8001..0x20000 and 0.1% out-size:

Setup:     15.610 secs: user  15.580 system   0.000 -- 500 million
Branches:   1.910 secs: user   1.910 system   0.000
Table 1:    1.840 secs: user   1.830 system   0.000

The setup loop is:

  srand(314159) ;
  for (int i = 0 ; i < trial_count ; ++i)
    {
      int r ;
      size_t mx, mn ;

      r = rand() % 1000 ;
      if      (r < 800)
        {
          mn = 1 ;
          mx = 0x00800 ;
        }
      else if (r < 950)
        {
          mn = 0x00801 ;
          mx = 0x02000 ;
        }
      else if (r < 990)
        {
          mn = 0x02001 ;
          mx = 0x08000 ;
        }
      else if (r < 999)
        {
          mn = 0x08001 ;
          mx = 0x20000 ;
        }
      else
        {
          mn = 0x20001 ;
          mx = 0x80000 ;
        } ;

      test_values[i] = (rand() % (mx - mn + 1)) + mn ;
    } ;

Observe very carefully the time taken to set up 500 million test values. The table version runs ever so slightly faster... (gcc 4.8, -O3)... but this is a 4% improvement on a trivial function ! Mind you, the table driven version is more flexible, and more quanta can be added without changing the code or the run time.
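
The measured loop itself is not shown; assume it is something like the fragment below (my sketch, slotted into the same test program as the setup loop and needing <stdio.h>, accumulating the results so the compiler cannot throw the calls away):

  size_t sum = 0 ;
  for (int i = 0 ; i < trial_count ; ++i)
    {
      size_t rsize ;
      sum += get_aligned_size_1(test_values[i], &rsize) + rsize ;
    } ;
  printf("check: %zu\n", sum) ;   /* keep 'sum', and hence the calls, live */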

FWIW, the mask can (obviously) be constructed from the shift, so:

static uint
get_aligned_size_5(size_t size, size_t* rsize)
{
  static const uint8_t ts[0x40] =
  {
    /*                0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F */
    /* 0x00       */  3,
    /* 0x01..0x03 */     5, 5, 5,
    /* 0x04..0x0F */              7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
    /* 0x10..0x1F */  9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
    /* 0x20..0x2F */  9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
    /* 0x30..0x3F */  9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
  } ;

  static const uint tb[16] =
  {
    /*  0 */    0,
    /*  1 */    0,
    /*  2 */    0,
    /*  3 */    ((256 - 64) / 2) * (3 - 3),
    /*  4 */    0,
    /*  5 */    ((256 - 64) / 2) * (5 - 3),
    /*  6 */    0,
    /*  7 */    ((256 - 64) / 2) * (7 - 3),
    /*  8 */    0,
    /*  9 */    ((256 - 64) / 2) * (9 - 3),
    /* 10 */    0,
    /* 11 */    0,
    /* 12 */    0,
    /* 13 */    0,
    /* 14 */    0,
    /* 15 */    0,
  } ;

  size_t sx = size - 1 ;

  if (size <= 0x20000)
    {
      uint8_t  s ;

      s = ts[sx >> 11] ;
      *rsize = (sx | (((size_t)1 << s) - 1)) + 1 ;
      return (sx >> s) + tb[s] ;
    } ;

  *rsize = 0 ;
  return 64 + ((256 - 64) * 4) ;
} ;

Which I found was the fastest of the variations I tried:

Setup:     15.610 secs: user  15.580 system   0.000 -- 500 million
Branches:   1.910 secs: user   1.910 system   0.000
Table 1:    1.840 secs: user   1.830 system   0.000
Table 5:    1.750 secs: user   1.750 system   0.000

at a whole (read it and sigh) 8% faster :-(.

Ah well, I was bored.

  • Wow, I didn't think about a table. I can locate the table in the same segment as the code, so I won't have to take an extra TLB hit. If the table is very small, it will likely stay in L2 cache the entire execution because of its high hit count. This might be a very good candidate for a table lookup. – johnnycrash Jul 17 '14 at 18:16
  • We don't have to match the original function exactly. I'm thinking we make the table 32 entries, and use the msb as the index. There is a single instruction for msb. That would get rid of the < 0x20000 cmp. There is a compare later on in the code that looks for an index of zero. I can change that to index > max recycle lists. – johnnycrash Jul 17 '14 at 18:23
  • Well... the `<= 0x20000` comparison stops it reading off the end of the `ts[]` table. Since "out-size" allocations are (one is assuming) rare as hens' teeth, the branch predictor will have it for breakfast. I did try the `__builtin_clz()`, but that ran slower at 2.4 secs. I have to say I think the improvement is so small that it will be hard to measure -- but if you want the greater flexibility of the table, then at least it doesn't make things worse. –  Jul 17 '14 at 20:46
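
For completeness, the clz/MSB variant discussed in these last two comments might look something like the sketch below (my reconstruction, not the answer's actual code; `__builtin_clzll` is the GCC/Clang builtin, MSVC would use `_BitScanReverse64`; as noted in the comment above, this timed slower in practice):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical MSB-indexed variant: index the shift table by the position
   of the top set bit of (size - 1) instead of by (size - 1) >> 11.        */
static unsigned
get_aligned_size_msb(size_t size, size_t* rsize)
{
  /* shift per top-bit position: bits 0..10 -> 8 byte quanta,
     bits 11..12 -> 32, bits 13..14 -> 128, bits 15..16 -> 512             */
  static const uint8_t ts[17] = { 3,3,3,3,3,3,3,3,3,3,3, 5,5, 7,7, 9,9 } ;

  size_t sx = size - 1 ;                       /* assumes size != 0        */

  if (size <= 0x20000)                         /* out-size check still needed */
    {
      unsigned s = ts[63 - __builtin_clzll((unsigned long long)sx | 1)] ;

      *rsize = (sx | (((size_t)1 << s) - 1)) + 1 ;
      return (unsigned)(sx >> s) + ((256 - 64) / 2) * (s - 3) ;
    } ;

  *rsize = 0 ;
  return 64 + ((256 - 64) * 4) ;
}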