I am trying to measure the performance of concurrent insertion into a folly
hash map. A simplified version of the program that performs such insertion is shown below:
#include <folly/concurrency/ConcurrentHashMap.h>

#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

const int kNumMutexLocks = 2003;
std::unique_ptr<std::mutex[]> mutices(new std::mutex[kNumMutexLocks]);

__inline__ void
concurrentInsertion(unsigned int threadId, unsigned int numInsertionsPerThread,
                    unsigned int numInsertions, unsigned int numUniqueKeys,
                    folly::ConcurrentHashMap<int, int> &follyMap) {
  unsigned int base = threadId * numInsertionsPerThread;
  for (unsigned int i = 0; i < numInsertionsPerThread; i++) {
    unsigned int idx = base + i;
    if (idx >= numInsertions)
      break;
    int val = idx;
    int key = val % numUniqueKeys;
    mutices[key % kNumMutexLocks].lock();
    auto found = follyMap.find(key);
    if (found != follyMap.end()) {
      int oldVal = found->second;
      if (oldVal < val) {
        follyMap.assign(key, val);
      }
    } else {
      follyMap.insert(key, val);
    }
    mutices[key % kNumMutexLocks].unlock();
  }
}

void func(unsigned int numInsertions, float keyValRatio) {
  const unsigned int numThreads = 12; // Simplified just for this post
  unsigned int numUniqueKeys = numInsertions * keyValRatio;
  unsigned int numInsertionsPerThread =
      ceil(numInsertions * 1.0 / numThreads);

  std::vector<std::thread> insertionThreads;
  insertionThreads.reserve(numThreads);

  folly::ConcurrentHashMap<int, int> follyMap;

  auto start = std::chrono::steady_clock::now();
  for (unsigned int i = 0; i < numThreads; i++) {
    insertionThreads.emplace_back([&, i] {
      concurrentInsertion(i, numInsertionsPerThread, numInsertions,
                          numUniqueKeys, follyMap);
    });
  }
  for (unsigned int i = 0; i < numThreads; i++) {
    insertionThreads[i].join();
  }
  auto end = std::chrono::steady_clock::now();
  auto diff = end - start;
  float insertionTimeMs =
      std::chrono::duration<double, std::milli>(diff).count();

  std::cout << "i: " << numInsertions << "\tj: " << keyValRatio
            << "\ttime: " << insertionTimeMs << std::endl;
}

int main() {
  std::vector<float> js = {0.5, 0.25};
  for (auto j : js) {
    std::cout << "-------------" << std::endl;
    for (int i = 2048; i < 4194304 * 8; i *= 2) {
      func(i, j);
    }
  }
}
The problem is that calling func from this loop in main suddenly increases the measured time. If I instead call the function directly from main, without any loop (as shown below), the measured time is, for some cases, more than 100X smaller.
int main() {
  func(2048, 0.25); // ~100X faster now that the loop is gone
}
Possible Reasons
- I allocate a huge amount of memory while building the hash map. I suspect that when I run the code in a loop, the second iteration executes while the allocator is still busy freeing the memory from the first iteration, which would make the program much slower. If this is the case, I'd be grateful if someone could suggest a change that lets me get the same results with the loop.
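One change I'm considering to test this hypothesis is to time insertion and destruction as separate regions, so that freeing one run's memory can never be billed to the next run. Here is a sketch of the idea; it uses std::unordered_map as a stand-in for folly::ConcurrentHashMap (my assumption being that the measurement pattern is independent of the container):

```cpp
#include <chrono>
#include <memory>
#include <unordered_map>
#include <utility>

// Returns {insertion time, destruction time} in milliseconds, measured
// separately. Holding the map behind a unique_ptr lets us destroy it at a
// chosen point (map.reset()) instead of at end of scope, so destruction
// cost is captured in its own timed region.
std::pair<double, double> timedRun(unsigned int numInsertions) {
  auto map = std::make_unique<std::unordered_map<int, int>>();

  auto start = std::chrono::steady_clock::now();
  for (unsigned int i = 0; i < numInsertions; i++) {
    map->emplace(i, i);
  }
  auto mid = std::chrono::steady_clock::now();

  map.reset(); // free all of the map's memory here, inside the second region
  auto end = std::chrono::steady_clock::now();

  return {std::chrono::duration<double, std::milli>(mid - start).count(),
          std::chrono::duration<double, std::milli>(end - mid).count()};
}
```

If the destruction times turn out to be large and growing with the map size, that would support the "freeing memory from the previous iteration" explanation.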
More Details
Please note that unrolling the loop in main does not help; the following program has the same problem:
int main() {
  performComputation(input A);
  ...
  performComputation(input Z);
}
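Another workaround I'm considering (a sketch, not something I have verified) is to run each measurement in its own forked process, so that heap and allocator state from one run cannot leak into the next. runOnce below is a hypothetical stand-in for func(i, j):

```cpp
#include <sys/wait.h>
#include <unistd.h>

#include <vector>

// Stand-in for the real workload (func(i, j)): allocate and touch memory.
void runOnce(int n) {
  std::vector<int> v(n, 1);
  volatile int sink = v[0]; // keep the work from being optimized away
  (void)sink;
}

// Run the workload in a forked child; the parent waits for it to exit.
// All memory the child allocated is reclaimed by the OS when it exits,
// so it cannot affect later measurements in the parent.
int runIsolated(int n) {
  pid_t pid = fork();
  if (pid == 0) {   // child: do the measurement, then exit immediately
    runOnce(n);
    _exit(0);
  }
  int status = 0;
  waitpid(pid, &status, 0); // parent: block until the child finishes
  return status;            // 0 on a clean exit
}
```

This relies on POSIX fork/waitpid, so it is Linux/macOS-only, and each child would need to print or otherwise report its own timing.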
Sample Output
The output of the first program is shown here:
i: 2048 j: 0.5 time: 1.39932
...
i: 16777216 j: 0.5 time: 3704.33
-------------
i: 2048 j: 0.25 time: 277.427 <= sudden increase in execution time
i: 4096 j: 0.25 time: 157.236
i: 8192 j: 0.25 time: 50.7963
i: 16384 j: 0.25 time: 133.151
i: 32768 j: 0.25 time: 8.75953
...
i: 2048 j: 0.25 time: 162.663
Running func alone in main with i=2048 and j=0.25 yields:
i: 2048 j: 0.25 time: 1.01
Any comment/insight is highly appreciated.