Why does removing a class member make code significantly slower?

Question

I have a class which is frequently copied in a performance-critical section of code. The class has a string member variable which may be set multiple times during an object's lifetime but ultimately isn't used for anything meaningful. For obvious reasons I would like to remove this variable.

class MyClass {
public:
  //Constructor
  ...
  //Methods
  ...
  //Properties
  double usefulMember;
  int usefulMember2;
  std::string uselessString;
};

However, I have found that if I remove uselessString then my code runs ~15% slower. Can somebody explain why this would be? If anything I would think that getting rid of the need to copy this string over and over would improve performance.

One thing I noticed is that with the uselessString variable sizeof(MyClass) is 448 (divisible by 64). Could this have to do with how MyClass aligns with the CPU cache? I tried declaring the class with alignas(64) but it didn't help.

Edit: I ended up using the Intel TBB malloc_proxy to implement scalable memory allocation across my whole program. I don't know exactly what the issue was before but things are now running faster than ever before.

Are you timing a release mode / optimized mode executable and not using debug mode? — drescherjm, Mar 22 '22 at 19:03
`Could this have to do with how MyClass aligns with the CPU cache?` Quite possibly. Incidental changes in memory layout can have a huge effect on performance. (I recommend watching this talk, for anyone who intends to do any benchmarking: https://www.youtube.com/watch?v=koTf7u0v41o) — eerorika, Mar 22 '22 at 19:04
@drescherjm, yes, I am looking at the timing for an executable built in release mode. — nickexists, Mar 22 '22 at 19:05
Does your program use multithreading? If the answer is yes, then it is possible that reducing the size of `MyClass` will compact the memory in such a way that this object now shares a [cache line](https://en.wikipedia.org/wiki/CPU_cache#CACHE-LINES) with data that is accessed by another thread. See [false sharing](https://en.wikipedia.org/wiki/False_sharing) for more information. To prevent false sharing, you should ensure that two threads do not access the same cache line, unless both threads are read-only. — Andreas Wenzel, Mar 22 '22 at 19:21
Another possibility is that reducing the size of `MyClass` will trigger a [critical stride problem](https://stackoverflow.com/q/12264970/12149471), whereas the increased size of `MyClass` prevents this problem from occurring. Without a [mre], I am unable to tell whether this could be a problem. — Andreas Wenzel, Mar 22 '22 at 19:30
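A rough way to see what a critical stride means is to model which cache set each object start maps to. The sketch below is illustrative only: the geometry (64-byte lines, 64 sets, as in a 32 KiB 8-way cache) is an assumption, `distinct_sets_touched` is a hypothetical helper, and real hardware indexing may differ.

```cpp
#include <cstddef>
#include <set>

// Assumed cache geometry, for illustration only:
// 64-byte lines and 64 sets (e.g. a 32 KiB, 8-way set-associative cache).
constexpr std::size_t kLineBytes = 64;
constexpr std::size_t kNumSets = 64;

// Hypothetical helper: counts how many distinct cache sets are hit by the
// first byte of `count` objects placed `stride_bytes` apart, starting at 0.
std::size_t distinct_sets_touched(std::size_t stride_bytes, std::size_t count) {
    std::set<std::size_t> sets;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t line_index = i * stride_bytes / kLineBytes;
        sets.insert(line_index % kNumSets);
    }
    return sets.size();
}
```

In this simplified model a stride of 4096 bytes sends every object to the same set, while a stride of 448 bytes spreads 64 objects over all 64 sets. Whether anything like this applies to the code in the question cannot be determined without a reproducible example.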
@AndreasWenzel, Yes, the code is multithreaded (using tbb::parallel_reduce). However, there shouldn't be any data shared between threads; each thread has its own local std::vector. Is it still possible for false sharing to occur in this case? — nickexists, Mar 22 '22 at 19:35
@nickexists: In your previous comment, you seem to be talking about true sharing. However, I am talking about [false sharing](https://en.wikipedia.org/wiki/False_sharing), which happens when no actual data is shared, but the data accessed by two separate threads resides on the same cache line. The typical size of a cache line is 64 bytes. A `std::vector` has data in the object itself and also in external, dynamically allocated storage. Either part of this memory may share a cache line with data accessed by another thread, especially if both `std::vector` objects use the same allocator. — Andreas Wenzel, Mar 22 '22 at 19:46
@nickexists: I suggest that you inspect the memory address of the `std::vector` objects themselves (by using the `&` address-of operator) and also the addresses of the external data (by using [`std::vector::data()`](https://en.cppreference.com/w/cpp/container/vector/data)). If these memory addresses from the different threads are so close to each other that it is possible that two threads are accessing the same 64-byte cache line, then it is possible that your problem is caused by false sharing. — Andreas Wenzel, Mar 22 '22 at 19:49
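The address check described above could be sketched roughly as follows. This assumes a 64-byte cache line (typical on x86-64 but not guaranteed); `cache_line_of`, `ranges_share_cache_line`, and `report_vector` are hypothetical helper names, not part of any library.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Assumed cache line size; adjust for the target CPU.
constexpr std::uintptr_t kCacheLine = 64;

// Index of the cache line that contains address p.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// True if the byte ranges [a, a + a_bytes) and [b, b + b_bytes)
// touch at least one common cache line.
inline bool ranges_share_cache_line(const void* a, std::size_t a_bytes,
                                    const void* b, std::size_t b_bytes) {
    if (a_bytes == 0 || b_bytes == 0) return false;
    const auto a_first = cache_line_of(a);
    const auto a_last = cache_line_of(static_cast<const char*>(a) + a_bytes - 1);
    const auto b_first = cache_line_of(b);
    const auto b_last = cache_line_of(static_cast<const char*>(b) + b_bytes - 1);
    return a_first <= b_last && b_first <= a_last;
}

// Prints where a vector object and its heap buffer live, so that the
// output from different threads can be compared.
template <typename T>
void report_vector(const char* label, const std::vector<T>& v) {
    std::printf("%s: object at %p (cache line %llu), data at %p (cache line %llu)\n",
                label,
                static_cast<const void*>(&v),
                static_cast<unsigned long long>(cache_line_of(&v)),
                static_cast<const void*>(v.data()),
                static_cast<unsigned long long>(cache_line_of(v.data())));
}
```

Calling `report_vector` from each worker thread and comparing the printed cache line indices shows whether two threads' vectors are close enough to share a line. `ranges_share_cache_line` can compare the buffers directly when given each buffer's size in bytes.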
@AndreasWenzel, Thanks so much for the help. I tested the timing when the code is run in a single thread and saw the expected behavior: the version that doesn't have a `std::string` runs faster, so I think you are probably right that this is an issue with false sharing. Now the question becomes how I can prevent this. I've tried using `class alignas(64) MyClass` and I've also tried using the aligned allocator [here](https://stackoverflow.com/questions/60169819/modern-approach-to-making-stdvector-allocate-aligned-memory) for my `vector`, but it hasn't helped. — nickexists, Mar 22 '22 at 22:28
@nickexists: I suggest that you take a look at the memory addresses as described in my previous comment. This should allow you to calculate whether there is indeed false sharing occurring. As pointed out in one of my previous comments, it could also be [a critical stride problem](https://stackoverflow.com/q/12264970/12149471). Or are there maybe other variables besides the `std::vector` that are being truly or falsely shared between multiple threads with at least one thread writing? (Read-only shared data is not a problem for performance.) — Andreas Wenzel, Mar 23 '22 at 00:06
It could also be a completely different problem, which has nothing to do with false sharing or critical stride, which just happens to be triggered when you reduce the size of the class. For this reason, it would be ideal if you could provide a [mre] of the problem. This would also allow your question to be reopened. — Andreas Wenzel, Mar 23 '22 at 00:12
@nickexists: False sharing will not occur for objects on the [stack](https://en.wikipedia.org/wiki/Call_stack) of an individual thread (i.e. objects with automatic [storage duration](https://en.cppreference.com/w/cpp/language/storage_duration)), unless that thread gives another thread access to its stack, for example by passing a pointer to an object on its own stack to another thread. That is because each thread's stack is in a completely different region of a process's [virtual address space](https://en.wikipedia.org/wiki/Virtual_address_space). — Andreas Wenzel, Mar 23 '22 at 00:47
@nickexists: Even if both the stack memory and the heap memory that belong to the `std::vector` objects are aligned to 64 bytes, it is possible that this memory shares cache lines with other data that is not aligned to 64 bytes and is accessed by a different thread. This can cause false sharing. Therefore, it is important to inspect the addresses of all variables that may reside in cache lines accessed by several threads, unless you are sure that these accesses are strictly read-only for all threads. — Andreas Wenzel, Mar 23 '22 at 01:17
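One way to keep unrelated data out of a thread's cache lines is to make both the start address and the size a multiple of the line size. The sketch below assumes a 64-byte cache line; `PerThreadState` and `round_up_to_cache_line` are hypothetical names used only for illustration.

```cpp
#include <cstddef>

// Assumed cache line size; adjust for the target CPU.
constexpr std::size_t kCacheLine = 64;

// Rounds a byte count up to the next multiple of the cache line size, so an
// aligned buffer of that size ends exactly on a cache line boundary.
constexpr std::size_t round_up_to_cache_line(std::size_t bytes) {
    return (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
}

// Hypothetical per-thread state. alignas makes the object start on a cache
// line boundary, and the compiler pads sizeof to a multiple of the alignment,
// so no other object can occupy the same line.
struct alignas(kCacheLine) PerThreadState {
    double partial_sum = 0.0;
    int count = 0;
};

static_assert(alignof(PerThreadState) == kCacheLine,
              "object must start on a cache line boundary");
static_assert(sizeof(PerThreadState) % kCacheLine == 0,
              "object must fill whole cache lines");
```

An aligned heap buffer whose size is not a multiple of 64 can still share its last cache line with a neighboring allocation, which is why the size is rounded up as well. Note that storing over-aligned types in a `std::vector` with the default allocator relies on C++17 aligned allocation.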
