Need multi-threading memory manager

Question

I will have to create a multi-threading project soon I have seen experiments ( delphitools.info/2011/10/13/memory-manager-investigations ) showing that the default Delphi memory manager has problems with multi-threading.

enter image description here

So, I have found this SynScaleMM. Anybody can give some feedback on it or on a similar memory manager?

Thanks

Could you please provide a citation for where you "keep hearing" that? You shouldn't make design decisions based on rumor and hearsay. — Rob Kennedy, May 20 '11 at 13:09
Which version of Delphi are you on? Have you moved to a Delphi based on a modern FastMM, or are you still on the old Borland MM? — David Heffernan, May 20 '11 at 13:11
@Altar That's BigBrain not SynScaleMM and those graphs relate to a very old version of Delphi using the Borland MM rather than FastMM. — David Heffernan, May 20 '11 at 13:31
Hi Rob, one of them is right on SynScaleMM web site. Also see this: digitaltundra.com/bigbrain.php — Gabriel, May 20 '11 at 13:36
I've heard people say that the world will end tomorrow; that doesn't make it so. As @Rob said, you shouldn't make major decisions (of which a memory manager is definitely one) based on something you "keep hearing". FastMM4 performs pretty well in multi-threaded apps unless you're doing something really intensive; if that were the case, you'd have specific reasons you wanted to change. — Ken White, May 20 '11 at 13:37
For example, I posted a bug in QC where FastMM can even deadlock under heavy concurrent MT. — , May 20 '11 at 14:51
I'll say this: good architecture is worth 100 times the benefit of swapping memory managers, unless the memory manager is very bad (and FastMM is pretty good). It you use messaging between threads then you reduce contention down to a level where this is just not a significant issue. If you need to swap memory managers (and unless you are seriously maxing out a machine with 10+ cores) then I suggest it the architecture thay needs changing, not the memory manager. — Misha, May 21 '11 at 02:55
@David: QC76832. It is still in the Need Feedback state because it happens in a complex application for which I can't submit the code. Had never time to build a test case. There's also a thread about it in FastMM sourceforge forum. — , May 21 '11 at 09:44
"and unless you are seriously maxing out a machine with 10+ cores" - - - That's my case. — Gabriel, May 30 '11 at 11:52
@RobKennedy - here is one example: https://www.delphitools.info/2011/10/13/memory-manager-investigations/ — Gabriel, Jun 01 '15 at 11:44

Arnaud Bouchez · Accepted Answer · 2013-12-05T20:44:49.207

Our SynScaleMM is still experimental.

EDIT: Take a look at the more stable ScaleMM2 and the brand new SAPMM. But my remarks below are still worth following: the less allocation you do, the better you scale!

But it worked as expected in a multi-threaded server environment. Scaling is much better than FastMM4, for some critical tests.

But the Memory Manager is perhaps not the bigger bottleneck in Multi-Threaded applications. FastMM4 could work well, if you don't stress it.

Here are some (not dogmatic, just from experiment and knowledge of low-level Delphi RTL) advice if you want to write FAST multi-threaded application in Delphi:

Always use const for string or dynamic array parameters like in MyFunc(const aString: String) to avoid allocating a temporary string per each call;
Avoid using string concatenation (s := s+'Blabla'+IntToStr(i)) , but rely on a buffered writing such as TStringBuilder available in latest versions of Delphi;
TStringBuilder is not perfect either: for instance, it will create a lot of temporary strings for appending some numerical data, and will use the awfully slow SysUtils.IntToStr() function when you add some integer value - I had to rewrite a lot of low-level functions to avoid most string allocation in our TTextWriter class as defined in SynCommons.pas;
Don't abuse on critical sections, let them be as small as possible, but rely on some atomic modifiers if you need some concurrent access - see e.g. InterlockedIncrement / InterlockedExchangeAdd;
InterlockedExchange (from SysUtils.pas) is a good way of updating a buffer or a shared object. You create an updated version of of some content in your thread, then you exchange a shared pointer to the data (e.g. a TObject instance) in one low-level CPU operation. It will notify the change to the other threads, with very good multi-thread scaling. You'll have to take care of the data integrity, but it works very well in practice.
Don't share data between threads, but rather make your own private copy or rely on some read-only buffers (the RCU pattern is the better for scaling);
Don't use indexed access to string characters, but rely on some optimized functions like PosEx() for instance;
Don't mix AnsiString/UnicodeString kind of variables/functions, and check the generated asm code via Alt-F2 to track any hidden unwanted conversion (e.g. call UStrFromPCharLen);
Rather use var parameters in a procedure instead of function returning a string (a function returning a string will add an UStrAsg/LStrAsg call which has a LOCK which will flush all CPU cores);
If you can, for your data or text parsing, use pointers and some static stack-allocated buffers instead of temporary strings or dynamic arrays;
Don't create a TMemoryStream each time you need one, but rely on a private instance in your class, already sized in enough memory, in which you will write data using Position to retrieve the end of data and not changing its Size (which will be the memory block allocated by the MM);
Limit the number of class instances you create: try to reuse the same instance, and if you can, use some record/object pointers on already allocated memory buffers, mapping the data without copying it into temporary memory;
Always use test-driven development, with dedicated multi-threaded test, trying to reach the worse-case limit (increase number of threads, data content, add some incoherent data, pause at random, try to stress network or disk access, benchmark with timing on real data...);
Never trust your instinct, but use accurate timing on real data and process.

I tried to follow those rules in our Open Source framework, and if you take a look at our code, you'll find out a lot of real-world sample code.

+1 Most of this list of good advice could be summarised as "don't use the heap if at all possible" — David Heffernan, May 20 '11 at 19:02
@David ... As you stated yourself in your answer! I just wanted to make it more clear, with precise workaround tricks and ideas. — Arnaud Bouchez, May 20 '11 at 19:03
Interesting advice, but some of it seems to be prioritising speed at the cost of maintainability (for example, functions returning strings is a much more "natural" way of writing code than procedures with var parameters). So I'd also add to the advice "Don't prematurely optimise". Only make some of these changes if you really need the speed. — Jonathan Morgan, May 23 '11 at 01:56
@Jonathan You're perfectly right: that was the reason of my last two advices (first benchmark and profile). But if you want your multi-threaded application to scale well with FastMM4 and the current reference count implementation (i.e. the asm LOCK), you'll have in all cases to get rid of (temporary) string allocation in loops. — Arnaud Bouchez, May 23 '11 at 05:22
How about adding to your list - always use ShortString with a defined max length when possible? (If you know your first name field is a max of 40 characters, why not use a 40-character ShortString instead of String?) — Darian Miller, Sep 28 '11 at 19:26
@Darian ShortString is an AnsiString, and will be converted into a plain `String` before using any method of the VCL. So you'll have even more memory allocations here. And since Delphi 2009, you'll loose Unicode capabilitieS. ShortString can be handy in some cases (for handling numerical data or code-level identifiers), but you will have to use only ShortString methods to avoid all those hidden conversions to `string`. So IMHO this is not a general rule to advice here - it may slow down your app. — Arnaud Bouchez, Sep 29 '11 at 05:38
@DarianMiller Yes it is still experimental, since we do not use it on production. In mORMot, we tried to avoid most memory allocation/reallocation, so FastMM4 is enough [to have very good scaling performance](http://robertocschneiders.wordpress.com/2012/11/22/datasnap-analysis-based-on-speed-stability-tests/), without the overhead of a per-thread heap like SynScaleMM. Bu we found it to be useful and stable enough to be included in some existing projects where memory allocations are highly used. — Arnaud Bouchez, Nov 27 '12 at 09:23

David Heffernan · Answer 2 · 2011-05-20T15:08:23.050

If your app can accommodate GPL licensed code, then I'd recommend Hoard. You'll have to write your own wrapper to it but that is very easy. In my tests, I found nothing that matched this code. If your code cannot accommodate the GPL then you can obtain a commercial licence of Hoard, for a significant fee.

Even if you can't use Hoard in an external release of your code you could compare its performance with that of FastMM to determine whether or not your app has problems with heap allocation scalability.

I have also found that the memory allocators in the versions of msvcrt.dll distributed with Windows Vista and later scale quite well under thread contention, certainly much better than FastMM does. I use these routines via the following Delphi MM.

unit msvcrtMM;

interface

implementation

type
  size_t = Cardinal;

const
  msvcrtDLL = 'msvcrt.dll';

function malloc(Size: size_t): Pointer; cdecl; external msvcrtDLL;
function realloc(P: Pointer; Size: size_t): Pointer; cdecl; external msvcrtDLL;
procedure free(P: Pointer); cdecl; external msvcrtDLL;

function GetMem(Size: Integer): Pointer;
begin
  Result := malloc(size);
end;

function FreeMem(P: Pointer): Integer;
begin
  free(P);
  Result := 0;
end;

function ReallocMem(P: Pointer; Size: Integer): Pointer;
begin
  Result := realloc(P, Size);
end;

function AllocMem(Size: Cardinal): Pointer;
begin
  Result := GetMem(Size);
  if Assigned(Result) then begin
    FillChar(Result^, Size, 0);
  end;
end;

function RegisterUnregisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
  Result := False;
end;

const
  MemoryManager: TMemoryManagerEx = (
    GetMem: GetMem;
    FreeMem: FreeMem;
    ReallocMem: ReallocMem;
    AllocMem: AllocMem;
    RegisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak;
    UnregisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak
  );

initialization
  SetMemoryManager(MemoryManager);

end.

It is worth pointing out that your app has to be hammering the heap allocator quite hard before thread contention in FastMM becomes a hindrance to performance. Typically in my experience this happens when your app does a lot of string processing.

My main piece of advice for anyone suffering from thread contention on heap allocation is to re-work the code to avoid hitting the heap. Not only do you avoid the contention, but you also avoid the expense of heap allocation – a classic twofer!

Hoard offers a commercial license only. Not cheap, but it allows for non GPL applications. — , May 20 '11 at 14:56
@Idsandon I've updated the question to expand on Hoard's licence. — David Heffernan, May 20 '11 at 15:08
@mca64 I recommend whatever is best for your application. MM perf is very much app dependent. Try the possible options, and see what works best for you. — David Heffernan, Oct 05 '14 at 17:24
@DavidHeffernan From my testing ScaleMM2 is 4x faster than TBB and MSVCRT. Multi threaded string concatenation. — user15124, Nov 04 '15 at 17:10

Maxim Masiutin · Answer 3 · 2017-07-15T18:51:28.913

It is locking that makes the difference!

There are two issues to be aware of:

Use of the LOCK prefix by the Delphi itself (System.dcu);
How does FastMM4 handles thread contention and what it does after it failed to acquire a lock.

Use of the `LOCK` prefix by the Delphi itself

Borland Delphi 5, released in 1999, was the one that introduced the lock prefix in string operations. As you know, when you assign one string to another, it does not copy the whole string but merely increases the reference counter inside the string. If you modify the string, it is de-references, decreasing the reference counter and allocating separate space for the modified string.

In Delphi 4 and earlier, the operations to increase and decrease the reference counter were normal memory operations. The programmers that have used Delphi knew about and, and, if they were using strings across threads, i.e. pass a string from one thread to another, have used their own locking mechanism only for the relevant strings. Programmers did also use read-only string copy that did not modify in any way the source string and did not require locking, for example:

function AssignStringThreadSafe(const Src: string): string;
var
  L: Integer;
begin
  L := Length(Src);
  if L <= 0 then Result := '' else
  begin
    SetString(Result, nil, L);
    Move(PChar(Src)^, PChar(Result)^, L*SizeOf(Src[1]));
  end;
end;

But in Delphi 5, Borland have added the LOCK prefix to the string operations and they became very slow, compared to Delphi 4, even for single-threaded applications.

To overcome this slowness, programmers became to use "single threaded" SYSTEM.PAS patch files with lock's commented.

Please see https://synopse.info/forum/viewtopic.php?id=57&p=1 for more information.

FastMM4 Thread Contention

You can modify FastMM4 source code for a better locking mechanism, or use any existing FastMM4 fork, for example https://github.com/maximmasiutin/FastMM4

FastMM4 is not the fastest one for multicore operation, especially when the number of threads is more than the number of physical sockets is because it, by default, on thread contention (i.e. when one thread cannot acquire access to data, locked by another thread) calls Windows API function Sleep(0), and then, if the lock is still not available enters a loop by calling Sleep(1) after each check of the lock.

Each call to Sleep(0) experiences the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles. As about Sleep(1) – besides the costs associated with Sleep(0) – it also delays execution by at least 1 millisecond, ceding control to other threads, and, if there are no threads waiting to be executed by a physical CPU core, puts the core into sleep, effectively reducing CPU usage and power consumption.

That’s why, on multithreded wotk with FastMM, CPU use never reached 100% - because of the Sleep(1) issued by FastMM4. This way of acquiring locks is not optimal. A better way would have been a spin-lock of about 5000 pause instructions, and, if the lock was still busy, calling SwitchToThread() API call. If pause is not available (on very old processors with no SSE2 support) or SwitchToThread() API call was not available (on very old Windows versions, prior to Windows 2000), the best solution would be to utilize EnterCriticalSection/LeaveCriticalSection, that don’t have latency associated by Sleep(1), and which also very effectively cedes control of the CPU core to other threads.

The fork that I've mentioned uses a new approach to waiting for a lock, recommended by Intel in its Optimization Manual for developers - a spinloop of pause + SwitchToThread(), and, if any of these are not available: CriticalSections instead of Sleep(). With these options, the Sleep() will never be used but EnterCriticalSection/LeaveCriticalSection will be used instead. Testing has shown that the approach of using CriticalSections instead of Sleep (which was used by default before in FastMM4) provides significant gain in situations when the number of threads working with the memory manager is the same or higher than the number of physical cores. The gain is even more evident on computers with multiple physical CPUs and Non-Uniform Memory Access (NUMA). I have implemented compile-time options to take away the original FastMM4 approach of using Sleep(InitialSleepTime) and then Sleep(AdditionalSleepTime) (or Sleep(0) and Sleep(1)) and replace them with EnterCriticalSection/LeaveCriticalSection to save valuable CPU cycles wasted by Sleep(0) and to improve speed (reduce latency) that was affected each time by at least 1 millisecond by Sleep(1), because the Critical Sections are much more CPU-friendly and have definitely lower latency than Sleep(1).

When these options are enabled, FastMM4-AVX it checks: (1) whether the CPU supports SSE2 and thus the "pause" instruction, and (2) whether the operating system has the SwitchToThread() API call, and, if both conditions are met, uses "pause" spin-loop for 5000 iterations and then SwitchToThread() instead of critical sections; If a CPU doesn't have the "pause" instrcution or Windows doesn't have the SwitchToThread() API function, it will use EnterCriticalSection/LeaveCriticalSection.

You can see the test results, including made on a computer with multiple physical CPUs (sockets) in that fork.

See also the Long Duration Spin-wait Loops on Hyper-Threading Technology Enabled Intel Processors article. Here is what Intel writes about this issue - and it applies to FastMM4 very well:

The long duration spin-wait loop in this threading model seldom causes a performance problem on conventional multiprocessor systems. But it may introduce a severe penalty on a system with Hyper-Threading Technology because processor resources can be consumed by the master thread while it is waiting on the worker threads. Sleep(0) in the loop may suspend the execution of the master thread, but only when all available processors have been taken by worker threads during the entire waiting period. This condition requires all worker threads to complete their work at the same time. In other words, the workloads assigned to worker threads must be balanced. If one of the worker threads completes its work sooner than others and releases the processor, the master thread can still run on one processor.

On a conventional multiprocessor system this doesn't cause performance problems because no other thread uses the processor. But on a system with Hyper-Threading Technology the processor the master thread runs on is a logical one that shares processor resources with one of the other worker threads.

The nature of many applications makes it difficult to guarantee that workloads assigned to worker threads are balanced. A multithreaded 3D application, for example, may assign the tasks for transformation of a block of vertices from world coordinates to viewing coordinates to a team of worker threads. The amount of work for a worker thread is determined not only by the number of vertices but also by the clipped status of the vertex, which is not predictable when the master thread divides the workload for working threads.

A non-zero argument in the Sleep function forces the waiting thread to sleep N milliseconds, regardless of the processor availability. It may effectively block the waiting thread from consuming processor resources if the waiting period is set properly. But if the waiting period is unpredictable from workload to workload, then a large value of N may make the waiting thread sleep too long, and a smaller value of N may cause it to wake up too quickly.

Therefore the preferred solution to avoid wasting processor resources in a long duration spin-wait loop is to replace the loop with an operating system thread-blocking API, such as the Microsoft Windows* threading API, WaitForMultipleObjects. This call causes the operating system to block the waiting thread from consuming processor resources.

It refers to Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor application note.

You can also find a very good spin-loop implementation here at stackoverflow.

It also loads normal loads just to check before issuing a lock-ed store, just to not flood the CPU with locked operations in a loop, that would lock the bus.

FastMM4 per se is very good. Just improve the locking and you will get an excelling multi-threaded memory manager.

Please also be aware that each small block type is locked separately in FastMM4.

You can put padding between the small block control areas, to make each area have own cache line, not shared with other block sizes, and to make sure it begins at a cache line size boundary. You can use CPUID to determine the size of the CPU cache line.

So, with locking correctly implemented to suit your needs (i.e. whether you need NUMA or not, whether to use lock-ing releases, etc., you may obtain the results that the memory allocation routines would be several times faster and would not suffer so severely from thread contention.

score 2 · Answer 4 · answered May 20 '11 at 13:21

2

FastMM deals with multi-threading just fine. It is the default memory manager for Delphi 2006 and up.

If you are using an older version of Delphi (Delphi 5 and up), you can still use FastMM. It's available on SourceForge.

answered May 20 '11 at 13:21

Bruce McGee

15,076
6
55
70

2

In fact, FastMM scales rather poorly under heavy thread contention. – David Heffernan May 20 '11 at 13:22
Only under a specific type of work load? I'd be curious to see how all alternatives compare in the FastCode memory manager challenge. – Bruce McGee May 20 '11 at 13:26
1

@Bruce Under any workload as far as I know. FastMM is simply not scalable. That's why you find the existence of lots of scalable MMs. – David Heffernan May 20 '11 at 13:29
1

FWIW, I have an app doing lots of transfers and using about 400 threads, and FastMM handles it fine. I think that using a known and well tested MM is more important than a potential improvement and potential pitfalls of a fault when multi-threading. – mj2008 May 20 '11 at 13:38
@mj2008 Looks like you don't have significant contention on your heap allocation! – David Heffernan May 20 '11 at 13:40
FastMM only "scales" (a bit) with random sizes. With similar sizes only core is used: I have a quad core here, doesn't get above 25% :-( – André May 20 '11 at 13:52
Unless it has 400 cores it may not see much contention - some issue may arise only when threads are run actually concurrently, not while they wait to be executed. – May 20 '11 at 14:54
I've also had good luck with FastMM in many (probably too many) threads. This isn't a religious issue. If FastMM falls down or is otherwise unsuitable on multiple threads there should be some kind of clear evidence. Any links would be appreciated. – Bruce McGee May 20 '11 at 15:14
@Bruce see the benchmarking here for example: http://blog.synopse.info/post/2010/12/04/SynScaleMM – David Heffernan May 20 '11 at 15:48
@David, Looks like a selective benchmark to me. I'd need something a little more comprehensive to be convinced. Anyone have any interest in resurrecting the FastCode memory manager challenge? – Bruce McGee May 20 '11 at 19:24
@Bruce I'm sure that benchmark is very artificial. Switching from FastMM to msvcrt.dll gave my app a 5-10% boost in typical usage on an 8 core machine. But I gained more by giving up heap allocation. – David Heffernan May 20 '11 at 19:27
@David, What was your work load? It might be worth adding to the memory manager tests. – Bruce McGee May 21 '11 at 01:01
1

I use FastMM for highly multi-threaded servers (with 100+ concurrent threads) on dual quad and hex core machines. One particular server has been up for 6 months and processed over 1,000,000,000 internal messages. I have never had any issues with FastMM over the 5 years I have used it, so you woud need a pretty compelling reason to switch. – Misha May 21 '11 at 02:48
1

I'm with Bruce here. Lots of talk about small performance gains (5%, etc), which given that a newer processor or more cores can boost performance by multiples just seems to me to be a waste of time. Similarly, architectural changes can boost performance by way more than you will get by changing memory managers. – Misha May 21 '11 at 03:00
@misha @bruce my app is cpu bound and performs time domain finite element analysis. In multi threaded mode it runs multiple independent simulations in parallel. So in theory I felt I should see 100% cpu utilisation. I did not see that and traced the problem to heap allocation thread contention. Reworking the code to shun the heap gave much improvement. But changing allocator also yielded gains that were significant to me. 30 lines of code for single digit performance gain is a great deal. – David Heffernan May 21 '11 at 07:04
@misha @bruce I realise that I am in a minority in caring so strongly about performance but I don't see why you can't appreciate that matters that aren't critical to you could be critical to others. – David Heffernan May 21 '11 at 07:06
@Misha: you server is processing (if using 100 thread on 8 cores) an average of 0.64 messages per seconds per thread. Also, MM performance impact depends on how memory intensive thread processing is. My application which looks similar to David's, starts to have issue around one thread per core, adn it uses memory allocations a lot. – May 21 '11 at 10:00
@David: yes, your application is at the extreme end of multi-threading, and in particular, uses lots of memory allocations. If you go back and read the original question, then FastMM is perfectly adequate for the OP. Optimisations are good, but in the context of the original question not really relevant in this case. – Misha May 21 '11 at 10:58
@misha My app isn't that heavy on the heap, and never has been. The key is that it is CPU bound. OP gives no details of his app so we can only guess whether this would be an issue. I made those points in my answer. Finally I have no objection to other people choosing whichever memory manager is most suited to their needs. As it happens I use FastMM for debugging and development but msvcrt.dll for production. – David Heffernan May 21 '11 at 11:05
@Idsandon: threads are used sporadically and in bursts. There is no such thing as an average number of messages per second processed by a thread. I log thread CPU usage every 10 minutes and on occasions I have has a single thread hit almost 12.5% on a 8-core machine - which is the equivalent of the thread consuming 100% of a single core. – Misha May 21 '11 at 11:07
@David: all my servers are CPU bound - it is the nature of the processing that I do. I suspect that the granularity of my parallelism is much larger than yours, and possibly the reason why the optimisations that are critical to your software just aren't relevant to mine. A typical chunk of work in my threads (processing a message) is around 1-10 ms, so there tends to be a lot of processing as oppposed to memory allocation. – Misha May 21 '11 at 11:12

score 0 · Answer 5 · edited May 20 '11 at 14:33

0

You could use TopMM: http://www.topsoftwaresite.nl/

You could also try ScaleMM2 (SynScaleMM is based on ScaleMM1) but I have to fix a bug regarding to interthread memory, so not production ready yet :-( http://code.google.com/p/scalemm/

edited May 20 '11 at 14:33

Arnaud Bouchez

42,305
3
71
159

answered May 20 '11 at 13:56

André

8,920
1
24
24

score -1 · Answer 6 · answered May 20 '11 at 13:07

-1

Deplhi 6 memory manager is outdated and outright bad. We were using RecyclerMM both on a high-load production server and on a multi-threaded desktop application, and we had no issues with it: it's fast, reliable and doesn't cause excess fragmentation. (Fragmentation was Delphi's stock memory manager worst issue).

The only drawback of RecyclerMM is that it isn't compatible with MemCheck out of the box. However, a small source alteration was enough to render it compatible.

answered May 20 '11 at 13:07

Zruty

8,377
1
25
31

1

Where does Delphi 6 come into this question, OP is using XE? And who uses MemCheck anymore? And I can't even find RecyclerMM - is it still alive? – David Heffernan May 20 '11 at 19:31
RecyclerMM is only good compared to the Delphi 6 defaults. FastMM is way better than that, and you can use FastMM on any delphi version from 6 to the latest. – Warren P May 21 '11 at 17:03

Need multi-threading memory manager

6 Answers6

Use of the `LOCK` prefix by the Delphi itself

FastMM4 Thread Contention

Linked

Need multi-threading memory manager

6 Answers6

Use of the LOCK prefix by the Delphi itself

FastMM4 Thread Contention

Linked

Use of the `LOCK` prefix by the Delphi itself