
I am trying out the Parallel Programming features of Delphi XE7 Update 1.

I created a simple TParallel.For loop that basically does some bogus operations to pass the time.

I launched the program on a 36 vCPU AWS instance (c4.8xlarge) to see what the gain from Parallel Programming could be.

When I first launch the program and execute the TParallel.For loop, I see a significant gain (although admittedly a lot less than I anticipated with 36 vCPUs):

Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms

If I do not close the program and run the pass again on the 36 vCPU machine shortly afterwards (immediately, or some 10-20 seconds later), the parallel pass worsens dramatically:

Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms

If I don't close the program but wait a few minutes (not a few seconds, a few minutes) before running the pass again, I again get the results from when the program was first launched (a roughly 10x improvement in response time).

The very first pass right after launching the program is always fast on the 36 vCPU machine, so it seems that this effect only appears from the second TParallel.For call onwards within the same program run.

This is the sample code I'm running:

unit ParallelTests;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  System.Threading, System.SyncObjs, System.Diagnostics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    SingleThreadCheckBox: TCheckBox;
    ParallelCheckBox: TCheckBox;
    UnitsEdit: TEdit;
    Label1: TLabel;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  matches: integer;
  i,j: integer;
  sw: TStopWatch;
  maxItems: integer;
  referenceStr: string;

begin
  sw := TStopWatch.Create;

  maxItems := 5000;

  Randomize;
  SetLength(referenceStr, 120000);
  for i := 1 to 120000 do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  if ParallelCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    TParallel.For(1, MaxItems,
      procedure (Value: Integer)
        var
          index: integer;
          found: integer;
        begin
          found := 0;
          for index := 1 to length(referenceStr) do begin
            if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
              inc(found);
            end;
          end;
          TInterlocked.Add(matches, found);
        end);
    sw.Stop;
    Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;

  if SingleThreadCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    for i := 1 to MaxItems do begin
      for j := 1 to length(referenceStr) do begin
        if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
          inc(matches);
        end;
      end;
    end;
    sw.Stop;
    Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(Matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
end;

end.

Is this working as designed? I found this article (http://delphiaball.co.uk/tag/parallel-programming/) recommending that I let the library decide the thread pool size, but I do not see the point of using Parallel Programming if I have to wait minutes between requests just so that the next request is served faster.
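For reference, this is roughly what explicitly configuring the default pool would look like, instead of letting the library decide. It is only a sketch based on my understanding of the System.Threading and System.Classes members involved (TThreadPool.Default, SetMinWorkerThreads, TThread.ProcessorCount), and I have not verified that it changes the results on the 36 vCPU instance:

// Requires System.SysUtils, System.Classes and System.Threading in the uses clause.
procedure ConfigureDefaultPool;
var
  accepted: Boolean;
begin
  // Sketch: raise the minimum worker-thread count of the default pool so that
  // its dynamic sizing heuristics have less room to throttle the parallel loop.
  // SetMinWorkerThreads is expected to return False if the value is rejected.
  accepted := TThreadPool.Default.SetMinWorkerThreads(TThread.ProcessorCount);
  if not accepted then
    raise Exception.Create('SetMinWorkerThreads rejected the requested value');
end;

The idea would be to call something like ConfigureDefaultPool once before the first TParallel.For; whether that would actually help is part of what I am asking.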

Am I missing anything on how a TParallel.For loop is supposed to be used?

Please note that I cannot reproduce this on an AWS m3.large instance (2 vCPUs according to AWS). On that instance I always get a slight improvement, and subsequent TParallel.For calls made shortly after one another do not get worse:

Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms

So it seems that this effect occurs when there are many cores available (36), which is a pity, because the whole point of Parallel Programming is to benefit from many cores. I wonder whether this is a library bug triggered by the high core count, or by the fact that the core count is not a power of 2 in this case.

UPDATE: After testing with AWS instances of various vCPU counts, this seems to be the behaviour:

  • 36 vCPUs (c4.8xlarge): you have to wait minutes between subsequent calls to a vanilla TParallel.For (this makes it unusable for production).
  • 32 vCPUs (c3.8xlarge): you have to wait minutes between subsequent calls to a vanilla TParallel.For (this makes it unusable for production).
  • 16 vCPUs (c3.4xlarge): you only have to wait sub-second times between calls. It could be usable if load is low but response time is still important.
  • 8 vCPUs (c3.2xlarge): it seems to work normally.
  • 4 vCPUs (c3.xlarge): it seems to work normally.
  • 2 vCPUs (m3.large): it seems to work normally.
Pep
  • @Pep If you think that the library is an issue, write the code using another library and compare. I doubt that the library is the issue. – David Heffernan Mar 15 '15 at 19:39
  • I tested it a little bit more. It seems that, at least with AWS, the Parallel library has some issues when vCPUs > 8. With vCPU = 16 it works a lot better than with vCPU = 32 or 36, but it still has issues. Probably the TParallel.For call has been fine-tuned for systems with up to 8 virtual cores (desktop machines). I will update the question with my findings. – Pep Mar 15 '15 at 19:45
  • That's not likely at all. I don't know why you would think that. Don't guess. Run equivalent code under OTL and see what happens. – David Heffernan Mar 15 '15 at 19:47
  • It's just black-box testing with the code in the question on AWS. I just published the behaviour I got with Delphi XE7 Update 1 on the mentioned AWS instance types. – Pep Mar 15 '15 at 19:57
  • So, I don't think that the library will have been optimised for a certain number of cores. But I think that the library is likely the root of the problem. A comparison with OTL would give a good idea of whether that is the case. What I must say though is that the new RTL parallel library is utter rubbish. There have been countless posts here exposing it as shockingly badly implemented. I doubt that I could ever bring myself to use it. I commend OTL to you. – David Heffernan Mar 15 '15 at 20:00
  • FWIW, I can observe the issue you report on my 16 way machine. I get reasonable scaling the first run. The ratio is around 7, but it's really an 8 way machine with hyper threading. So whilst there are 16 logical processors, you seldom see better than 8x scaling. Intel marketing department are kidding us all. Then the next run, the parallel perf is shocking. So I guess something is broken in the library. An OTL variant would help show that. That's really the next step. – David Heffernan Mar 15 '15 at 20:02
  • I think that it's pretty clear that at some point you are hitting a bug in the parallel library that results in the code executing serially. This is surely not by design. Surely not due to a tuning error. It is surely down to shoddy implementation. Frankly, Embarcadero have a dire track record of producing correct threading code. After the debacle that was `TMonitor`, how can anyone trust them? – David Heffernan Mar 15 '15 at 20:09
  • An OTL version can simply be accomplished by changing the line `TParallel.For(1, MaxItems,`with `Parallel.For(1, MaxItems).Execute(`. Using OtlParallel should be obvious. – Uwe Raabe Mar 16 '15 at 00:00
  • @Pep, please go to http://quality.embarcadero.com and file a report. Even though we're unable to reproduce this right now, this will provide a central location where you and others can comment and add more information/test cases. It will also be linked to our internal system for tracking. There is something that we're likely missing that is the key. Thanks. – Allen Bauer Mar 16 '15 at 20:58
  • @AllenBauer You cannot reproduce it even with an AWS Windows 2012 R2 Base c3.8xlarge instance? – Pep Mar 16 '15 at 21:09
  • @Pep, I cannot reproduce it with a physical 32-CPU system. We're still working on getting an AWS instance set up (lots of corporate red tape to cut through right now). That is why I suggested it be at least put into the system so it can be tracked. – Allen Bauer Mar 16 '15 at 22:33

1 Answer


I created two test programs, based on yours, to compare System.Threading and OTL. I built with XE7 Update 1 and OTL r1397 (the OTL source I used corresponds to release 3.04), targeting the 32-bit Windows compiler with release build options.

My test machine is a dual Intel Xeon E5530 running Windows 7 x64. The system has two quad-core processors, so 8 cores in total, but it reports 16 logical processors due to hyper-threading. Experience tells me that hyper-threading is just marketing guff and I've never seen scaling beyond a factor of 8 on this machine.

Now for the two programs, which are almost identical.

System.Threading

program SystemThreadingTest;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.Threading;

const
  maxItems = 5000;
  DataSize = 100000;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  TParallel.For(1, maxItems,
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

OTL

program OTLTest;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  Winapi.Messages,
  System.Diagnostics,
  OtlParallel;

const
  maxItems = 5000;
  DataSize = 100000;

procedure ProcessThreadMessages;
var
  msg: TMsg;
begin
  while PeekMessage(Msg, 0, 0, 0, PM_REMOVE) and (Msg.Message <> WM_QUIT) do begin
    TranslateMessage(Msg);
    DispatchMessage(Msg);
  end;
end;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  Parallel.For(1, maxItems).Execute(
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  ProcessThreadMessages;

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

And now the output.

System.Threading output

Parallel matches: 19230817 in 374ms
Serial matches: 19230817 in 2423ms
Parallel matches: 19230698 in 374ms
Serial matches: 19230698 in 2409ms
Parallel matches: 19230556 in 368ms
Serial matches: 19230556 in 2433ms
Parallel matches: 19230635 in 2412ms
Serial matches: 19230635 in 2430ms
Parallel matches: 19230843 in 2441ms
Serial matches: 19230843 in 2413ms
Parallel matches: 19230905 in 2493ms
Serial matches: 19230905 in 2423ms
Parallel matches: 19231032 in 2430ms
Serial matches: 19231032 in 2443ms
Parallel matches: 19230669 in 2440ms
Serial matches: 19230669 in 2473ms
Parallel matches: 19230811 in 2404ms
Serial matches: 19230811 in 2432ms
....

OTL output

Parallel matches: 19230667 in 422ms
Serial matches: 19230667 in 2475ms
Parallel matches: 19230663 in 335ms
Serial matches: 19230663 in 2438ms
Parallel matches: 19230889 in 395ms
Serial matches: 19230889 in 2461ms
Parallel matches: 19230874 in 391ms
Serial matches: 19230874 in 2441ms
Parallel matches: 19230617 in 385ms
Serial matches: 19230617 in 2524ms
Parallel matches: 19231021 in 368ms
Serial matches: 19231021 in 2455ms
Parallel matches: 19230904 in 357ms
Serial matches: 19230904 in 2537ms
Parallel matches: 19230568 in 373ms
Serial matches: 19230568 in 2456ms
Parallel matches: 19230758 in 333ms
Serial matches: 19230758 in 2710ms
Parallel matches: 19230580 in 371ms
Serial matches: 19230580 in 2532ms
Parallel matches: 19230534 in 336ms
Serial matches: 19230534 in 2436ms
Parallel matches: 19230879 in 368ms
Serial matches: 19230879 in 2419ms
Parallel matches: 19230651 in 409ms
Serial matches: 19230651 in 2598ms
Parallel matches: 19230461 in 357ms
....

I left the OTL version running for a long time and the pattern never changed. The parallel version was always around 7 times faster than the serial.

Conclusion

The code is astonishingly simple. The only reasonable conclusion that can be drawn is that the implementation of System.Threading is defective.

There have been numerous bug reports relating to the new System.Threading library. All the signs are that its quality is poor. Embarcadero have a long track record of releasing sub-standard library code. I'm thinking of TMonitor, the XE3 string helper, earlier versions of System.IOUtils, FireMonkey. The list goes on.

It seems clear that quality is a big problem with Embarcadero. Code is released that quite clearly has not been tested adequately, if at all. This is especially troublesome for a threading library where bugs can lie dormant and only be exposed in specific hardware/software configurations. The experience from TMonitor leads me to believe that Embarcadero do not have sufficient expertise to produce high quality, correct, threading code.

My advice is that you should not use System.Threading in its current form. Until such a time as it can be seen to have sufficient quality and correctness, it should be shunned. I suggest that you use OTL.


EDIT: The original OTL version of the program had a live memory leak caused by an ugly implementation detail: Parallel.For creates tasks with the .Unobserved modifier, which means those tasks are only destroyed when some internal message window receives a 'task has terminated' message. That window is created in the same thread as the Parallel.For caller, i.e. the main thread in this case. As the main thread was not processing messages, the tasks were never destroyed and memory consumption (plus other resources) just piled up. It is possible that this is why the program hanged after some time.

David Heffernan
  • I am completely unable to reproduce this same behavior, at least on an 8 core system. I'll need to track down a >8 core system to test. FWIW, I also get a hang using OTL if left running for a while. This is the same hang I have seen with OTL where it seems to run out of some OS resource. – Allen Bauer Mar 16 '15 at 16:49
  • @AllenBauer I don't know if there's any way that I can help you by supplying diagnostics from my own executions of this code. Probably not. But if there is, then let me know. – David Heffernan Mar 16 '15 at 16:54
  • @AllenBauer I've not seen a hang in the OTL version. How long does it need to be left running for that to occur? – David Heffernan Mar 16 '15 at 17:00
  • It takes a while. If I switch to use OTL in the Conway's life demo (shipped with XE7), it happens sooner. I suspect it happens more quickly with a GUI app since it seems OTL leverages the message queues for inter-thread communication. – Allen Bauer Mar 16 '15 at 17:06
  • I'll pass that info on to Primoz – David Heffernan Mar 16 '15 at 17:07
  • I'm running the test case on a 12 core system (one of our build-machines) and it is still doing fine. Is the tipping point 16 cores? What if you call TThreadPool.Default.SetMinWorkerThreads(TThread.ProcessorCount div 2); at the start? I get the same expected results targeting Win64 on the same 12 core machine. – Allen Bauer Mar 16 '15 at 17:11
  • @Allen I'll try that and get back to you. It will take a little while before I can. I'm in the UK and it's going home time here. Thanks for taking an interest. – David Heffernan Mar 16 '15 at 17:18
  • I'm working on getting access to a 32-core system (the InterBase team has one). If I can reproduce this, I'll try and get a patch asap. It'll likely be informal at first. – Allen Bauer Mar 16 '15 at 17:51
  • @Allen I'm seeing the defective behaviour with `TThreadPool.Default.SetMinWorkerThreads(...)`. I added it just before the `while True` loop begins. And `TThread.ProcessorCount` evaluates to 16. It took quite a few iterations before the defective behaviour kicked in. – David Heffernan Mar 16 '15 at 17:53
  • @Allen Have you seen this one too: http://stackoverflow.com/questions/29012439/how-can-i-use-ttask-waitforany-from-the-new-threading-library – David Heffernan Mar 16 '15 at 17:54
  • @David The comments regarding EMB's experience (or lack thereof) seem a bit overblown. OTL has had PLENTY of bug reports, race conditions etc. I would not interpret that to mean that Primoz "does not have sufficient expertise to produce high quality, correct, threading code". It just means that threading libraries are hard, and often a bug is completely unanticipated until someone reports it. As long as EMB is working to improve the library, that is the most we can realistically hope for. – Dave Novo Mar 16 '15 at 17:59
  • @DaveNovo If it was just the threading library. But the poor quality pervades Emba's recent library code releases. TMonitor was especially bad. The XE3 string helper was astounding. Many bugs that would have been found had the code been tested, or even executed. Methods left with blank implementations. I stand by what I said. The quality is poor. The threading library is not fit for production use today. I expect that early versions of OTL had quality problems too. – David Heffernan Mar 16 '15 at 18:01
  • @DaveNovo And perhaps some of this answer was born of pent-up frustration. For instance, it still boggles me that `Set8087CW` and `SetMXCSR` are not threadsafe. With consequences that extend quite far. For instance, the Win64 variant of `TextToFloat`. Even the overload with format settings is not threadsafe. This bug was reported (by me) years ago. This is ironic at best when you consider the introduction of a new threading library. I am frustrated, for sure. – David Heffernan Mar 16 '15 at 18:06
  • On a 32 core Xeon system, I was unable to reproduce the issue after running the test for about 5 minutes. Was that sufficient time? – Allen Bauer Mar 16 '15 at 18:35
  • Yes. It goes within the first 30s for me. – David Heffernan Mar 16 '15 at 18:36
  • @DaveNovo I don't take the generalizations personally. I would agree that threading libraries are hard and so far most issues have been resolved. There may still be more issues, and feedback is critical. As for TMonitor, most issues have been related to optimizations. It now performs as well as or better than a critical section, based on that feedback. – Allen Bauer Mar 16 '15 at 18:38
  • The criticism is not meant personally. I have the highest regard for @Allen. My criticism is aimed at the quality problems that are clear for all to see. I really want the product quality to improve. I want to contribute to that. It's frustrating then to report critical bugs like SetMXCSR/Set8087CW non-thread safety, the consequent impact on FloatToText, and see the bugs remain unfixed. What is going wrong? – David Heffernan Mar 16 '15 at 18:50
  • I'm running fixed version (with added ProcessThreadMessages) now in XE2 and memory consumption is stable. We'll see if/when the program hangs ... – gabr Mar 16 '15 at 19:10
  • SystemThreadingTest fails immediately on my 24-core machine (2 CPUs with 6 hyperthreaded cores each). Parallel matches: 19230671 in 351ms Serial matches: 19230671 in 3406ms Parallel matches: 19230795 in 3536ms Serial matches: 19230795 in 3497ms Parallel matches: 19230571 in 3493ms Serial matches: 19230571 in 3416ms Delphi XE7 with all patches, program compiled in Debug but run without debugger. – gabr Mar 16 '15 at 19:20
  • @AllenBauer I think that if you try it on AWS with a c3.8xlarge (32 vCPU) instance you should be able to reproduce it. It should fail almost immediately. – Pep Mar 16 '15 at 19:28
  • It also fails on my notebook after some time (more than 5 minutes): Parallel matches: 19230549 in 3033ms Serial matches: 19230549 in 2968ms Parallel matches: 19230527 in 3016ms Serial matches: 19230527 in 2946ms Parallel matches: 19230823 in 3042ms Serial matches: 19230823 in 2959ms This one has a simple 4 core CPU, running in hyperthreading mode, so 8 virtual cores. – gabr Mar 16 '15 at 19:32
  • @gabr So the hang has been fixed? I'll be sure to grab the latest version of OTL and test it. From your comment, it seems that my hypothesis about a full message queue was correct, no? – Allen Bauer Mar 16 '15 at 19:36
  • If the hang was caused by messages not being processed in the test code, then it was fixed (by changing the test code). Otherwise, it was not. I'm now running the 'otl' version of the test overnight to see if there will be any problems. (Damn, multithreading is hard!) – gabr Mar 16 '15 at 19:37
  • @Pep We're also working on spooling up an AWS instance to test this. As I mentioned, I did try on a 32core Xeon system running Server 2012 and was unable to reproduce. What OS are you running on the AWS instance? – Allen Bauer Mar 16 '15 at 19:37
  • @AllenBauer Here's a log from my notebook (8 cores): https://www.dropbox.com/s/psowizjwcipdvpm/threading.log – gabr Mar 16 '15 at 19:41
  • @AllenBauer It was the "Microsoft Windows Server 2012 R2 Base" AMI that is offered when you launch (QuickStart tab). I did it in the EU West region. – Pep Mar 16 '15 at 19:42
  • @AllenBauer This article mentions something about CPU protection: http://delphiaball.co.uk/tag/parallel-programming/ If that is true, I wonder if the code that tries to assess CPU overheating is overcautious and throttles threads because it gets into panic mode too soon when there are many cores involved. – Pep Mar 16 '15 at 19:52
  • "Damn, multithreading is hard" @gabr I can certainly agree with that sentiment. I would amend that to be "Damn, bullet-proof test-cases are even harder!" – Allen Bauer Mar 16 '15 at 20:20
  • @Pep Yes, the thread pool does keep track of cpu loading, but that is intended to keep the number of threads from growing to a point that it over-subscribes the CPU. If there are many threads in the pool, but they're not loading the CPU because they may be blocked within a task, new threads will be spooled to service any pending tasks. As long as the threads in the pool are doing something, new threads won't be created... that is the theory, at least. Our internal tests certainly demonstrated that. – Allen Bauer Mar 16 '15 at 20:25
  • @Allen Bauer Dynamic thread pools based on CPU usage are very fragile as soon as they involve I/O or GPU. Lesson learned ten years ago. You can end up piling up work while waiting on I/O (low CPU usage), and when the I/O data starts flowing, you have too many threads up, starving CPU cache, etc. Rinse & repeat. – Eric Grange Mar 17 '15 at 08:39
  • @EricGrange That is a work item: add a way to flag certain tasks as being I/O-related so they are segregated. They would then be tracked independently. It would, of course, require the user to be fully aware of the kinds of tasks they're scheduling. – Allen Bauer Mar 17 '15 at 17:02
  • @Allen Bauer Might not be trivial: I/O tasks can be of the I/O-then-process kind (load from a DB, then process, then save to DB; load from a file, then process, then I/O to the GPU, etc.), so it also requires breaking up the tasks. Another non-trivial issue is getting a meaningful CPU usage measurement when running in a VM. – Eric Grange Mar 19 '15 at 12:58
  • does Seattle 10 fix the problem? – justyy Oct 08 '15 at 11:07