-2

Could anyone help me with this problem?

I'm trying to achieve more efficiency that's why I tried to parallel my calculations. After a few tests the results showed me that nothing is faster than calculation on 1 thread. its only 25% of processor load in both cases (1 thread and 4 threads). Could any one know why it's happening? Is there anything I can do to achieve 100% of efficiency (even 90% be better than 25%)?

Below you can see an example code:


ToolsThread = class(TThread)
public
 procedure Execute(); override;
 procedure QuickSortT(var dict: TArray<AnsiString>; iLo, iHi: Integer);
 Procedure QSortT(var dict: TArray<AnsiString>);
 constructor Create();
var
 tab : TArray<AnsiString>;
 tmp1: Longint;
end;

procedure ToolsThread.QuickSortT(var dict: TArray<AnsiString>; iLo, iHi: Integer);
var
 Lo, Hi: Longint;
 Pivot: Pointer;
 T: Pointer;
begin
  Lo := iLo;
  Hi := iHi;
  Pivot := pointer(dict[(Lo + Hi) shr 1]); // shr 1 is slightly faster than div 2;
  repeat
    while dict[Lo] < AnsiString(Pivot) do Inc(Lo);
    while dict[Hi] > AnsiString(Pivot) do Dec(Hi);
    if Lo <= Hi then
    begin
      T := pointer(dict[Lo]);
      pointer(dict[Lo]) := pointer(dict[Hi]);
      pointer(dict[Hi]) := T;
      Inc(Lo) ;
      Dec(Hi) ;
    end;
  until Lo > Hi;
  if Hi > iLo then QuickSort(dict, iLo, Hi) ;
  if Lo < iHi then QuickSort(dict, Lo, iHi) ;
end;

Procedure ToolsThread.QSortT(var dict: TArray<AnsiString>);
begin
 QuickSort(dict, 0, Length(dict)-1);
end;

procedure ToolsThread.Execute();
var
 tmp1, tmp2 : Longint;
 dict: TArray<AnsiString>;
begin
 SetLength(dict, 10000000);
 for tmp1:= 0 to 10000000-1 do
  dict[tmp1] := IntToStr(Random(high(integer)));
 QSortT(dict);
end;

Procedure Main;
var
 Th1, Th2, Th3, Th4: ToolsThread;
begin

 Th1 := ToolsThread.Create();
 Th2 := ToolsThread.Create();
 Th3 := ToolsThread.Create();
 Th4 := ToolsThread.Create();

 debug('Start THR');
 Th1.Start;
 Th2.Start;
 Th3.Start;
 Th4.Start;
 th1.WaitFor;
 th2.WaitFor;
 th3.WaitFor;
 th4.WaitFor;
 debug('THR Done');
end;

Corrected according suggestions. Still 25% CPU load (5-8% per thread)

SOLVED! There is a general problem with some Delphi memory management in multiprocessing. It's not fastMM4 problem and it is resolvable only as workaround for now.

wcale
  • 23
  • 2

2 Answers2

2

First of all, use string everywhere. Don't mess with AnsiString and string types - unless you are with a pre-Delphi 2009... And is you use AnsiString, then IntToStr() will return a plain string which will then be reallocated and converted into an AnsiString when put in the array...

From what I can see in your code, my guess is that most of the time is spent into calling IntToStr(), then converting the returned string into AnsiString, which will involve the heap manager, which will be mostly serialized in the process. Of course, it is a quick guess, with my "brain profiler" (tm) which uses to be wrong.

A QuickSort algorithm is O(n*log(n)) which is faster than O(n) calls of IntToStr() from multiple threads.

Try putting only QSortT(dict) in the Execute method. It should be almost linear.

Another guess is that the following lines are not optimal. They will involve reference counting when swapping strings (assigning strings is in fact a call to an hidden _UStrAsg() low function), which takes a lot of CPU power due to a "lock" on the reference counter. If multiple threads share some CPU cache during the "lock" (which may occur in your case, since IntToStr allocations are done in parallel), then this "lock" would create some cache contention, which could take some noticeable time, even on newer CPUs.

The following may be faster (to be verified):

   var T: pointer;
   ...
    if Lo <= Hi then
    begin
      T := pointer(dict[Lo]);
      pointer(dict[Lo]) := pointer(dict[Hi]);
      pointer(dict[Hi]) := T;
      Inc(Lo) ;
      Dec(Hi) ;
    end;

You may also replace pivot with a pointer but it is a bit more tricky:

var pivot: pointer;
...
  repeat
    Pivot := pointer(dict[(Lo + Hi) shr 1]); // within the 2nd repeat..until
    while dict[Lo] < string(Pivot) do Inc(Lo) ;
    while dict[Hi] > string(Pivot) do Dec(Hi) ; 

Idea is always to use real profiling in such cases. At least, use precision timer/watch in the code you want to measure, and see where the CPU goes into.

Edit: I have written some code using my advices - see https://gist.github.com/synopse/02eb142d35cb44126aed9fd3200a76d1

output is on a 2 core CPU:

Fill done in 3.25s
Sort done in 24.11s
  • with one core used during Fill/Create - as expected
  • with 100% of cores used during Sort/Execute - as expected
Community
  • 1
  • 1
Arnaud Bouchez
  • 42,305
  • 3
  • 71
  • 159
  • Pointer not helped. ` for tmp1:= 0 to 10000000-1 do dict[tmp1] := IntToStr(Random(high(integer)));` Taking 100% of CPU, which is ok, but `QSortT(dict);` in 4 thread taking 25% (5-8% each) – wcale Mar 02 '20 at 10:36
  • Still the same. – wcale Mar 02 '20 at 10:45
  • You did **NOT** put the IntStr() outside the threads, which was my first advice. Generate the dictionaries e.g. in ToolsThread.Create to make memory allocation sequential, with no heap contention, and potentially less CPU cache collisions. – Arnaud Bouchez Mar 02 '20 at 13:40
  • This is definitely memory manager contention - I think you've identified that correctly. – J... Mar 02 '20 at 16:17
0

Your line Pivot := dict[(Lo + Hi) div 2] assigns a part of an AnsiString to a variable of type String. You're doing it over and over. I don't really know what's happening behind the scenes, but after consequently changing all your types to either String or AnsiString, the code will then run 100 % CPU load instead of just using one core.

You must pay more attention to compiler warnings

  • That's not a problem, I corrected this and problem remains. still 25% of CPU usage. – wcale Mar 02 '20 at 10:41
  • I didn't use a profiler, but even stopping the application using the debugger showed how memory was constantly allocated and freed by assigning the `Pivot` variable. Changing *all the variables* to the same types yielded a complete 100% CPU usage for me ‍♂️ – Günther the Beautiful Mar 02 '20 at 10:51
  • Corrected (First post updated), not solved the problem. – wcale Mar 02 '20 at 11:05
  • I tried exactly your code above. Fixing one typo and adding the missing constructor, I'm getting a CPU load of circa 95 %. – Günther the Beautiful Mar 02 '20 at 11:19
  • 3
    We don't really know what code the asker is running, because only part of it is posted, and we don't know the Delphi version. This is why we need to insist on a full repro. – David Heffernan Mar 02 '20 at 11:24
  • @GünthertheBeautiful Really? I have 100% with filling arrays, and the sorting takes 25% - no more till the end. My Delphi version is Tokyo. Maybe some suggestion what I do wrong? – wcale Mar 02 '20 at 11:50
  • @GünthertheBeautiful Ok, I checked once again in the same way as you (only this part with clean code and nothing more)... I'm surprised. It's working? So the problem is somewhere else. I don't know yet, but code is ok. – wcale Mar 02 '20 at 11:59
  • 3
    This is, yet again, why we need to see a [mcve]. I hope that this lesson hits home now. – David Heffernan Mar 02 '20 at 12:01
  • @wcale: let me guess, are you using Synchronize() somewhere in your code? – whosrdaddy Mar 02 '20 at 12:50
  • @whosrdaddy, No, I think the problem lays in tbbmalloc.dll, from ParallelSort library. I removed that from my code, because that library was unstable. I leave only declaration in uses. After removed this from uses, this code works with almost 100% CPU load. – wcale Mar 02 '20 at 14:27
  • 2
    Best to remove the question then since it depends on information not present in the question – David Heffernan Mar 02 '20 at 15:24