I maintain a Delphi 7 application whose server part can also be compiled with CrossKylix. For performance reasons, I am benchmarking multithreading and critical section use.

I made a console application that creates 100 TThread instances, each of which computes a Fibonacci number. Then I added a critical section so that only one thread computes a Fibonacci number at a time. As expected, the application is faster without the critical section.

Then I made a console application that creates 100 TThread instances, each of which adds words to a local TStringList and sorts that TStringList. Then I added a critical section so that only one thread executes at a time. On Windows, as expected, the application runs faster without the critical section. On Linux, the version with the critical section runs 2 times faster than the version without it.

The CPU on Linux is an AMD Opteron with 6 cores, so the app should benefit from multithreading.

Can somebody explain why the version with the critical section is faster?


Edit (added some code)

Thread creation and waiting

tmpDeb := Now;
i := NBTHREADS;
while i > 0 do
begin
    tmpFiboThread := TFiboThread.Create(true);
    tmpFiboThread.Init(i, ParamStr(1) = 'Crit');
    Threads.AddObject(IntToStr(i), tmpFiboThread);
    i := i-1;
end;

i := 0;
while i < NBTHREADS do
begin
    TFiboThread(Threads.Objects[i]).Resume;
    i := i+1; 
end;

i := 0;
while i < NBTHREADS do
begin
    TFiboThread(Threads.Objects[i]).WaitFor;
    i := i+1; 
end;

WriteLn('Total processing time: ' + inttostr(MilliSecondsBetween(Now, tmpDeb)) + ' milliseconds');

The TThread and critical section use

type
  TFiboThread = class(TThread)
  private
    n : Integer;
    UseCriticalSection : Boolean;
  protected
    procedure Execute; override;
  public
    ExecTime : Integer;

    procedure Init(n : integer; WithCriticalSect : Boolean);
  end;

var
  CriticalSection : TCriticalSection;

implementation

uses DateUtils;

function fib(n: integer): integer;
var
  f0, f1, tmpf0, k: integer;
begin
    f1 := n + 100000000;
    IF f1 >1 then
    begin
      k := f1-1;
      f0 := 0;
      f1 := 1;
      repeat
        tmpf0 := f0;
        f0 := f1;
        f1 := f1+tmpf0;
        dec(k);
      until k = 0;
    end
    else
      IF f1 < 0 then
        f1 := 0;
    fib := f1;
end;

function StringListSort(n: integer): integer;
var
  tmpSL : TStringList;
  i : Integer;
begin
    tmpSL := TStringList.Create;
    i := 0;
    while i < n + 10000 do
    begin
        tmpSL.Add(inttostr(MilliSecondOf(now)));
        i := i+1;
    end;
    tmpSL.Sort;

    Result := StrToInt(tmpSL.Strings[0]);
    tmpSL.Free;
end;

{ TFiboThread }

procedure TFiboThread.Execute;
var
  tmpStr : String;
  tmpDeb : TDateTime;
begin
    inherited;

    if Self.UseCriticalSection then
        CriticalSection.Enter;

    tmpDeb := Now;

    tmpStr := inttostr(fib(Self.n));
    //tmpStr := inttostr(StringListSort(Self.n));

    Self.ExecTime := MilliSecondsBetween(Now, tmpDeb);

    if Self.UseCriticalSection then
        CriticalSection.Leave;

    Self.Terminate;
end;

procedure TFiboThread.Init(n : integer; WithCriticalSect : Boolean);
begin
    Self.n := n;
    Self.UseCriticalSection := WithCriticalSect;
end;



initialization
    CriticalSection := TCriticalSection.Create;

finalization
    FreeAndNil(CriticalSection);

Edit 2

I read the question why-using-more-threads-makes-it-slower-than-using-less-threads. As I understand it, context switching costs a lot more CPU time on Linux with a Kylix build than it does on Win32.

  • Can you show your demo code please. Performance is a tricky subject at the best of times. Commenting on that which we cannot see is tricky and prone to error. – David Heffernan Jan 27 '15 at 17:05
  • My first guess at the cause of the difference is cache line contention. That is, your threads are allocated at the same time, and whether they occupy the same cache line will depend on the behaviour of the heap allocator. If you make sure each TThread object has additional unused space, making it bigger than the processor's cache line, you may see the behaviour change. I haven't checked, but your Opteron may have 128-byte cache lines, in which case you may have several threads per cache line. – Kanitatlan Jan 28 '15 at 17:35
  • I tried adding a TByteDynArray to my TFiboThread and calling SetLength(mybyteArray, 255); in the constructor, so each TFiboThread instance should be bigger than 255 bytes, but it doesn't change the behavior... – tenpigs pinget Jan 30 '15 at 09:08
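
A side note on the padding experiment in the comments: a dynamic array field is a reference, so its data lives in a separate heap block and the object instance itself grows only by the size of a pointer. A minimal sketch for checking this, assuming the Delphi 7 `Types` unit; the class names `TDynPadded` and `TStaticPadded` are made up for illustration:

```pascal
program PadDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Types;

type
  // Hypothetical class padded with a dynamic array field.
  TDynPadded = class
  public
    Pad: TByteDynArray;           // reference only: the bytes live in a separate heap block
  end;

  // Hypothetical class padded with a static array field.
  TStaticPadded = class
  public
    Pad: array[0..255] of Byte;   // stored inline: the instance itself grows by 256 bytes
  end;

begin
  // InstanceSize reports how many bytes one object of the class occupies.
  WriteLn('TDynPadded.InstanceSize    = ', TDynPadded.InstanceSize);
  WriteLn('TStaticPadded.InstanceSize = ', TStaticPadded.InstanceSize);
end.
```

If the goal is to push thread objects onto separate cache lines, the static array variant is the one that actually enlarges each instance. Whether that changes the timings here is a separate question.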

1 Answer

Filling and sorting a string list involves a lot of memory allocations, i.e. calls to the memory manager. The memory manager itself is thread-safe, which means it uses some kind of critical section internally. So if a hundred threads run simultaneously without the global critical section, they make thousands of calls to the memory manager, which means thousands of internal locks (instead of the single lock of the global critical section).

That's why the pure Fibonacci function (without building and sorting a string list) behaves as expected: it has no internal, hidden locks.
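
One way to check this explanation is to reduce the workload to nothing but memory manager calls and compare the two modes again. The following is only a sketch under assumptions, not the code from the question: `TAllocThread`, `GlobalLock` and the `ALLOCS` count are hypothetical names and values, and the block sizes are arbitrary.

```pascal
program AllocBench;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, SyncObjs, DateUtils;

const
  NBTHREADS = 100;     // same thread count as in the question
  ALLOCS    = 100000;  // hypothetical number of allocations per thread

type
  // Hypothetical worker whose only job is to call the memory manager.
  TAllocThread = class(TThread)
  private
    FUseLock: Boolean;
  protected
    procedure Execute; override;
  public
    constructor Create(UseLock: Boolean);
  end;

var
  GlobalLock: TCriticalSection;

constructor TAllocThread.Create(UseLock: Boolean);
begin
  inherited Create(True);  // created suspended, started later with Resume
  FUseLock := UseLock;
end;

procedure TAllocThread.Execute;
var
  i: Integer;
  p: Pointer;
begin
  if FUseLock then
    GlobalLock.Enter;
  try
    for i := 1 to ALLOCS do
    begin
      GetMem(p, 16 + (i and 63));  // small block of varying size
      FreeMem(p);
    end;
  finally
    if FUseLock then
      GlobalLock.Leave;
  end;
end;

var
  Threads: array[0..NBTHREADS - 1] of TAllocThread;
  i: Integer;
  Start: TDateTime;
  UseLock: Boolean;
begin
  UseLock := ParamStr(1) = 'Crit';  // same command-line switch as in the question
  GlobalLock := TCriticalSection.Create;
  try
    for i := 0 to NBTHREADS - 1 do
      Threads[i] := TAllocThread.Create(UseLock);
    Start := Now;
    for i := 0 to NBTHREADS - 1 do
      Threads[i].Resume;
    for i := 0 to NBTHREADS - 1 do
    begin
      Threads[i].WaitFor;
      Threads[i].Free;
    end;
    WriteLn('Elapsed: ' + IntToStr(MilliSecondsBetween(Now, Start)) + ' ms');
  finally
    GlobalLock.Free;
  end;
end.
```

If the run with `Crit` is again faster on Linux, that would point at contention inside the memory manager rather than at the sort itself. If both modes take about the same time, the cause is more likely elsewhere.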