11

I have an app with a number of worker threads, one for each core. On a modern 8-core machine, I have 8 of these threads. My app loads many plugins, which also have their own worker threads. Because the app uses huge blocks of memory (photos, e.g. 200 MB), I have a memory fragmentation problem (32-bit app). The problem is that every thread reserves {$MAXSTACKSIZE ...} bytes of address space. It's not using the physical memory, only the address space. I reduced MAXSTACKSIZE from 1 MB to 128 KB, and it seems to work, but I don't know if I'm close to the limit. Is there any way to measure how much stack is really used?

Steffen Binas
  • You can set the stack size separately for each thread, although Delphi TThread implementation does not surface it (see QC #77203), instead of changing the global setting. –  May 27 '11 at 10:20
  • Here's a link to article QC77203: http://qc.embarcadero.com/wc/qcmain.aspx?d=77203 – Johan May 27 '11 at 12:43
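To illustrate the comment about per-thread stack sizes: a minimal sketch that bypasses TThread and calls BeginThread directly. It assumes Delphi 2010 or later (for TThreadID); WorkerProc and the 256 KB figure are made up for the example. The flag is declared locally in case the Windows unit of a given Delphi version does not define it. Without that flag, Windows treats the size argument as the initial commit size rather than the reserve size.

```pascal
program PerThreadStack;
{$APPTYPE CONSOLE}
uses
  Windows;

const
  // Value from the Windows SDK headers, declared locally as a precaution.
  STACK_SIZE_PARAM_IS_A_RESERVATION = $00010000;

// Hypothetical worker body; BeginThread expects exactly this signature.
function WorkerProc(Parameter: Pointer): Integer;
begin
  Result := 0;
end;

var
  ThreadHandle: THandle;
  ThreadId: TThreadID;
begin
  // Ask for 256 KB of reserved address space for this one thread only,
  // independent of the global {$MAXSTACKSIZE} setting.
  ThreadHandle := BeginThread(nil, 256 * 1024, WorkerProc, nil,
    STACK_SIZE_PARAM_IS_A_RESERVATION, ThreadId);
  if ThreadHandle <> 0 then
  begin
    WaitForSingleObject(ThreadHandle, INFINITE);
    CloseHandle(ThreadHandle);
  end;
end.
```

This only helps for threads you create yourself; plugin threads created through TThread would still use the process-wide setting.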

6 Answers

12

Use this to compute the amount of memory committed for the current thread's stack:

function CommittedStackSize: Cardinal;
asm
  mov eax,[fs:$4] // base of the stack, from the Thread Environment Block (TEB)
  mov edx,[fs:$8] // address of lowest committed stack page
                  // this gets lower as you use more stack
  sub eax,edx
end;

I don't have any other ideas.
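A hedged usage sketch for the function above: call it from inside the thread you want to measure, ideally just before the thread finishes. As far as I know, stack pages stay committed once they have been touched, so a late reading approximates the peak. TWorker and DoWork are hypothetical names; the function body is copied from this answer and is Win32-only.

```pascal
program StackReportDemo;
{$APPTYPE CONSOLE}
uses
  Classes;

// Copied from the answer above (32-bit Windows only).
function CommittedStackSize: Cardinal;
asm
  mov eax,[fs:$4] // base of the stack
  mov edx,[fs:$8] // lowest committed stack page
  sub eax,edx
end;

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

// Hypothetical workload: a large local buffer forces stack pages
// to be committed.
procedure DoWork;
var
  Buffer: array[0..64 * 1024 - 1] of Byte;
begin
  FillChar(Buffer, SizeOf(Buffer), 0);
end;

procedure TWorker.Execute;
begin
  DoWork;
  // Measured inside the worker thread, after the work is done.
  Writeln('Committed stack at thread end: ', CommittedStackSize, ' bytes');
end;

var
  Worker: TWorker;
begin
  Worker := TWorker.Create(False);
  try
    Worker.WaitFor;
  finally
    Worker.Free;
  end;
end.
```

The value is rounded to whole pages, so it is an upper bound on what the thread actually used rather than an exact byte count.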

Rob Kennedy
opc0de
  • @opc0de: Why did you use pastebin instead of embedding your code on so? – Jens Mühlenhoff May 27 '11 at 09:35
  • Works nice! "var a: array[0..1024*100] of Int64;" gives stack of 800kb :-) – André May 27 '11 at 10:05
  • @Jens Mühlenhoff, pastebin has many advantages over a locally embedded code block, including but not limited to line numbering and proper syntax highlighting – Premature Optimization May 27 '11 at 11:59
  • @user, and one huge disadvantage...... besides your IDE will do line numbering and syntax highlighting after a copy paste. – Johan May 27 '11 at 12:45
  • @opc0de, you need to save the ebx register. Delphi assumes only EAX, ECX and EDX are changed, and may fail when ebx is changed. Replace EBX with EDX, then your code is safe! – Johan May 27 '11 at 12:47
  • @Johan You are right Johan but in my experience modifying ebx register doesn't crash the app but i will modify it. – opc0de May 27 '11 at 12:51
  • One thing Stack Overflow has taught me, @User, is that syntax highlighting doesn't need to be particularly "proper." Most of the time, the generic highlighting it uses for Delphi is just fine for making code more readable. As for line numbers, [that's been discussed here before](http://meta.stackexchange.com/q/7119/33732). – Rob Kennedy May 27 '11 at 12:53
  • Opc0de, could you please put some comments in that code to explain what it's doing? – Rob Kennedy May 27 '11 at 12:56
  • @opc0de, I changed the ebx register to edx, so the push-pop is not needed. – Johan May 27 '11 at 12:59
  • The TEB (Thread Environment Block) starts at fs:$0; fs:$4 and fs:$8 hold the stack's start (base) and end (lowest committed page). The code subtracts one value from the other to obtain the stack space in use and stores it in the variable eu. – opc0de May 27 '11 at 13:01
  • @Rob Kennedy, looks like you are *enduring* what the current highlighter does to blocks ;) actually a bad lesson, since the current highlighter more likely *is defunct* than *works*. A proper highlighter **must** respect the language it decorates or it will colour code at random. For a more accurate failure rate look at [this](http://stackoverflow.com/questions/6122128/am-i-restricted-to-the-cursor-numbers-defined-in-controls-pas-under-delphi-7/6134366#6134366). See, comments are ignored entirely and processed as statements. Exactly the same pattern as above, with a suddenly "special" `sub`. – Premature Optimization May 27 '11 at 15:33
  • @Rob Kennedy (continuation) RE: line numbering - all i see there is excuses and reasoning why not to implement it. I'm puzzled. – Premature Optimization May 27 '11 at 15:40
  • @Johan, what you are implying is to replace machine labour with manual preparation steps. This is unacceptable. – Premature Optimization May 27 '11 at 15:47
  • @user, no, having to switch to another site where you have no question, no context, no comments and no competing answers is unacceptable. If I copy-paste, I can put the IDE on screen 2 and keep both of them in view. If I open the link it will hide SO from view. Besides: ^A, ^C, Alt-Tab, ^V, where's the work? – Johan May 27 '11 at 15:53
  • @Johan, why not to open pastebin on the secondary display then? Face it, pastebin is better at presenting source code snippets (and, BTW, copypasting code to the IDE is easier from there either). An idea of doing machine's work is very disturbing itself. – Premature Optimization May 27 '11 at 16:17
  • @user, The idea of having to click out of SO to see code is disturbing! Esp for a 7 line piece of code. Its a classic case of `putting will on might` (You **might** want to have line numbers and syntax highlighting, so I **will** put it on external site X), even though I never care about these things. – Johan May 27 '11 at 16:32
  • @Johan, personally, i prefer well respected external site. Of course, the idea is to have all of programmer tool on hand (its programmers' website, after all) but developers dont want to implement it (they have their reasons, eg: keeping hosting very cheap), while users are getting used to absence of it too easily (actually, w/o any rational reasons) and do not demand it. Circulus vitiosus! RE: 7 LoC - this remark is underestimation, avg. height of code blocks i've seen here is much greater. – Premature Optimization May 27 '11 at 20:15
  • @opc0de: I used another solution but your answer led me to this approach: Because I'd like the usage of all threads I monitor all threads using a separate thread during the whole program runtime: Enumerate all threads, Calculate that stack usage and save the maximum value. I finally got the stack usage for a thread via NtQueryInformationThread (from JclWin32.pas). I found out that my "largest" thread used 292KB. – Steffen Binas May 30 '11 at 11:58
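A rough sketch of the monitoring idea from the last comment, not Steffen's actual code. It assumes the commonly documented layout of THREAD_BASIC_INFORMATION and of the first three NT_TIB fields, imports NtQueryInformationThread straight from ntdll.dll instead of going through JclWin32.pas, and needs Delphi 2009 or later for NativeUInt. The record and function names are my own. Because the monitored threads live in the same process, the other thread's TEB can be read directly through the returned pointer.

```pascal
program ThreadStackProbe;
{$APPTYPE CONSOLE}
uses
  Windows;

type
  // Assumed layout of THREAD_BASIC_INFORMATION (info class 0).
  TClientId = record
    UniqueProcess: THandle;
    UniqueThread: THandle;
  end;

  TThreadBasicInformation = record
    ExitStatus: LongInt;
    TebBaseAddress: Pointer;
    ClientId: TClientId;
    AffinityMask: NativeUInt;
    Priority: LongInt;
    BasePriority: LongInt;
  end;

  // First three fields of NT_TIB, which sits at the start of the TEB.
  PTibHead = ^TTibHead;
  TTibHead = record
    ExceptionList: Pointer;
    StackBase: Pointer;  // highest stack address
    StackLimit: Pointer; // lowest committed stack page
  end;

function NtQueryInformationThread(ThreadHandle: THandle;
  InfoClass: LongWord; Info: Pointer; InfoLength: LongWord;
  ReturnLength: PLongWord): LongInt; stdcall; external 'ntdll.dll';

// Returns the committed stack size of a thread in this process,
// or 0 if the query fails. The handle needs query access, which
// TThread.Handle and GetCurrentThread both provide.
function CommittedStackOf(ThreadHandle: THandle): NativeUInt;
var
  Info: TThreadBasicInformation;
  Tib: PTibHead;
begin
  Result := 0;
  // A negative NTSTATUS signals failure.
  if NtQueryInformationThread(ThreadHandle, 0, @Info, SizeOf(Info), nil) < 0 then
    Exit;
  Tib := PTibHead(Info.TebBaseAddress);
  if Tib <> nil then
    Result := NativeUInt(Tib^.StackBase) - NativeUInt(Tib^.StackLimit);
end;

var
  Sample, MaxSeen: NativeUInt;
begin
  // A monitor thread would repeat this for every thread handle it
  // tracks and keep the maximum; only the main thread is sampled here.
  MaxSeen := 0;
  Sample := CommittedStackOf(GetCurrentThread);
  if Sample > MaxSeen then
    MaxSeen := Sample;
  Writeln('Largest committed stack seen: ', MaxSeen, ' bytes');
end.
```

Two caveats: the reading is only a snapshot, and querying a thread that has already terminated may touch a TEB that no longer exists, so a real monitor would need to guard against that.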
9

For the sake of completeness, I am adding a version of the CommittedStackSize function provided in opc0de's answer for determining the amount of used stack that will work both for x86 32- and 64-bit versions of Windows (opc0de's function is for Win32 only).

opc0de's function queries the address of the base of the stack and of the lowest committed stack page from Windows' Thread Information Block (TIB). There are two differences between x86 and x64:

  • TIB is pointed to by the FS segment register on Win32, but by the GS on Win64 (see here)
  • The absolute offsets of items in the structure differ (mostly because some items are pointers, i.e. 4 bytes and 8 bytes on Win32/64, respectively)

Additionally, note that there is a small difference in the BASM code: on x64, abs is required to make the assembler use an absolute offset from the segment register.

Therefore, a version that works on both Win32 and Win64 looks like this:

{$IFDEF MSWINDOWS}
function CommittedStackSize: NativeUInt;
//NB: Win32 uses FS, Win64 uses GS as base for Thread Information Block.
asm
 {$IFDEF WIN32}
  mov eax, [fs:04h] // TIB: base of the stack
  mov edx, [fs:08h] // TIB: lowest committed stack page
  sub eax, edx      // compute difference in EAX (=Result)
 {$ENDIF}
 {$IFDEF WIN64}
  mov rax, abs [gs:08h] // TIB: base of the stack
  mov rdx, abs [gs:10h] // TIB: lowest committed stack page
  sub rax, rdx          // compute difference in RAX (=Result)
 {$ENDIF}
end;
{$ENDIF}
PhiS
3

I remember that, years ago, I FillChar'd all available stack space with zeroes upon init and counted the contiguous zeroes upon deinit, starting from the end. This yielded a good 'high water mark', provided you put your app through its paces in the probe runs.

I'll dig out the code when I am back at a desktop.

Update: OK the principle is demonstrated in this (ancient) code:

{***********************************************************
  StackUse - A unit to report stack usage information

  by Richard S. Sadowsky
  version 1.0 7/18/88
  released to the public domain

  Inspired by an idea by Kim Kokkonen.

  This unit, when used in a Turbo Pascal 4.0 program, will
  automatically report information about stack usage.  This is very
  useful during program development.  The following information is
  reported about the stack:

  total stack space
  Unused stack space
  Stack space used by your program

  The unit's initialization code handles three things, it figures out
  the total stack space, it initializes the unused stack space to a
  known value, and it sets up an ExitProc to automatically report the
  stack usage at termination.  The total stack space is calculated by
  adding 4 to the current stack pointer on entry into the unit.  This
  works because on entry into a unit the only thing on the stack is the
  2 word (4 bytes) far return value.  This is obviously version and
  compiler specific.

  The ExitProc StackReport handles the math of calculating the used and
  unused amount of stack space, and displays this information.  Note
  that the original ExitProc (Sav_ExitProc) is restored immediately on
  entry to StackReport.  This is a good idea in ExitProc in case a
  runtime (or I/O) error occurs in your ExitProc!

  I hope you find this unit as useful as I have!

************************************************************}

{$R-,S-} { we don't need no stinkin range or stack checking! }
unit StackUse;

interface

var
  Sav_ExitProc     : Pointer; { to save the previous ExitProc }
  StartSPtr        : Word;    { holds the total stack size    }

implementation

{$F+} { this is an ExitProc so it must be compiled as far }
procedure StackReport;

{ This procedure may take a second or two to execute, especially }
{ if you have a large stack. The time is spent examining the     }
{ stack looking for our init value ($AA). }

var
  I                : Word;

begin
  ExitProc := Sav_ExitProc; { restore original exitProc first }

  I := 0;
  { step through stack from bottom; untouched bytes still hold $AA, }
  { so stop at the first byte that differs }
  while I < SPtr do
    if Mem[SSeg:I] <> $AA then begin
      { found the first overwritten byte, so report the stack usage info }
      WriteLn('total stack space : ',StartSPtr);
      WriteLn('unused stack space: ', I);
      WriteLn('stack space used  : ',StartSPtr - I);
      I := SPtr; { end the loop }
    end
    else
      inc(I); { look in next byte }
end;
{$F-}


begin
  StartSPtr := SPtr + 4; { on entry into a unit, only the FAR return }
                         { address has been pushed on the stack.     }
                         { therefore adding 4 to SP gives us the     }
                         { total stack size. }
  FillChar(Mem[SSeg:0], SPtr - 20, $AA); { init the stack   }
  Sav_ExitProc := ExitProc;              { save exitproc    }
  ExitProc     := @StackReport;          { set our exitproc }
end.

(From http://webtweakers.com/swag/MEMORY/0018.PAS.html)

I faintly remember having worked with Kim Kokkonen at that time, and I think the original code is from him.

The good thing about this approach is you have zero performance penalty and no profiling operation during the program run. Only upon shutdown the loop-until-changed-value-found code eats up CPU cycles. (We coded that one in assembly later.)
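The same principle can be sketched for a present-day Delphi thread. This is an adaptation under assumptions, not a port of the unit above: it assumes Delphi 2009 or later, and PaintSize, PaintStack, PaintedBytesTouched and Workload are names invented for the example. Instead of filling the whole stack, it paints a large local array, which is only safe if PaintSize is clearly smaller than the stack still available to the thread. Usage that was already on the stack before painting is not counted.

```pascal
program StackPaintDemo;
{$APPTYPE CONSOLE}

const
  PaintSize = 256 * 1024; // assumption: well below the remaining stack
  PaintByte = $AA;

threadvar
  PaintLow: Pointer; // lowest address of this thread's painted region

// Fills the stack area directly below the caller with PaintByte.
procedure PaintStack;
var
  Buf: array[0..PaintSize - 1] of Byte;
begin
  FillChar(Buf, SizeOf(Buf), PaintByte);
  PaintLow := @Buf[0];
end;

// Scans upward from the low end of the painted region and returns
// how many painted bytes lie at or above the first overwritten one.
// Call it from the same call depth as PaintStack.
function PaintedBytesTouched: NativeUInt;
var
  I: NativeUInt;
begin
  I := 0;
  while (I < PaintSize) and
        (PByte(NativeUInt(PaintLow) + I)^ = PaintByte) do
    Inc(I);
  Result := PaintSize - I;
end;

// Hypothetical workload that consumes some stack through recursion.
procedure Workload(Depth: Integer);
var
  Scratch: array[0..1023] of Byte;
begin
  FillChar(Scratch, SizeOf(Scratch), 0);
  if Depth > 0 then
    Workload(Depth - 1);
end;

begin
  PaintStack;
  Workload(32);
  Writeln('High water mark below this frame: ',
    PaintedBytesTouched, ' bytes');
end.
```

As in the original unit, the scan costs time only at the end of the run. One limitation is that a routine which reserves a large local buffer and happens to write only $AA bytes into it would be undercounted.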

TheBlastOne
1

Even if all 8 threads were to come close to using their 1MB of stack, that's only 8MB of virtual memory. IIRC, the default initial stack size for threads is 64K, increasing upon page-faults unless the process thread-stack limit is reached, at which point I assume your process will be stopped with a 'Stack overflow' messageBox :((

I fear that reducing the process stack limit $MAXSTACKSIZE will not alleviate your fragmentation/paging issue much, if at all. You need more RAM so that the resident page set of your mega-photo-app is bigger and thrashing is reduced.

How many threads are there, overall, on average, in your process? Task manager can show this.

Rgds, Martin

Martin James
0

Whilst I am sure that you can reduce the thread stack size in your app, I don't think it will address the root cause of the problem. You are using an 8-core machine now, but what happens on a 16-core or a 32-core machine?

With 32-bit Delphi you have a maximum address space of 4GB (and only 2GB by default, unless the executable is marked large-address-aware), so this does limit you to some degree. You may well need to use smaller stacks for some or all of your threads, but you will still face problems on a big enough machine.

To help your app scale better to larger machines, you may need to take one or both of the following steps:

  1. Avoid creating significantly more threads than cores. Use a thread pool architecture that is available to your plug-ins. Without the benefit of the .net environment to make this easy you will be best coding against the Windows thread pool API. That said, there must be a good Delphi wrapper available.
  2. Deal with the memory allocation patterns. If your threads are allocating contiguous blocks in the region of 200MB then this is going to cause undue stress on your allocator. I have found that it is often best to allocate such large amounts of memory in smaller, fixed size blocks. This approach works around the fragmentation problems you are encountering.
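The second step could look roughly like the sketch below. TChunkedBuffer, its 1 MB ChunkSize and the ByteAt accessor are invented for illustration; this is not an existing library class, and it assumes Delphi 2009 or later for NativeUInt. The point is only that a 200 MB image no longer needs 200 MB of contiguous address space.

```pascal
program ChunkedBufferDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

const
  ChunkSize = 1024 * 1024; // hypothetical fixed block size: 1 MB

type
  // A large logical buffer stored as many small, equally sized blocks.
  TChunkedBuffer = class
  private
    FChunks: array of Pointer;
    FSize: Int64;
  public
    constructor Create(ASize: Int64);
    destructor Destroy; override;
    function ByteAt(Index: Int64): PByte;
    property Size: Int64 read FSize;
  end;

constructor TChunkedBuffer.Create(ASize: Int64);
var
  I: Integer;
begin
  inherited Create;
  FSize := ASize;
  // Round up so that a partial last block is still covered.
  SetLength(FChunks, (ASize + ChunkSize - 1) div ChunkSize);
  for I := 0 to High(FChunks) do
    FChunks[I] := AllocMem(ChunkSize); // zero-filled block
end;

destructor TChunkedBuffer.Destroy;
var
  I: Integer;
begin
  for I := 0 to High(FChunks) do
    FreeMem(FChunks[I]); // FreeMem accepts nil entries
  inherited Destroy;
end;

// Maps a logical byte index to its address inside the owning block.
function TChunkedBuffer.ByteAt(Index: Int64): PByte;
begin
  Result := PByte(NativeUInt(FChunks[Index div ChunkSize]) +
    NativeUInt(Index mod ChunkSize));
end;

var
  Photo: TChunkedBuffer;
begin
  Photo := TChunkedBuffer.Create(200 * 1024 * 1024);
  try
    Photo.ByteAt(150000000)^ := 42;
    Writeln('Byte read back: ', Photo.ByteAt(150000000)^);
  finally
    Photo.Free;
  end;
end.
```

The price is that code expecting one flat pixel array has to go through an accessor like ByteAt or work block by block.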
David Heffernan
  • A thread pool is definitely the way to go for future development. As for splitting images into blocks: this would make any image manipulation code (based on Gr32) either not work or become much more complicated (imagine rendering text over a tile-based image). – Steffen Binas May 30 '11 at 08:50
0

Reducing $MAXSTACKSIZE won't work because Windows will always align the thread stack to 1 MB (?).

One (possible?) way to prevent fragmentation is to reserve (not allocate!) virtual memory (with VirtualAlloc) before creating the threads, and to release it once the threads are running. That way Windows cannot use the reserved space for the thread stacks, so you will have some contiguous memory left.

Or you could write your own memory manager for large photos: reserve a lot of virtual memory and allocate from this pool by hand (you need to maintain a list of used and free memory yourself).

At least, that's a theory, don't know if it really works...
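A minimal sketch of the reserve-then-release idea, equally untested against the real app. ReserveSize is a made-up figure and the thread and plugin start-up is left as a placeholder comment. MEM_RESERVE claims address space without committing any physical memory.

```pascal
program ReserveDemo;
{$APPTYPE CONSOLE}
uses
  Windows, SysUtils;

const
  ReserveSize = 256 * 1024 * 1024; // hypothetical: room for one large photo

var
  Placeholder: Pointer;
begin
  // Reserve address space only; no memory is committed yet.
  Placeholder := VirtualAlloc(nil, ReserveSize, MEM_RESERVE, PAGE_NOACCESS);
  if Placeholder = nil then
    RaiseLastOSError;
  try
    // Hypothetical step: create the worker threads and load the plugins
    // here, so that their stacks land outside the reserved range.
  finally
    // With MEM_RELEASE the size argument must be 0.
    VirtualFree(Placeholder, 0, MEM_RELEASE);
  end;
  Writeln('Placeholder range released');
end.
```

Nothing guarantees that the hole stays free after the release; any later allocation, including a stack for a newly created thread, may land inside it.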

André