
I have a multi-threaded process. Each thread is CPU-bound (performs calculations) and also uses a lot of memory. The process starts at 100% CPU utilization according to Resource Monitor, but after several hours, CPU utilization slowly starts to degrade. After 24 hours, it's at 90-95% and falling.

The question is - what should I look for, and what best-known methods can I use to debug this?

Additional info:

I have enough RAM - most of it is unused at any given moment. According to perfmon, memory doesn't grow (so I don't think it's leaking). The code is a mix of .NET and native C++, with some data marshaling back and forth. I saw this on several different machines (servers with 24 logical cores). One thing I saw in perfmon: the Modified Page List Bytes counter increases over time as CPU utilization degrades.
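
For reference, here is a minimal sketch for logging the same two counters programmatically with the Windows PDH API, assuming the standard English counter paths "\Memory\Modified Page List Bytes" and "\Processor(_Total)\% Processor Time" (the one-minute interval is arbitrary):

    // Minimal PDH sketch: sample the Modified Page List size and total CPU
    // utilization once per minute so a long-running drift shows up in a log.
    #include <windows.h>
    #include <pdh.h>
    #include <cstdio>
    #pragma comment(lib, "pdh.lib")

    int main() {
        PDH_HQUERY query;
        PDH_HCOUNTER modified, cpu;
        PdhOpenQueryW(nullptr, 0, &query);
        PdhAddEnglishCounterW(query, L"\\Memory\\Modified Page List Bytes", 0, &modified);
        PdhAddEnglishCounterW(query, L"\\Processor(_Total)\\% Processor Time", 0, &cpu);
        PdhCollectQueryData(query);  // first sample primes the % Processor Time rate counter
        for (;;) {
            Sleep(60 * 1000);
            PdhCollectQueryData(query);
            PDH_FMT_COUNTERVALUE m, c;
            PdhGetFormattedCounterValue(modified, PDH_FMT_LARGE, nullptr, &m);
            PdhGetFormattedCounterValue(cpu, PDH_FMT_DOUBLE, nullptr, &c);
            printf("modified=%lld bytes, cpu=%.1f%%\n", m.largeValue, c.doubleValue);
        }
    }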

Edit 1

One of the third-party libraries in use is openfst. The degradation looks closely related to some misuse of that library. Specifically, I noticed that I get the following warning: warning LNK4087: CONSTANT keyword is obsolete; use DATA

Edit 2

Since the question is closed and wasn't reopened, I will write my findings and how the issue was solved in the body of the question (sorry) for future users. It turns out there is an openfst.def file that defines all the openfst FLAGS_* symbols to be used by consuming applications/DLLs. I had to fix those exports to use the keyword DATA instead of CONSTANT (CONSTANT is obsolete because it's risky - more info: https://msdn.microsoft.com/en-us/library/aa271769(v=vs.60).aspx). After that, no more degradation in CPU utilization was observed, and no more rise in the "Modified Page List Bytes" indicator. I suspect the problem was related to the default values of the FLAGS_* variables (specifically the garbage-collection flag FLAGS_fst_default_cache_gc), which were non-deterministic because of the misuse of the CONSTANT keyword in the openfst.def file.
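
For future readers, the change itself is small. The snippet below is an illustrative sketch rather than a copy of the real openfst.def (only FLAGS_fst_default_cache_gc is named above; any other content of the file is an assumption): in the EXPORTS section, the flag lines drop the obsolete CONSTANT keyword in favor of DATA. Per the linked MSDN page, CONSTANT exports the symbol through an extra level of indirection, so a client that reads the flag as a plain extern variable can end up seeing the address of its import-table entry instead of its value; DATA, combined with a __declspec(dllimport) declaration on the consuming side, imports the variable itself.

    ; openfst.def (illustrative sketch, not the full file)
    EXPORTS
        ; before: emits LNK4087 and exports the flag through an extra
        ; level of indirection, so consumers may read a garbage value
        ; FLAGS_fst_default_cache_gc CONSTANT
        ; after: export the variable itself as data
        FLAGS_fst_default_cache_gc DATA

This matches the suspicion above: with the exports mis-declared, the consuming DLL could read effectively arbitrary default values for flags such as FLAGS_fst_default_cache_gc.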

Conclusion

Understand your warnings! Eliminate as many of them as you can! Thanks.

lev haikin
  • Use a profiler, take a sample at the start and a sample when slow, compare the two. – Scott Chamberlain Nov 27 '15 at 20:19
  • Very random guess - memory fragmentation causes memory management to take longer time? – SergeyA Nov 27 '15 at 20:22
  • If CPU usage is less than 100%, it's because SOMETHING is blocking your process from running. Most likely you are actually running out of memory and need to swap. – Mats Petersson Nov 27 '15 at 20:23
  • @MatsPetersson there are 32GB of memory, while only ~10GB are consumed at any given moment... – lev haikin Nov 27 '15 at 20:50
  • @ScottChamberlain thanks for the advice. Do you have a recommendation for a specific profiler? I saw xperf can give a stack trace of managed+unmanaged execution paths. Any experience with that? – lev haikin Nov 27 '15 at 20:53
  • There are two reasons a process does not get 100% CPU: the OS blocks it (paging or shared-memory operations that need to be "swapped in"), or some sort of lock is held such that the process doesn't run. – Mats Petersson Nov 27 '15 at 21:01
  • @MatsPetersson ok, lock is a good advice - but how do I find it? Can a profiler help with that? – lev haikin Nov 27 '15 at 21:07
  • @MatsPetersson can paging happen even if most of memory is unused? What do you mean by shared memory operations? – lev haikin Nov 27 '15 at 21:08
  • Exactly which kind of memory use isn't growing according to perfmon? Your committed memory or your resident memory? If committed memory is growing and resident memory is not, your symptoms exactly fit a flaw in Windows' memory management. "Soft" paging can occur when most of memory is unused, and that results in significant waste of CPU time by the kernel, leaving less for the process. – JSF Nov 27 '15 at 21:13
  • Sorry, meant "shared memory" = "memory mapped file". – Mats Petersson Nov 27 '15 at 21:16
  • A good OS-aware profiler should be able to tell you the time your code spends waiting on OS-based waitable objects, which is pretty much the only way an application can wait (obviously reading a file, writing to a file, sleeping and such are also waitable operations deep down in the OS). – Mats Petersson Nov 27 '15 at 21:20
  • Look at the accepted answer to this question: http://stackoverflow.com/questions/5684365/what-causes-page-faults In the terminology of that answer, my guess at your problem is the kernel is rotating pages too often between your working set and the standby list. The info in that question and answer should tell you what to look at to see if my guess is correct. – JSF Nov 27 '15 at 21:22
  • @levhaikin "lock is a good advice - but how do I find it?" - try JetBrains dotTrace in "Timeline" mode. It's awesome for finding things like blocking GC, I/O operations or locking on synchronization events. https://www.jetbrains.com/profiler/help/Concurrency_Profiling_Timeline_.html – Ed Pavlov Nov 28 '15 at 12:16
  • @JSF thanks. The only thing that grows with time is "Modified Page List Bytes" indicator in perfmon. Committed bytes is not growing. – lev haikin Nov 28 '15 at 20:48
  • @levhaikin I also agree with Ed; Timeline mode in dotTrace is what I would recommend using. It has a free 30-day trial. – Scott Chamberlain Nov 30 '15 at 15:54
  • @ScottChamberlain does it support mixed code dlls (managed+native code)? – lev haikin Nov 30 '15 at 17:52
  • It does not go into as much detail; it is just an opaque blob while it is in native code, but you can see how long it spent in that native code. – Scott Chamberlain Nov 30 '15 at 20:28
  • I've edited the question, adding more specific details. In addition, it looks like I found the problem and fixed it. I would like to answer the question with complete details about the investigation and the fix. If you think it's important - please vote to reopen the question. – lev haikin Dec 08 '15 at 06:24

1 Answer


For a non-obvious issue like this, you should also use a profiler that actually samples the underlying hardware counters in the CPU. Most profilers that I'm familiar with use kernel-supplied statistics and not the underlying HW counters. This is especially true on Windows. (The reason is in part legacy, and in part that Windows wants its kernel statistics to be independent of the hardware. The PAPI APIs attempt to address this but are still relatively new.)

One of the best profilers is Intel’s VTune. Yes, I work for Intel but the internal HPC people use VTune as well. Unfortunately, it costs. If you’re a student, there are discounts. If not, there is a trial period.

You can find a lot of optimization and performance issue diagnosis information at software.intel.com. Here are pointers for optimization and for profiling. Even if you are not using an x86 architecture, the techniques are still valid.

As to what might be the issue, a degradation that slow is strange.

  • How often do you use new memory or access old memory? At what rate? If the rate is very slow, you might still be running into a situation where you are slowly using up a resource, e.g. pages (a small self-monitoring sketch follows this list).
  • What are your memory access patterns? Does it change over time? How rapidly? Perhaps your memory access patterns over time are spreading, resulting in more cache misses.
  • Perhaps your partitioning of the problem space is such that you have entered a new computational domain and there is no real pathology.
  • Look at whether there are periodic maintenance activities that take place over a longer interval, though these would result in a periodic degradation, say every 24 hours. This doesn't sound like your situation, since what you are experiencing is a gradual degradation.
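
To test the first bullet's guess about slowly consuming a resource such as pages, one cheap option is to have the process periodically log its own page-fault count and working-set size, so any slow upward drift shows up alongside the CPU numbers. Here is a minimal sketch using the Win32 GetProcessMemoryInfo call (the interval and the choice of fields are arbitrary):

    // Sketch: log page-fault count and working-set size for the current
    // process once per minute.
    #include <windows.h>
    #include <psapi.h>
    #include <cstdio>
    #pragma comment(lib, "psapi.lib")

    int main() {
        for (;;) {
            PROCESS_MEMORY_COUNTERS pmc = { sizeof(pmc) };
            if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
                printf("faults=%lu workingset=%zu bytes\n",
                       (unsigned long)pmc.PageFaultCount, pmc.WorkingSetSize);
            }
            Sleep(60 * 1000);
        }
    }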

If you are using an x86 architecture, consider submitting a question in an Intel forum (e.g. "Intel® Clusters and HPC Technology" and "Software Tuning, Performance Optimization & Platform Monitoring").

Let us know what you ultimately find out.

Taylor Kidd