
I'm profiling a multithreaded program running with different numbers of allowed threads. Here are the performance results of three runs of the same input work.

1 thread:
  Total thread time: 60 minutes.
  Total wall clock time: 60 minutes.

10 threads:
  Total thread time: 80 minutes. (Worked 33% longer)
  Total wall clock time: 18 minutes. (3.3x speedup)

20 threads:
  Total thread time: 120 minutes. (Worked 100% longer)
  Total wall clock time: 12 minutes. (5x speedup)

Since it takes more thread time to do the same work, I feel the threads must be contending for resources.

I've already examined the four pillars (CPU, memory, disk I/O, network) on both the app machine and the database server. Memory was the original contended resource, but that's fixed now (more than 1 GB free at all times). CPU hovers between 30% and 70% during the 20-thread test, so there's plenty of headroom. Disk I/O is practically nil on the app machine and minimal on the database server, and network utilization is low.

I've also profiled the code with RedGate and see no methods waiting on locks. It helps that the threads don't share instances. Now I'm checking more nuanced items like database connection establishment/pooling (if 20 threads attempt to connect to the same database, do they have to wait on each other?).
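To check that, I'm planning to model the pool wait with a toy sketch first (made-up numbers, not LinqToSql itself): a SemaphoreSlim stands in for a connection pool of 10, and 20 threads time how long they block acquiring it. The same Stopwatch-around-Open() pattern would apply to the real SqlConnection.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public class PoolWaitDemo
{
    const int PoolSize = 10;    // stand-in for the pool's Max Pool Size
    const int ThreadCount = 20; // stand-in for my worker threads

    // Returns total milliseconds all threads spent blocked
    // waiting to acquire a "connection" from the pool.
    public static long Run()
    {
        var pool = new SemaphoreSlim(PoolSize);
        long totalWaitMs = 0;
        var threads = new Thread[ThreadCount];
        for (int i = 0; i < ThreadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                var sw = Stopwatch.StartNew();
                pool.Wait();          // "open" a pooled connection
                sw.Stop();
                Interlocked.Add(ref totalWaitMs, sw.ElapsedMilliseconds);
                Thread.Sleep(50);     // "use" the connection
                pool.Release();
            });
        }
        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join();
        return Interlocked.Read(ref totalWaitMs);
    }

    static void Main()
    {
        Console.WriteLine("Total wait for the pool: {0} ms", Run());
    }
}
```

With a pool of 10 and 20 threads, the second wave of 10 each blocks roughly one 50 ms hold, so a non-trivial total wait here would confirm the "threads waiting on each other to connect" theory.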

I'm trying to identify and address the resource contention, so that the 20-thread run would look like this:

20 threads:
  Total thread time: 60 minutes. (Worked 0% longer)
  Total wall clock time: 6 minutes. (10x speedup)

What are the most likely sources (other than the big 4) that I should be looking at to find that contention?


The code that each thread performs is approximately:

Run ~50 compiled LinqToSql queries
Run ILOG Rules
Call WCF Service which runs ~50 compiled LinqToSql queries, returns some data
Run more ILOG Rules
Call another WCF service which uses devexpress to render a pdf, returns as binary data
Store pdf to network
Use LinqToSql to update/insert. DTC is involved: multiple databases, one server.

The WCF Services are running on the same machine and are stateless and able to handle multiple simultaneous requests.
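By "able to handle multiple simultaneous requests" I mean they're configured roughly like this (a sketch; the service and interface names are placeholders, and the throttle numbers are things to check against the actual config - the .NET 3.5 default of maxConcurrentCalls="16" would quietly queue callers 17 through 20):

```csharp
using System.ServiceModel;

[ServiceContract]
public interface IPdfRenderService
{
    [OperationContract]
    byte[] Render(string documentId);
}

// PerCall instancing + Multiple concurrency avoids serializing
// callers on a single service instance.
[ServiceBehavior(InstanceContextMode = InstanceContextMode.PerCall,
                 ConcurrencyMode = ConcurrencyMode.Multiple)]
public class PdfRenderService : IPdfRenderService
{
    public byte[] Render(string documentId)
    {
        // ... render and return the PDF bytes ...
        return new byte[0];
    }
}

// and in the host's config, raise the throttle above the thread count:
// <serviceThrottling maxConcurrentCalls="32"
//                    maxConcurrentInstances="32"
//                    maxConcurrentSessions="32" />
```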


The machine has 8 CPUs.

Amy B
  • suggest you post the code/algorithm you are trying to parallelise – Mitch Wheat Oct 20 '11 at 05:03
  • How many CPUs does your machine have? – Hand-E-Food Oct 20 '11 at 05:06
  • Your operations are mostly IO bound; are these IO calls asynchronous? If not, try making them async and then see if you get any benefit – Ankur Oct 20 '11 at 05:41
  • Ok all, I appreciate that instinct to help me "solve" the performance problem. At this stage, measuring the problem comes before the solution. I'm asking for a couple more things to measure. – Amy B Oct 20 '11 at 05:56
  • take a few stack dumps during the process and you're likely to see the contention yourself. – bestsss Oct 20 '11 at 06:25
  • how have you set up your wcf services? are you using Single/PerSession/PerCall instance context mode and what are the settings for concurrent calls/instances/session? – theburningmonk Oct 20 '11 at 09:31

3 Answers

3

What you describe is that you want 100% scalability, i.e. a 1:1 relation between the increase in threads and the decrease in wall-clock time... this is usually the goal but hard to reach...

For example, you write that there is no memory contention because there is 1 GB free... this is IMHO a wrong assumption... memory contention also means that if two threads try to allocate memory, one may have to wait for the other... another point to keep in mind is the pauses caused by the GC, which temporarily freezes all threads... the GC can be customized a bit via configuration (gcServer) - see http://blogs.msdn.com/b/clyon/archive/2004/09/08/226981.aspx
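A minimal app.config sketch of that setting (server GC gives each CPU its own heap segment and GC thread, which reduces allocation contention at the cost of more memory):

```xml
<configuration>
  <runtime>
    <!-- default is workstation GC; server GC favors throughput
         on multi-CPU boxes like your 8-CPU machine -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```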

Another point is the WCF services being called... if one of them can't scale up - the PDF rendering, for example - then that too is a form of contention...

The list of possible contention sources is "endless"... and it is rarely the obvious areas you mentioned...

EDIT - as per comments:

Some points to check:

  • connection pooling
    What provider do you use? How is it configured?
  • PDF rendering
    possible contention would be measured somewhere inside the library you use...
  • Linq2SQL
    Check the execution plans for all these queries... some may take locks and thus possibly create contention on the DB server side...

EDIT 2:

Threads
Are these threads from the ThreadPool? If so, then you won't scale :-(

EDIT 3:

ThreadPool threads are bad for long-running tasks, which is the case in your scenario... for details see

From http://www.yoda.arachsys.com/csharp/threads/printable.shtml

Long-running operations should use newly created threads; short-running operations can take advantage of the thread pool.
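A sketch of what I mean (the work item is a placeholder; the point is the dedicated Thread objects instead of ThreadPool.QueueUserWorkItem, whose slow thread-injection rate can throttle a 20-way fan-out):

```csharp
using System;
using System.Threading;

public class DedicatedThreads
{
    // Spawns one dedicated thread per long-running work item and
    // waits for all of them. Returns the number that completed.
    public static int RunAll(int workerCount)
    {
        int completed = 0;
        var threads = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            int id = i; // capture a copy of the loop variable
            threads[i] = new Thread(() =>
            {
                DoWork(id);
                Interlocked.Increment(ref completed);
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return completed;
    }

    static void DoWork(int id)
    {
        // placeholder for the real per-batch pipeline
        // (queries, rules, WCF calls, ...)
        Thread.Sleep(10);
    }

    static void Main()
    {
        Console.WriteLine("{0} workers finished", RunAll(20));
    }
}
```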

If you want extreme performance then it could be worth checking out CQRS and the real-world example described as LMAX.

Yahia
  • Only going for a 50% speedup (20 threads, 10x speedup). I'm lucky in that the PDF rendering will soon be dropped from the process. Still, if threads are waiting on that pdf rendering there should be some measurable resource that I could look at to determine it, right? I'll definitely find some way to measure the GC in redgate. I'm not asking for the endless list, just the "next couple of items to check" list. – Amy B Oct 20 '11 at 05:27
  • to measure the resource regarding PDF rendering you would have to checkout something inside the library used for the rendering... another point is how your DB connections/connection pooling is handled - what provider/configuration do you use ? Also checkout the execution plans (you can do that with some DB-server-side tool) for all those Linq2SQL used – Yahia Oct 20 '11 at 05:33
  • I'd only have to look inside the library if it was locking, correct? If it was a memory hog I wouldn't have to look in the library - I'd look at a metric on the machine. I use the default connection pooling - whatever happens when LinqToSql's DataContext manages the connection. I live and breathe execution plans - that's already done. – Amy B Oct 20 '11 at 05:40
  • some PDF libraries are technically written so that they run only on one thread - I don't know devexpress... but I have tried a lot of PDF libraries and wouldn't bet that it is always a lock... since I never use Linq2SQL for performance-sensitive things I can't tell you much about connection pooling in this context... are you using SQL Server or Oracle or ? Are the threads you use from ThreadPool ? – Yahia Oct 20 '11 at 05:43
  • ThreadPool threads shouldn't be used for long-running tasks (like in this case)... see my EDIT and the links there... – Yahia Oct 20 '11 at 06:06
  • +1 for incredible microsoft support link about WCF thread creation. It also describes what performance metrics to watch - very helpful. – Amy B Oct 20 '11 at 06:15
2

Instead of measuring the total thread time, measure the time for each of the operations you do that involve I/O of some sort (database, disk, network, etc.).

I suspect you are going to find that these operations are the ones that take longer when you have more threads, because the contention is at the other end of that I/O. For example, your database might be serializing requests to preserve data consistency.
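For example, something like this wrapper (a sketch; the step names are placeholders for your actual pipeline stages):

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

public static class StepTimer
{
    static readonly ConcurrentDictionary<string, long> Totals =
        new ConcurrentDictionary<string, long>();

    // Wraps one I/O step and accumulates its elapsed time under a
    // name, safely across threads.
    public static T Time<T>(string step, Func<T> work)
    {
        var sw = Stopwatch.StartNew();
        try { return work(); }
        finally
        {
            sw.Stop();
            Totals.AddOrUpdate(step, sw.ElapsedMilliseconds,
                               (_, total) => total + sw.ElapsedMilliseconds);
        }
    }

    public static long TotalFor(string step)
    {
        long ms;
        return Totals.TryGetValue(step, out ms) ? ms : 0;
    }
}
```

Usage would be `StepTimer.Time("wcf-pdf", () => client.RenderPdf(request))` around each stage; comparing each stage's total between the 1-thread and 20-thread runs shows exactly which external resource is eating the extra 60 minutes of thread time.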

Miguel Grinberg
  • +1 for database lock contention. I wish I knew a good way to measure it. – Amy B Oct 20 '11 at 14:15
  • You can run your single-thread test and measure all the individual db operations, then add those together. Let's say that out of the 60 minutes, a total of 5 were spent on db access. Now you know there are 5 minutes that you will always have, no matter how efficient you make your process. So you should think about this as an M+N kind of thing, with M being processing time, and N being time spent accessing shared resources outside of your control. You can improve M by multi-threading, but there is nothing you can do about N; that part is fixed. – Miguel Grinberg Oct 20 '11 at 15:51
2

Yes, there's resource contention. All the threads have to read/write data over the same memory bus, directed to the same RAM modules, for example. It doesn't matter how much RAM is free; it matters that the reads/writes are carried out by the same memory controller on the same RAM modules, and that the data travels over the same bus.

If there's any kind of synchronization anywhere, then that too is a contended resource. If there's any I/O, that's a contended resource.

You're never going to see an Nx speedup when going from 1 to N threads. It's not possible because, ultimately, everything in the CPU is a shared resource on which there will be some degree of contention.

There are plenty of factors preventing you from getting the full linear speedup. You're assuming that the database, the server the database is running on, the network connecting it to the client, the client computer, the OS and drivers on both ends, the memory subsystem, disk I/O and everything in between is capable of just going 20 times faster when you go from 1 to 20 threads.

Two words: dream on.

Each of these bottlenecks only has to slow you down by a few percent for the overall result to be something like what you're seeing.

I'm sure you can tweak it to scale a bit better, but don't expect miracles.

But one thing you might look for is cache line sharing. Do threads access data that is very close to the data used by other threads? How often can you avoid that occurring?
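A toy demonstration of the effect (a sketch; absolute timings depend on your hardware): threads incrementing adjacent array slots share a 64-byte cache line and fight over it, while slots padded 16 longs apart each get their own line.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public class FalseSharingDemo
{
    const int Iterations = 10000000;
    const int ThreadCount = 4;

    // stride 1 puts all counters on the same cache line;
    // stride 16 (128 bytes of longs) gives each thread its own line.
    public static long Run(int stride)
    {
        var counters = new long[ThreadCount * stride];
        var threads = new Thread[ThreadCount];
        for (int i = 0; i < ThreadCount; i++)
        {
            int slot = i * stride;
            threads[i] = new Thread(() =>
            {
                for (int n = 0; n < Iterations; n++) counters[slot]++;
            });
        }
        var sw = Stopwatch.StartNew();
        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        Console.WriteLine("adjacent slots: {0} ms", Run(1));
        Console.WriteLine("padded slots:   {0} ms", Run(16));
    }
}
```

On most multi-core x86 boxes the padded version is noticeably faster; if your per-thread working data is allocated in one tight batch, you may be paying this tax without any visible lock.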

jalf
  • I'm currently at N/4 speedup, looking for N/2 speedup. Threads are handed a request object with a batch of identifiers - from there they use those identifiers to load their own data to work with. – Amy B Oct 20 '11 at 14:12
  • Your point about DB IO, DB CPU, DB Memory, Network capacity, Network ping, Client CPU, Client Memory, Client disk IO going 20 times faster: already checked those things - they aren't at capacity. I don't expect them to go 20 times faster, I expect to use more of them by continually using all of them, instead of taking turns. They're just sitting there below capacity. Something else is at capacity and that's what I'm looking for. – Amy B Oct 20 '11 at 14:13