I'm profiling a multithreaded program running with different numbers of allowed threads. Here are the performance results of three runs of the same input work.
1 thread:
Total thread time: 60 minutes.
Total wall clock time: 60 minutes.
10 threads:
Total thread time: 80 minutes. (Worked 33% longer)
Total wall clock time: 18 minutes. 3.3 times speed up
20 threads
Total thread time: 120 minutes. (Worked 100% longer)
Total wall clock time: 12 minutes. 5 times speed up
Since it takes more thread time to do the same work, I feel the threads must be contending for resources.
I've already examined the four pillars (cpu, memory, diskIO, network) on both the app machine and the database server. Memory was the original contended resource, but that's fixed now (more than 1G free at all times). CPU hovers between 30% and 70% on the 20 thread test, so plenty there. diskIO is practically none on the app machine, and minimal on the database server. The network is really great.
I've also code-profiled with redgate and see no methods waiting on locks. It helps that the threads are not sharing instances. Now I'm checking more nuanced items like database connection establishing/pooling (if 20 threads attempt to connect to the same database, do they have to wait on each other?).
I'm trying identify and address the resource contention, so that the 20 thread run would look like this:
20 threads
Total thread time: 60 minutes. (Worked 0% longer)
Total wall clock time: 6 minutes. 10 times speed up
What are the most likely sources (other than the big 4) that I should be looking at to find that contention?
The code that each thread performs is approximately:
Run ~50 compiled LinqToSql queries
Run ILOG Rules
Call WCF Service which runs ~50 compiled LinqToSql queries, returns some data
Run more ILOG Rules
Call another WCF service which uses devexpress to render a pdf, returns as binary data
Store pdf to network
Use LinqToSql to update/insert. DTC is involved: multiple databases, one server.
The WCF Services are running on the same machine and are stateless and able to handle multiple simultaneous requests.
Machine has 8 cpu's.