
I'm developing an app that scans thousands of copies of a struct, ~1 GB of RAM in total. Speed is important.

     ParallelScan(_from, _to);  //In a new thread

I manually adjust the thread count:

     if (myStructs.Count == 0) { threads = 0; }
     else if (myStructs.Count < 1 * Number.Thousand) { threads = 1; }
     else if (myStructs.Count < 3 * Number.Thousand) { threads = 2; }
     else if (myStructs.Count < 5 * Number.Thousand) { threads = 4; }
     else if (myStructs.Count < 10 * Number.Thousand) { threads = 8; }
     else if (myStructs.Count < 20 * Number.Thousand) { threads = 12; }
     else if (myStructs.Count < 30 * Number.Thousand) { threads = 20; }
     else if (myStructs.Count < 50 * Number.Thousand) { threads = 30; }
     else threads = 40;

I just wrote this from scratch, and I would need to modify it for every other CPU, etc. I think I could write smarter code that dynamically starts a new thread when the CPU has spare capacity at that moment:

  • If the CPU is not at 100%, start N threads
  • Measure CPU load or thread processing time & adjust/estimate N
  • Loop until the whole struct array is scanned (see the sketch below)
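
Roughly, something like this (a hypothetical sketch only, not working code; `batchSize` and the 95% threshold are made-up tuning knobs):

    // needs System.Diagnostics and System.Threading
    var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total");
    cpu.NextValue();                          // the first reading is always 0
    int batchSize = 1000;                     // tuning knob: structs per worker
    int next = 0;                             // next struct index to hand out

    while (next < myStructs.Count)
    {
        Thread.Sleep(100);                    // give the counter time to sample
        if (cpu.NextValue() < 95f)            // CPU has headroom: start a worker
        {
            int from = next;
            int to = Math.Min(next + batchSize, myStructs.Count) - 1;
            next = to + 1;
            new Thread(() => ParallelScan(from, to)).Start();
        }
    }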

Is there anyone who thinks "I did something similar" or "I have a better idea"?

UPDATE: The solution

    Parallel.For(0, myStructs.Count, (x) =>   // note: the upper bound is exclusive, so Count, not Count - 1
    {
         ParallelScan(x, x); // Will be ParallelScan(x);
    });

I trimmed tons of code. Thanks, everyone!

UPDATE 2: Results

Scan time for 10K templates

  • 1 Thread: 500 ms
  • 10 Threads: 300 ms
  • 40 Threads: 600 ms
  • Tasks: 100 ms (see the sketch below)
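
The Task-based code isn't shown above, so here is a minimal sketch of what it might look like (an illustration, not necessarily the exact code that was measured; it assumes `ParallelScan(x)` scans a single template):

    // needs System, System.Collections.Generic and System.Threading.Tasks
    int count = myStructs.Count;
    int partitions = Environment.ProcessorCount;
    int chunk = (count + partitions - 1) / partitions;    // ceiling division

    var tasks = new List<Task>();
    for (int p = 0; p < partitions; p++)
    {
        int from = p * chunk;                             // captured per iteration
        int to = Math.Min(from + chunk, count);           // exclusive upper index
        tasks.Add(Task.Factory.StartNew(() =>
        {
            for (int i = from; i < to; i++)
                ParallelScan(i);                          // one template per call
        }));
    }
    Task.WaitAll(tasks.ToArray());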
  • Unless you have a very, very, very good reason to think that scanning these structs will take any more than a handful of microseconds, and that it really, really, really matters, it's not a good idea to do this kind of optimisation. If you really want to do it, you should have one thread per core. But really - don't. If it's just 50,000 structs and you're doing something simple with them, don't bother. – Kieren Johnstone Aug 02 '11 at 11:24
  • Additionally, do you have a constant called `Number.Thousand` that is defined as `1000`? Just use `1000`, or call it something other than `Number.Thousand`; it seems very redundant. – Kieren Johnstone Aug 02 '11 at 11:25
  • FYI, starting a new thread takes a good amount of time (a measurable part of a second, several milliseconds). – Kieren Johnstone Aug 02 '11 at 11:26
  • What .NET version are you working with? – Jalal Said Aug 02 '11 at 11:27
  • @Kieren: It's Number.Thousand because it's easy to switch from Million to Thousand or Ten, plus it's easy to read. 10000 and 100000 look similar in a 22" IDE. – Nime Cloud Aug 02 '11 at 11:37
  • @Nime: by "easy to switch", you mean changing Number.Thousand to return 1e6, for example? – vgru Aug 02 '11 at 11:39
  • I need this optimization because I develop on an AMD Phenom II X4 CPU and test on an Atom. Maybe tomorrow it will run on another CPU. – Nime Cloud Aug 02 '11 at 11:39
  • @Groo: What the hell is 1e6? I'm not an astronaut :D You are right, but it's still the easiest way for me. – Nime Cloud Aug 02 '11 at 11:42
  • @Nime - if you change what `Number.Thousand` returns from `1000` then it shouldn't be called `Thousand` anymore. In which case, either refer to `1000` directly or rename `Number.Thousand` to be called what it actually represents - some multiplier constant or something. Don't call it `Thousand`! – Kieren Johnstone Aug 02 '11 at 11:43
  • If it's hard to read, use `_`. For example, `int myVal = 1_000_000;` works just fine for a million. – Kieren Johnstone Aug 02 '11 at 11:44
  • @Nime - you're missing the point re: your comment about 'different CPUs'. How long does this operation take? It's very unlikely that it's useful for you to optimize multithreading like this; it will give you the smallest improvement. A better improvement will be gained from a better algorithm, or from not having to depend on this weird invented multithreading scheme. – Kieren Johnstone Aug 02 '11 at 11:45
  • @Jalal: 3.5 / I can switch to 4.0 if necessary. – Nime Cloud Aug 02 '11 at 11:46
  • @Nime: if you can switch to 4.0, then simply use the TPL. – vgru Aug 02 '11 at 11:50
  • @Nime: also, you could get additional recommendations if you roughly describe how your data is organized. Search algorithms depend heavily on the way data is organized. – vgru Aug 02 '11 at 11:53
  • Like @Groo and @Henk suggested, I would make the move to 4.0 and use the parallel library. Note that `AsParallel()`, for example, will automatically take your processor count and other considerations into account and create threads as needed; you can also tune it for your requirements. – Jalal Said Aug 02 '11 at 17:11

3 Answers


The standard answer: use Tasks (TPL), not Threads. Tasks require .NET 4.

Your `ParallelScan` could just use `Parallel.ForEach(...)` or PLINQ (`.AsParallel()`).

The TPL framework includes a scheduler, and `ForEach()` uses a partitioner, to adapt to CPU cores and load. Your problem is most likely solved with the standard components, but you can write custom schedulers and partitioners.
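
For example (a minimal sketch, not from the original answer; `Scan`, `IsMatch` and `MyStruct` are hypothetical stand-ins for the per-struct work):

    // needs System.Linq and System.Threading.Tasks
    // Let the TPL partition the collection and pick the degree of parallelism:
    Parallel.ForEach(myStructs, s => Scan(s));

    // Or the PLINQ equivalent, e.g. when looking for matches:
    List<MyStruct> matches = myStructs.AsParallel()
                                      .Where(s => IsMatch(s))
                                      .ToList();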

H H
  • Plus "...The creation time of Threads to do the same job as the Tasks is far greater:" http://www.codeproject.com/KB/cs/TPL1/ThreadsVTasks.png – Nime Cloud Aug 02 '11 at 20:05
  • @Nime yeah, I forgot to mention that, thought it was obvious. TPL runs on top of the ThreadPool. – H H Aug 04 '11 at 07:38

Actually, you won't get much benefit from spawning 50 threads if your CPU only has two cores (even if each of them supports hyperthreading). It will actually run slower, due to the context switching which will occur every once in a while.

That means you should go for the Task Parallel Library (.NET 4), which ensures that all available cores are used efficiently.

Apart from that, improving the asymptotic complexity of your search algorithm might prove more valuable for large quantities of data, regardless of Moore's law.

[Edit]

If you are unable/unwilling to use the .NET 4 TPL, you can start by getting the number of logical processors in the system (use Environment.ProcessorCount or check this answer for detailed info). Based on that number, you can partition your data and spawn a fixed number of threads. That is much simpler than checking the CPU utilization, and should prevent creating unnecessary threads which are starved anyway.
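
For example (a rough sketch under .NET 3.5, assuming `ParallelScan(from, to)` scans an inclusive index range as in the question):

    // needs System, System.Collections.Generic and System.Threading
    int threadCount = Environment.ProcessorCount;         // logical processors
    int count = myStructs.Count;
    int chunk = (count + threadCount - 1) / threadCount;  // ceiling division

    var workers = new List<Thread>();
    for (int t = 0; t < threadCount; t++)
    {
        int from = t * chunk;                             // captured per iteration
        int to = Math.Min(from + chunk, count) - 1;       // inclusive upper index
        if (from > to) break;                             // no work left
        var worker = new Thread(() => ParallelScan(from, to));
        workers.Add(worker);
        worker.Start();
    }
    workers.ForEach(w => w.Join());                       // wait for all partitions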

vgru

OK, sorry to keep going on, but first, to compile my comments:

  • Unless you have a very, very, very good reason to think that scanning these structs will take any more than a handful of microseconds, and that it really, really, really matters, it's not a good idea to do this kind of optimisation. If you really want to do it, you should have one thread per core. But really - don't. If it's just 50,000 structs and you're doing something simple with them, don't bother.
  • FYI, starting a new thread takes a good amount of time (a measurable part of a second, several milliseconds).
  • How long does this operation take? It's very unlikely that it's useful for you to optimize multithreading like this; it will give you the smallest improvement. A better improvement will be gained from a better algorithm, or from not having to depend on this weird invented multithreading scheme.

I'm confused about your performance fixation, partly because you say you're looking through 50,000 structs (a very quick and easy operation) and partly because you're using structs. Without boxing, a struct is a value type, and if you're passing structs between threads you're copying data rather than references, i.e. using more memory. My point being that that's a lot of data/memory, unless the structs are small; in which case, what kind of processing can you possibly be doing on them that takes so long as to make 40+ parallel threads worth thinking about?
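
To illustrate the copying semantics (a hypothetical example, not code from the question):

    using System;

    class Demo
    {
        // Imagine ~15 KB of fields here; even a tiny struct shows the behaviour.
        struct Template { public int Id; }

        static void Process(Template t) { t.Id = 42; }  // mutates a local copy only

        static void Main()
        {
            var template = new Template { Id = 1 };
            Process(template);
            Console.WriteLine(template.Id);  // prints 1: Process received a copy
        }
    }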

If performance is truly, incredibly important and is your goal, and you're not simply doing this as a nice engineering exercise, please share information about what kind of processing you're doing.

Kieren Johnstone
  • It's a face recognition process. Client side: a webcam takes a shot, extracts the face, and generates a 15 KB "template". Server side: the app scans every single enrolled template (struct) for a match. Right now I have 30 people, with 20 templates per person on average, making 500 templates in total. We aim at 5000 people/server. – Nime Cloud Aug 02 '11 at 13:11
  • Currently the server responds in 50 ms on our LAN. I duplicated the templates to more than 50K, and the scan time was > 3 seconds. – Nime Cloud Aug 02 '11 at 13:19
  • That is an awesome example of an actual usable case for what you're asking, many thanks indeed :) Facial recognition processing is processor-intensive and **well worth** considering scaling out as you are. Out of interest, are you going for the perceptron NN route, doing any matching/alignment/deskewing, or just comparing pixel-by-pixel? – Kieren Johnstone Aug 02 '11 at 13:41
  • @Nime: one clarification: each set of "templates" is sent to the server from a web client. By 50 ms LAN response, are you accounting for the entire roundtrip? Does the 3 s then correspond to actual processing time, or to the time to finish a bunch of roundtrips? Also, you do know that every web request gets its own thread from the thread pool in ASP.NET? – vgru Aug 03 '11 at 07:56
  • @Groo: The 50 ms includes send & receive time. As you can see, it's not so important when the scan time is ~3 seconds. One thread at 3 seconds is not important either. I only send one live template to the server; then the server looks at the template array/archive (where the tons of structs reside) for a match. Thanks anyway. – Nime Cloud Aug 03 '11 at 10:29