In my Java system, I have X persons, and each person has Y strings, where Y >> X. I need to run a complex calculation on each string. To speed this up, I process the strings in separate threads (number of threads = CPU cores * 2). My question: should I also process each person in a separate thread, or is it enough to process only the strings in separate threads?

Should I process persons in separate threads in addition to the thread-based string processing? Or, since I already use what I assume is the optimal number of threads for the number of CPU cores, will I gain nothing by also putting the persons in separate threads?

All persons are independent of each other. All of a person's strings are independent of each other.

Mike
  • `threads number = CPU cores * 2` how did you come to this formula? – corazza Dec 02 '14 at 19:30
  • http://stackoverflow.com/a/10670440/462347, especially: "the optimal number was equal to the number of cores in the machine". I use "* 2" to make sure I have enough threads for multi-core and multi-threading CPUs. If you have other advice, please share it. – Mike Dec 02 '14 at 19:34
  • You cannot be sure what the best number is if you don't know how your threads behave: how often tasks complete, whether they are I/O-bound or CPU-bound, and so on. For example, for CPU-bound tasks it makes no sense to use more than one thread per core; anything more will be slower. – zapl Dec 02 '14 at 19:41
  • They are CPU-bound: each one calculates an NLP rank from the string's context. – Mike Dec 02 '14 at 19:43

4 Answers

I think creating additional threads can slow down processing because of the overhead of creating new threads. To be sure, run an experiment: try different numbers of threads and choose the one that performs best.

P.S. Like other people in this thread, I would recommend using a thread pool for this task.

P.P.S. Consider the java.util.concurrent executors: a fixed thread pool (Executors.newFixedThreadPool runs n threads, and additional tasks wait in a queue for a free thread) or a cached thread pool (Executors.newCachedThreadPool reuses an idle thread if one is available and otherwise creates a new one).

https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executors.html#newFixedThreadPool(int)
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool()
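As a rough illustration of the fixed-pool suggestion, here is a minimal sketch that submits one task per string, for all persons, to a single shared pool. The class name `FixedPoolSketch`, the `rank` method (a placeholder that just returns the string length), and the map-of-lists input shape are assumptions made for this example, not part of the original question. The pool size of one thread per available processor follows the comments about CPU-bound work and is something to measure rather than take as given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FixedPoolSketch {

    // Hypothetical stand-in for the real CPU-bound calculation on one string.
    static int rank(String text) {
        return text.length();
    }

    // Submits one task per string, for all persons, to a single fixed-size pool.
    static Map<String, List<Integer>> rankAll(Map<String, List<String>> stringsByPerson) {
        // Assumption: one worker per logical CPU, since the work is CPU-bound.
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            Map<String, List<Future<Integer>>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> entry : stringsByPerson.entrySet()) {
                List<Future<Integer>> futures = new ArrayList<>();
                for (String text : entry.getValue()) {
                    Callable<Integer> task = () -> rank(text);
                    futures.add(pool.submit(task));
                }
                pending.put(entry.getKey(), futures);
            }
            // Collect the results in submission order, grouped by person.
            Map<String, List<Integer>> ranks = new LinkedHashMap<>();
            for (Map.Entry<String, List<Future<Integer>>> entry : pending.entrySet()) {
                List<Integer> personRanks = new ArrayList<>();
                for (Future<Integer> future : entry.getValue()) {
                    personRanks.add(future.get());
                }
                ranks.put(entry.getKey(), personRanks);
            }
            return ranks;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("person-1", Arrays.asList("first string", "second string"));
        input.put("person-2", Arrays.asList("another string"));
        System.out.println(rankAll(input));
    }
}
```

Because every string goes to the same pool, no extra per-person threads are created; a person is only a grouping key here.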

Oleksandr Horobets

I am assuming first that, for performance reasons, the threads are native threads (not green threads). Passing object references into a thread has no real performance cost, other than the GC having to keep skipping those references during cleanup, and that is still more efficient than serializing and deserializing the objects into the thread.

Long story short: if you know that the running threads have a high utilization rate (i.e. they very rarely block on I/O, network, database calls, etc.), avoid creating more threads than the hardware can run at once. Otherwise you force the CPU to perform thread context switches, which are very expensive.

anonymous

I would likely create a thread pool with a configurable size that processes a queue of person objects.

This lets a thread access, update, and process an entire person's data without worrying about conflicts with other threads.

If the processing involves I/O, you might be able to increase the thread pool size; decrease it if you are over-utilising the CPU.
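A minimal sketch of this per-person approach might look like the following. The `Person` class, the `rank` placeholder, and the `processAll` method are hypothetical names introduced only for illustration; the pool size is a parameter so it can be tuned as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerPersonPoolSketch {

    // Hypothetical person type: a name, its strings, and the computed ranks.
    static final class Person {
        final String name;
        final List<String> strings;
        // Written only by the single worker thread that processes this person.
        final List<Integer> ranks = new ArrayList<>();

        Person(String name, List<String> strings) {
            this.name = name;
            this.strings = strings;
        }
    }

    // Hypothetical stand-in for the real calculation on one string.
    static int rank(String text) {
        return text.length();
    }

    // One task per person; each task handles all of that person's strings.
    static void processAll(List<Person> persons, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Person person : persons) {
                futures.add(pool.submit(() -> {
                    for (String text : person.strings) {
                        person.ranks.add(rank(text));
                    }
                }));
            }
            // Waiting on each Future also makes the workers' writes visible here.
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Person> persons = Arrays.asList(
                new Person("person-1", Arrays.asList("first string", "second string")),
                new Person("person-2", Arrays.asList("another string")));
        processAll(persons, Runtime.getRuntime().availableProcessors());
        for (Person person : persons) {
            System.out.println(person.name + " -> " + person.ranks);
        }
    }
}
```

The trade-off is granularity: with few persons and many strings each, a worker that finishes its person early cannot help with another person's remaining strings.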

Hope that helps

Chris
  • If you process on a per-user basis, wouldn't it be a shame that a `Thread` which finishes early can't help the others? – Jan Zyka Dec 02 '14 at 19:49
  • Each thread in a pool will process an entire person then move onto the next, until all persons are complete. – Chris Dec 02 '14 at 19:51
  • Well ok, depends on how big the person task is then. Yeah, it's really unclear ... My first thought was the `Person` task will be too BIG... – Jan Zyka Dec 02 '14 at 19:53
  • If the string processing is the bulk of the workload and you want to process more strings across users, then you will want the thread pool to process strings rather than entire persons. You will likely have one thread deciding which strings to process, i.e. iterating over the users, choosing the strings, and pushing them onto the queue. If the results need to be written back to the person object, you'll likely need to create a process record that contains the string and a link to the person object so that the worker can perform any updates. – Chris Dec 02 '14 at 19:57
  • In my specific case, there are 150 persons, and each person has about 3500 strings that I have to process. It is the Enron Email Dataset. – Mike Dec 02 '14 at 21:30
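
The "process record" idea from the comments (one string plus a link back to its person, pushed to a shared queue by a single producer) could be sketched roughly as below. `WorkItem`, `Person`, `rank`, and `processAll` are hypothetical names for this example, and the executor's internal queue plays the role of the work queue. Results go into a `ConcurrentHashMap` because several workers may update the same person at once; note that this particular map keys results by the string itself, so duplicate strings within one person collapse into a single entry.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class WorkItemSketch {

    // Hypothetical person type with a thread-safe result map.
    static final class Person {
        final String name;
        final List<String> strings;
        final Map<String, Integer> ranks = new ConcurrentHashMap<>();

        Person(String name, List<String> strings) {
            this.name = name;
            this.strings = strings;
        }
    }

    // The process record: one string plus a link to the person that owns it.
    static final class WorkItem implements Runnable {
        final Person owner;
        final String text;

        WorkItem(Person owner, String text) {
            this.owner = owner;
            this.text = text;
        }

        @Override
        public void run() {
            // Write the result back through the link to the owner.
            owner.ranks.put(text, rank(text));
        }
    }

    // Hypothetical stand-in for the real calculation on one string.
    static int rank(String text) {
        return text.length();
    }

    static void processAll(List<Person> persons) {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        // The calling thread is the single producer that fills the queue.
        for (Person person : persons) {
            for (String text : person.strings) {
                pool.execute(new WorkItem(person, text));
            }
        }
        pool.shutdown();
        try {
            // Arbitrary upper bound for the sketch; tune or replace as needed.
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("timed out waiting for workers");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<Person> persons = Arrays.asList(
                new Person("person-1", Arrays.asList("first string", "second string")),
                new Person("person-2", Arrays.asList("another string")));
        processAll(persons);
        for (Person person : persons) {
            System.out.println(person.name + " -> " + person.ranks);
        }
    }
}
```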
If processing each string takes on the order of 1 µs or more, you should be fine putting each string's processing in its own Runnable and passing that job to a thread pool with as many worker threads as you have logical CPUs. If the strings are processed faster than that, you should batch them so that there is less overhead in handling the job queue.
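
A minimal sketch of the batching variant, assuming a hypothetical `rank` placeholder and a caller-chosen `batchSize` (neither comes from the original question): each submitted task handles a slice of the list rather than a single string, so the job queue sees far fewer entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BatchingSketch {

    // Hypothetical stand-in for a very cheap per-string calculation.
    static int rank(String text) {
        return text.length();
    }

    // Ranks all strings, submitting one task per batch instead of per string.
    static List<Integer> rankInBatches(List<String> strings, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        // One worker per logical CPU, as suggested above.
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (int from = 0; from < strings.size(); from += batchSize) {
                int to = Math.min(from + batchSize, strings.size());
                List<String> batch = strings.subList(from, to);
                Callable<List<Integer>> task = () -> {
                    List<Integer> out = new ArrayList<>(batch.size());
                    for (String text : batch) {
                        out.add(rank(text));
                    }
                    return out;
                };
                futures.add(pool.submit(task));
            }
            // Concatenate batch results in submission order.
            List<Integer> ranks = new ArrayList<>(strings.size());
            for (Future<List<Integer>> future : futures) {
                ranks.addAll(future.get());
            }
            return ranks;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> strings = Arrays.asList("a", "bb", "ccc", "dddd", "eeeee");
        System.out.println(rankInBatches(strings, 2));
    }
}
```

A suitable batch size depends on how long one string takes, so it is worth measuring a few values rather than picking one up front.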

Ralf H