
This is a general programming question. Let's say I have a thread running a simulation where speed is quite important. At every iteration I want to extract data from it and write it to a file.

Is it better practice to hand the data over to a different thread and let the simulation thread focus on its job, or, since speed is very important, to have the simulation thread do the data recording too, without any copying of data? (In my case it is 3-5 deques of integers, each with a size of 1,000-10,000.)

Firstly, it surely depends on how much data we are copying, but what else can it depend on? Can the cost of synchronization and copying be worth it? Is it good practice to create small runnables at each iteration to handle the recording task when there are 50 or more iterations per second?

Ryan Marv
  • 50 x 10,000 = 500,000 per second. That's hardly much effort for the CPU if it's just about copying the references into another structure. – Marko Topolnik Sep 06 '14 at 10:11
  • Why do you want to write the data while the simulation thread is running, any specific reason? – Benjamin Albert Sep 06 '14 at 10:19
  • Because this way I wouldn't have to synchronize the access to the collections, it may be faster if the simulation thread does the copying and the write thread takes it from there. Maybe not.. I'm curious how most people do this, or what the general approach is. – Ryan Marv Sep 06 '14 at 10:28

2 Answers


If you truly want low latency on this stat capturing, and you want it during the simulation itself, then two techniques come to mind. They can be used together very effectively. Please note that these two approaches are fairly far off the well-trodden Java path, so measure first and confirm that you need these techniques before abusing them; they can be difficult to implement correctly.

  1. The fastest way to write the data to a file during a simulation, without slowing down the simulation, is to hand the work off to another thread. However, care has to be taken over how the hand-off occurs, as a full memory barrier in the simulation thread will slow the simulation. Given that the writer only cares that the values arrive eventually, I would consider using the memory barrier that sits behind AtomicLong.lazySet: it requests a thread-safe write out to a memory address without blocking until the write actually becomes visible to the other thread. Unfortunately, direct access to this memory barrier is currently only available via lazySet or via the class sun.misc.Unsafe, which obviously is not part of the public Java API. That should not be too large a hurdle, though, as it is present on all current JVM implementations, and Doug Lea has talked about moving parts of it into the mainstream.

  2. To avoid the slow, blocking file I/O that Java uses, make use of a memory-mapped file. This lets the OS perform asynchronous I/O on your behalf, and is very efficient. It also supports use of the same memory barrier mentioned above.
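As a rough illustration of the hand-off in point 1 (the class and method names here are invented for this sketch, not taken from any library): the simulation thread publishes a filled buffer with `lazySet`, which issues only an ordered store rather than a full fence, and the writer thread polls for it.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a lazySet-based hand-off between one simulation
// thread (producer) and one writer thread (consumer).
class LazyHandoff {
    private final AtomicReference<int[]> mailbox = new AtomicReference<>();

    // Called by the simulation thread once a buffer is complete.
    void publish(int[] filledBuffer) {
        // lazySet issues an ordered store, not a full memory fence: the
        // write becomes visible to the writer thread "eventually", without
        // stalling the simulation thread.
        mailbox.lazySet(filledBuffer);
    }

    // Called by the writer thread; returns null if nothing new has arrived.
    int[] take() {
        return mailbox.getAndSet(null);
    }
}
```

Whether this actually beats a plain `volatile` field depends on how hot the publish path is; measure before committing to it.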

For examples of both techniques, I strongly recommend reading the source code of HFT Chronicle by Peter Lawrey. In fact, HFT Chronicle may be just the library for you to use here: it offers a highly efficient, simple-to-use, disk-backed queue that can sustain a million or so messages per second.
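Point 2, the memory-mapped file, can be sketched with nothing but the public `java.nio` API (the class here is invented for illustration; a production version would need bounds checking and a sizing strategy):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch only: map a fixed-size region of a stats file and write ints
// directly into it. The OS flushes dirty pages to disk asynchronously,
// so writeStat() is a plain memory write with no syscall per value.
class MappedStatsWriter {
    private final MappedByteBuffer buffer;

    MappedStatsWriter(String file, int capacityBytes) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(file),
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping remains valid after the channel is closed.
            buffer = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacityBytes);
        }
    }

    void writeStat(int value) {
        buffer.putInt(value);           // sequential write at the current position
    }

    int readStat(int index) {
        return buffer.getInt(index * Integer.BYTES);  // absolute read
    }
}
```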

Chris K
  • Great answer, thanks, I'm gonna dig myself into this. Although would you just maintain one writer thread, or make a new one at every iteration with a quick purpose? Because if the writer thread wouldn't finish before the next iteration's data flow comes then its data writing tasks may queue up and further complications will occur. On the other hand if you create a new thread at each iteration they themselves can work in parallel. – Ryan Marv Sep 06 '14 at 11:15
  • The memory barrier business only makes sense if you don't mutate the data structure after publishing to the other thread, and that seems to be the reason why OP intended to copy the structure. So the most important advice would be to make the simulation thread 1. fill a structure, 2. hand it off to the writing thread, 3. create a new structure to fill with more data. – Marko Topolnik Sep 06 '14 at 11:23
  • Also, whether handing off to another thread actually helps performance is highly dependent on the CPU core count vs. the count of active threads on the JVM. A state-of-the-art simulation would be expected to leverage multicore computing for its own purposes (data-parallel computation is usually dominant in simulations). – Marko Topolnik Sep 06 '14 at 11:26
  • @OP: if your writing thread falls behind the simulation, then your cause is lost because you can't improve I/O speed by throwing more CPU at it. – Marko Topolnik Sep 06 '14 at 11:28
  • On memory-mapped files: even regular, blocking I/O makes use of the system cache and the disk writes are performed asynchronously. – Marko Topolnik Sep 06 '14 at 11:29
  • @MarkoTopolnik a fair point, SSDs especially go async under the hood, that said memory mapping files in Java combined with direct memory access is 10x or more faster than going via OutputStream/Writers/Channel alternatives. – Chris K Sep 06 '14 at 12:00
  • @MarkoTopolnik you raise another good point regarding falling behind I/O speed. I was working on the assumption that the stats would be overwriting previous stats, as the JVM does in one of its status files. Thus changes that are lost in time may not make it to disk, which would be fine. In the case of it acting more as a log, then batching would kick in and only help up to a point. So good point. If the OP did mean a log of appending stats, then use of HFT Chronicle directly would be an excellent choice (and simpler than rolling one's own mechanism). – Chris K Sep 06 '14 at 12:06
  • I think the speed comes not so much from memory mapping as from a better, lower-level implementation. Classic I/O does a lot of redundant buffer-to-buffer copying. We should better compare the performance of `Files.newByteChannel()` with it. – Marko Topolnik Sep 06 '14 at 12:07
  • @MarkoTopolnik and when you say a 'lot of redundant copy'; doesn't it just! :) – Chris K Sep 06 '14 at 12:10
  • Your note on the semantics of `lazySet` is very interesting... I have been studying the semantics of `volatile` and `lazySet` quite thoroughly and actually could not find any specified difference. There is no guarantee from the Java Memory Model that, as soon as the `volatile` write completes, another thread is guaranteed to observe the write. The JMM gives no wall-clock guarantees of any kind, in fact. If you're interested in this, check out my [SO question on the topic](http://stackoverflow.com/questions/11761552/guarantees-given-by-the-java-memory-model). – Marko Topolnik Sep 06 '14 at 12:22
  • @RyanMarv I would not create more than one I/O thread; there would be no advantage to this (only cost), unless perhaps you were using multiple hard drives. One reason that I suggested memory-mapped files is that you may not actually need to create another thread yourself at all. The OS can do it for you via its paging mechanism. Could you confirm whether you are looking to create a journal of entries or just a single entry that gets overwritten? You may be best served by using HFT Chronicle and staying away from the implementation challenges, unless you are interested in them of course. – Chris K Sep 06 '14 at 12:45

In my work on a stress-testing HTTP client I stored the stats into an array and, when the array was ready to send to the GUI, I would create a new array for the tester client and hand off the full array to the network layer. This means that you don't need to pay for any copying, just for the allocation of a fresh array (an ultra-fast operation on the JVM, involving hand-coded assembler macros to utilize the best SIMD instructions available for the task).
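A minimal sketch of that swap pattern, with invented names, assuming a single simulation thread filling the array and a single writer thread draining the queue:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: fill an array, hand the full one off to the writer
// thread by reference (no copying), and allocate a fresh one to keep filling.
class SwapRecorder {
    final BlockingQueue<long[]> toWriter = new ArrayBlockingQueue<>(16);
    private long[] current = new long[1024];
    private int pos = 0;

    // Called by the simulation thread on every iteration.
    void record(long stat) {
        current[pos++] = stat;
        if (pos == current.length) {
            toWriter.offer(current);   // hand over the reference; if the
                                       // writer falls behind, offer() drops
                                       // the batch rather than blocking
            current = new long[1024];  // fresh array: a cheap JVM allocation
            pos = 0;
        }
    }
}
```

The writer thread simply loops on `toWriter.take()` and serializes each array to disk at its own pace.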

I would also suggest not throwing yourself head-on into the realm of optimal memory-barrier usage; the difference between a plain volatile write and an AtomicReference.lazySet() is only measurable if your thread does almost nothing else but exercise the memory barrier (at least millions of writes per second). Depending on your target I/O throughput, you may not even need NIO to meet the goal. Better to try first with simple, easily maintainable code than to dig elbows-deep into highly specialized APIs without a confirmed need for that.

Marko Topolnik
  • +1 for "Better try first with simple, easily maintainable code .. without a confirmed need for that" – Chris K Sep 06 '14 at 12:19