New description of the problem:

I currently run our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a double buffer. Every few seconds, this loop freezes for 200 ms. I ran several tests, but none of them let me figure out what the software is waiting for. Since the software is rather complex and the test environment itself could interfere with it, I need a tool/technique to find out what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?

Original question:

In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors, and a second thread saves the data to disk in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used to write to in the data collection thread, and one buffer is used to read the data and save it to disk in the second thread. Once all the data has been read, the buffers are switched. This switch seems to be a major performance problem: each time the buffers switch, the data collection thread blocks for about 200 ms, which is far too long. However, once in a while the switch is much faster, taking nearly no time at all. (Test PC: Windows 7 64 bit, i5-4570 CPU @ 3.2 GHz (4 cores), 16 GB DDR3 (800 MHz).)

My guess is that the performance problem is linked to the data being exchanged between cores. Only when the threads happen to run on the same core is the exchange much faster. I thought about setting the thread affinity mask to force both threads onto the same core, but that also means losing real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffer switch before it can access the new data.
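
For reference, a minimal sketch of the affinity idea on Windows (an illustration only, not the project's code; the helper name is made up):

```cpp
#include <windows.h>

// Hypothetical helper: pin the calling thread to logical core 0.
// Bit n of the mask corresponds to logical processor n.
void PinCurrentThreadToCore0()
{
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0) {
        // The call failed; the thread keeps its previous affinity.
    }
}
```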

My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?

Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers just means toggling this bool value. Of course, during the toggle, all reading and writing is blocked by a mutex. I don't think that this mutex could possibly block for 200 ms. By the way, the 200 ms are very reproducible for each switch event.
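
A minimal sketch of the scheme as described (names and details are illustrative, not the actual code): two pre-sized vectors, a bool selecting the active write buffer, and a mutex guarding every access as well as the toggle itself.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

class DoubleBuffer
{
public:
    explicit DoubleBuffer(std::size_t capacity)
        : buffers_{std::vector<char>(capacity), std::vector<char>(capacity)} {}

    void Write(char byte)                       // collection thread
    {
        std::lock_guard<std::mutex> lock(mtx_);
        std::vector<char>& buf = buffers_[writeBuffer_];
        buf[writePos_ % buf.size()] = byte;     // ring-buffer wrap-around
        ++writePos_;
    }

    void Switch()                               // called once the read buffer is drained
    {
        std::lock_guard<std::mutex> lock(mtx_);
        writeBuffer_ = !writeBuffer_;           // the switch is just this toggle
        writePos_ = 0;
    }

private:
    std::vector<char> buffers_[2];
    bool writeBuffer_ = false;                  // buffers_[writeBuffer_] is written to
    std::size_t writePos_ = 0;
    std::mutex mtx_;
};
```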

Paul R.
  • How do you switch the buffers? Also, you talk about guessing that this is the issue, have you profiled it? – Tony The Lion Nov 12 '14 at 12:56
  • Why don't you use one whopping big ring-buffer? And why no asynchronous writing? – Deduplicator Nov 12 '14 at 12:58
  • What exactly are you using the double buffer for? You have [deque](http://www.cplusplus.com/reference/deque/deque/): one thread pushes records at the end (push_back), and the other thread reads from the front (pop_front). – user1 Nov 12 '14 at 12:58
  • @user3924882: That's not thread-safe though. And if you lock around pop/push, what about re-allocation? – Deduplicator Nov 12 '14 at 13:00
  • Yes, of course every call must be guarded. What re-allocation are you talking about? I didn't get you. – user1 Nov 12 '14 at 13:03
  • @user3924882: Problem is, your scheme would lead to many calls to the allocator. Unless, of course, a second deque is used for enqueueing buffers for re-use. Ok, then we have a dynamically growing ring-buffer. – Deduplicator Nov 12 '14 at 13:07
  • Are your buffers of fixed size? 200 ms is not a core affinity problem unless the other core was in a deep power save state. It could be paging, because you perhaps copied the buffer in some method by error, which could cause such things. – Alois Kraus Nov 12 '14 at 13:08
  • @Deduplicator: If I do asynchronous writing, how would this be different from what I'm doing now? Again, I would call a second thread which does the saving. – Paul R. Nov 12 '14 at 13:23
  • You let the system handle writing at its own speed, and you don't need any threads for that. Use completion notification. Actually, doing so you can reduce your application to being single-threaded. – Deduplicator Nov 12 '14 at 13:25
  • @Deduplicator: Ok, I get what you mean. Unfortunately, I can't do it this way, since I'm writing to hdf5 files and have to use the corresponding API. As far as I know, it is not possible to use this API asynchronously. Another problem is that I collect additional information from other parts of the software package, which all comes together in the writing thread. I need to combine the collected data with this information before writing it to disk. – Paul R. Nov 12 '14 at 13:34
  • Well, then it looks like you really need two deques, one for full buffers and one for empties. Unless re-allocation does not hurt too much, in which case one is enough. – Deduplicator Nov 12 '14 at 13:36
  • @AloisKraus: My buffers are de facto of fixed size, since I initialize them with a size big enough to hold the data they typically collect. I verified that they indeed do not collect more data than they are initialized for. I also verified that the buffer has not been copied anywhere. – Paul R. Nov 12 '14 at 13:44
  • @Deduplicator: The initial implementation used two deques. It showed the same effect, plus re-allocation overhead. – Paul R. Nov 12 '14 at 13:46
  • @TonyTheLion: The buffer consists of two std::vectors which are accessed as ring buffers. Each time data is written to/read from the buffer, a switch is called which directs the writing/reading to the active write/read buffer. Switching the buffer just means that the bool value tested in these switches is toggled. – Paul R. Nov 12 '14 at 13:48
  • Ok. Then don't use deques, but a ring-buffer, probably using `std::vector`. – Deduplicator Nov 12 '14 at 13:48
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/64789/discussion-between-paul-r-and-deduplicator). – Paul R. Nov 12 '14 at 14:00
  • You might want to look at this (or search/write your own): http://stackoverflow.com/questions/19059336/c-threadsafe-ringbuffer-implementation (a minimal sketch of such a structure follows after these comments). – Deduplicator Nov 12 '14 at 14:34
  • Can you enable TSX on your system? – Leeor Nov 19 '14 at 12:57
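
Picking up the ring-buffer suggestion from the comments, here is a minimal single-producer/single-consumer sketch (an illustration under the stated assumption of exactly one writer and one reader, not code from the question). With one producer and one consumer, head and tail can be plain atomics, so the collection thread never blocks on a mutex held by the saver:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

template <typename T>
class SpscRingBuffer
{
public:
    explicit SpscRingBuffer(std::size_t capacity)
        : buf_(capacity + 1), cap_(capacity + 1) {}  // one slot kept empty to tell full from empty

    bool TryPush(const T& item)                      // producer thread only
    {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t next = (head + 1) % cap_;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                            // full: caller decides (drop, grow, ...)
        buf_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool TryPop(T& item)                             // consumer thread only
    {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                            // empty
        item = buf_[tail];
        tail_.store((tail + 1) % cap_, std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;
    std::size_t cap_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

The collection thread calls TryPush and must handle a false return (drop or grow upstream); the saving thread drains with TryPop at its own pace.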

3 Answers

Locking and releasing a mutex just to switch one bool variable will not take 200 ms.

The main problem is probably that the two threads are blocking each other in some way. This kind of blocking is called lock contention. Basically, it occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead of parallelism, you get two threads waiting for each other to finish their part of the work, with a similar effect to a single-threaded approach.
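
A contrived, self-contained sketch (not the asker's code) of how contention produces exactly such stalls: whenever the saving thread holds the shared mutex across slow work, the collector's supposedly cheap lock blocks for the whole duration.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex mtx;

void Collector()
{
    for (int i = 0; i < 5; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        {
            std::lock_guard<std::mutex> lock(mtx);
            // toggle buffer flag, push data ... (fast)
        }
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("collector blocked for %lld ms\n", static_cast<long long>(ms));
    }
}

void Saver()
{
    for (int i = 0; i < 5; ++i) {
        std::lock_guard<std::mutex> lock(mtx);
        std::this_thread::sleep_for(std::chrono::milliseconds(200)); // stands in for slow disk I/O
    }
}

int main()
{
    std::thread s(Saver), c(Collector);
    s.join();
    c.join();
}
```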

For further reading, I recommend this article, which describes lock contention in more detail.

Izzy
  • Whilst this may theoretically answer the question, [it would be preferable](http://meta.stackoverflow.com/q/8259) to include the essential parts of the answer here, and provide the link for reference. – Mgetz Nov 13 '14 at 20:45

Since you are running on Windows, maybe you use Visual Studio? If so, I would resort to the VS profiler, which is quite good (IMHO) in such cases, as long as you don't need to check data/instruction caches (then Intel's VTune is the natural choice). From my experience, VS is good enough to catch contention problems as well as CPU bottlenecks. You can run it directly from VS or as a standalone tool. You don't need VS installed on your test machine; you can just copy the tool and run it locally.

```
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples   - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345                                 - detach from the process
VSPerfCmd.exe /shutdown                                     - shut down the profiler; samples.vsp is written (see the first line)
```

Then you can open the file and inspect it in Visual Studio. If you don't see anything keeping your CPU busy, switch to contention profiling: just change the "start" argument from "SAMPLE" to "CONCURRENCY".
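
Following that advice, the attach line from above would become (same placeholder process id, output name illustrative):

```
VSPerfCmd.exe /start:CONCURRENCY /attach:12345 /output:contention
```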

The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\; AFAIR it has been available since VS2010.
Good luck

kreuzerkrieg

After discussing the problem in the chat, it turned out that the Windows Performance Analyzer is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link in the chat: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx.) The following steps revealed what the software had been waiting for:

  • Record information with the WPR using the default settings and load the saved file in the WPA.
  • Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be identified easily: since it saves data to disk, it is the one with file access. (Look at Memory->Hard Faults.)
  • Check out Computation->CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. It is best to display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadId, [yellow bar], Ready (µs) sum, Wait (µs) sum, Count...
  • Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
  • Under ReadyingThreadId, I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent waits of about 100 ms, which always showed up in pairs. In the ReadyingThreadId column, I was able to read the id of the thread the recording loop was waiting for.
  • According to its CPU usage, this thread did basically nothing. In our special case, this led me to the assumption that the serial port I/O commands could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.

Further analysis showed that the fetch-data command sent via the virtual serial port pair sometimes gets lost. This might be linked to the very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received, and both the first and the second attempt to receive the data run into their 100 ms timeout.
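
To illustrate the arithmetic (hypothetical names; the real code talks to a serial port), two back-to-back receive attempts, each with a 100 ms timeout, add up to exactly the observed 200 ms stall when no reply ever arrives:

```cpp
#include <chrono>
#include <optional>
#include <thread>

// Stand-in for a blocking serial-port read with a timeout: if the fetch
// command was lost, no reply arrives and the full timeout is burned.
std::optional<int> ReceiveData(std::chrono::milliseconds timeout)
{
    std::this_thread::sleep_for(timeout);      // no data in this sketch
    return std::nullopt;
}

std::optional<int> FetchWithRetry()
{
    using namespace std::chrono_literals;
    if (auto d = ReceiveData(100ms)) return d; // first attempt: 100 ms timeout
    return ReceiveData(100ms);                 // retry: another 100 ms -> 200 ms total
}
```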

Paul R.