2

At work we run our unit tests regularly on several build machines with the help of Jenkins. One of the unit tests opens a file with FILE_FLAG_WRITE_THROUGH, and makes tens of thousands of WriteFile calls on this file. Why it does that is not relevant to this question. On just one of the build machines, this test timed out constantly. After some investigation, it turned out the root cause for this was that each call to WriteFile was blocking (the thread was switched out, and switched in later), making the test run substantially slower. Curiously enough, when the test was started manually on this machine, WriteFile did not block, and thus the test did not time out.

After hours of trial and error, I found a "workaround": on the affected machine if Jenkins was started with Task Scheduler, WriteFile blocked, if it was started manually or by placing a BAT file in the Startup folder, it did not. I thought I needed to investigate more, so I created a minimal repro, a small program that opens a file with FILE_FLAG_WRITE_THROUGH, and calls WriteFile on it with random data 10000 times. With this minimal repro, WriteFile blocks constantly, regardless of how I launch it. I'm really confused, so here I am asking for any explanation of this behavior. All the machines I performed tests on had SATA drives, with default settings (write caching and Windows write-cache flushing enabled).

First of all for me, it's not even 100% clear what FILE_FLAG_WRITE_THROUGH should do. Information on the internet states that a special flag is passed down to the device, asking it to flush its cache after the write. According to Raymond Chen (and a bunch of other people), this is mostly irrelevant nowadays, as the majority of SATA drives silently ignore this flag due to performance reasons. On the "should or should not block" side, this page clearly states that with FILE_FLAG_WRITE_THROUGH WriteFile should block:

The write call doesn't return until the data is written to the file.

Do you have any ideas why I'm seeing this strange behavior (inconsistently)? How can I investigate further?

Donpedro
  • 828
  • 7
  • 22
  • It bypasses the file system cache. Gets you to discover how expensive disk writes really are. When it has to compete with any other processes using the disk (reads and writes) then you induce a lot of head seeks, the most expensive thing you can ever do. The cache optimizes this *heavily*, ordering the writes in cylinder order. And happens in the background so you never notice. Nothing special going on, you have to increase the timeout. – Hans Passant Jun 11 '16 at 18:32
  • I would guess that if there is no other thread ready to run, Windows will remain in your thread's context, so your thread will start again as soon as the operation has completed, rather than needing to wait to be allocated another timeslice. So I would guess that the "odd machine out" had something else running that was CPU-bound, or something along those lines. IIRC foreground tasks get slightly higher priority than background ones, which could explain the difference when you run it manually - if your thread has a higher priority, the other one will be bumped as soon as the I/O complete. – Harry Johnston Jun 11 '16 at 22:56
  • ... if so, explicitly boosting the thread priority might help. But it would still happen occasionally, whenever something with a still-higher priority is running, so as Hans already suggested you probably need to write your tests to cope. – Harry Johnston Jun 11 '16 at 22:59
  • @HarryJohnston I tried running the unit test on my machine (where I could never get `WriteFile` with this particular unit test to block) with lowered priority, and started a heavyweight, parallel compilation with high priority that runs 8 workers with heavy CPU load (there are 8 logical cores on the machine). The unit test ran substantially slower (still no timeout though), but `WriteFile` did not block even in this setup (confirmed with WPA). – Donpedro Jun 12 '16 at 08:49
  • How are you determining whether the call blocks or not? And just to eliminate any doubt, you're not using asynchronous I/O? – Harry Johnston Jun 12 '16 at 10:00
  • @HarryJohnston I'm not using asynchronous IO. I can see whether the call blocks or not by recording an ETW trace of the run, and checking the data of CPU Usage Precise (== ETW events of context switches emitted by the scheduler). – Donpedro Jun 12 '16 at 10:28
  • *"I created a minimal repro, a small program"* - It would be helpful if you posted that code, to eliminate the chance of a bug in your code, or even just so that everyone is on the same page. – IInspectable Jun 12 '16 at 10:51
  • @IInspectable: if I've read the OP correctly, the minimal version didn't successfully reproduce the unexpected behaviour. I guess the next step is for the OP to figure out what the difference is, by process of elimination. – Harry Johnston Jun 12 '16 at 21:29
  • Another thought: are you sure the testing environment isn't hooking WriteFile or CreateFile? And there's no anti-virus in play? – Harry Johnston Jun 12 '16 at 21:32

0 Answers0