I have a directory change monitor process that reads updates from files within a set of directories. I have another process (a test program) that performs small writes to a lot of files in those directories. Figure about 100 directories with 10 files in each, and about 500 files being modified per second.

After running for a while, the directory monitor process hangs on a call to fclose() in a method that is basically tailing the file. In this method, I fopen() the file, check that the handle is valid, do a few seeks and reads, and then call fclose(). These reads are all performed by the same thread in the process. After the hang, the thread never progresses.
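For illustration only, a tailing method of the kind described might look roughly like the sketch below. This is not the actual code from the monitor process; the name read_tail and its parameters are hypothetical. It follows the same sequence (fopen() with "rb", validity check, seeks, a read, fclose()) and caps the read at the caller's buffer size, since writing past a buffer is one common way a heap gets corrupted.

```c
#include <stdio.h>

/* Hypothetical helper: copy up to max_bytes from the end of the file at
   path into buf. Returns the number of bytes read, or -1 on failure.
   The caller must supply a buffer of at least max_bytes bytes. */
long read_tail(const char *path, char *buf, long max_bytes)
{
    FILE *fp;
    long size, start, got;

    if (buf == NULL || max_bytes <= 0)
        return -1;

    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    /* Find the file size by seeking to the end. */
    if (fseek(fp, 0L, SEEK_END) != 0 || (size = ftell(fp)) < 0) {
        fclose(fp);
        return -1;
    }

    /* Seek back so that at most max_bytes remain before end of file. */
    start = size > max_bytes ? size - max_bytes : 0;
    if (fseek(fp, start, SEEK_SET) != 0) {
        fclose(fp);
        return -1;
    }

    /* size - start never exceeds max_bytes, so buf cannot overflow. */
    got = (long)fread(buf, 1, (size_t)(size - start), fp);
    fclose(fp);
    return got;
}
```

Every early return closes the handle, so the sketch does not leak FILE objects under the update rate described above.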

I couldn't find any good information on why fclose() might deadlock instead of returning some kind of error code. The documentation does mention _fclose_nolock(), but it doesn't seem to be available to me (Visual Studio 2003).

The hang occurs for both debug and release builds. In a debug build, I can see that fclose() calls _free_base(), which hangs before returning. Some kind of call into kernel32.dll => ntdll.dll => KernelBase.dll => ntdll.dll is spinning. Here's the assembly from ntdll.dll that loops indefinitely:

77CEB83F  cmp         dword ptr [edi+4Ch],0 
77CEB843  lea         esi,[ebx-8] 
77CEB846  je          77CEB85E 
77CEB848  mov         eax,dword ptr [edi+50h] 
77CEB84B  xor         dword ptr [esi],eax 
77CEB84D  mov         al,byte ptr [esi+2] 
77CEB850  xor         al,byte ptr [esi+1] 
77CEB853  xor         al,byte ptr [esi] 
77CEB855  cmp         byte ptr [esi+3],al 
77CEB858  jne         77D19A0B 
77CEB85E  mov         eax,200h 
77CEB863  cmp         word ptr [esi],ax 
77CEB866  ja          77CEB815 
77CEB868  cmp         dword ptr [edi+4Ch],0 
77CEB86C  je          77CEB87E 
77CEB86E  mov         al,byte ptr [esi+2] 
77CEB871  xor         al,byte ptr [esi+1] 
77CEB874  xor         al,byte ptr [esi] 
77CEB876  mov         byte ptr [esi+3],al 
77CEB879  mov         eax,dword ptr [edi+50h] 
77CEB87C  xor         dword ptr [esi],eax 
77CEB87E  mov         ebx,dword ptr [ebx+4] 
77CEB881  lea         eax,[edi+0C4h] 
77CEB887  cmp         ebx,eax 
77CEB889  jne         77CEB83F 

Any ideas what might be happening here?

Haw-Bin
  • My money is on a heap corruption. Take a look at http://stackoverflow.com/questions/1010106/how-to-debug-heap-corruption-errors – NPE May 23 '11 at 17:26
  • Can you post the relevant parts of your monitoring process's source code? – NPE May 23 '11 at 17:31
  • What parameters are you using in fopen? What filesystem are you using? Which OS Version? Which runtime library (MT/ST)? Is DCOM involved (Apartment Model)? – Jens May 23 '11 at 17:45
  • @Jens: Opening with "rb". Filesystem is NTFS, remote share. Processes are running on Windows 7. Runtime library is MD / MDd. DCOM is not involved. – Haw-Bin May 23 '11 at 18:24
  • @aix: Sorry, I can't post the code. Heap corruption does sound like a possibility; I'll look into it. – Haw-Bin May 23 '11 at 18:34
  • As you are using MD libraries, I think, you are deadlocking in a CriticalSection. Try using the single threaded libraries to avoid deadlocking. If the monitor program is simple enough, you can perhaps solve all concurrency issues for yourself. – Jens May 24 '11 at 12:36
  • @Jens - If it really is looping in the assembly code that was excerpted, I really doubt this has anything to do with a critical section. There are no atomic instructions here. It looks plausible that this assembly code is traversing a linked list, possibly some internal data structure for the heap. (A free list? A list of outstanding allocations?) – asveikau May 31 '11 at 23:05

2 Answers

I posted this as a comment, but I realize this could be an answer in its own right...

Based on the disassembly, my guess is you've overwritten some internal heap structure maintained by ntdll, and it is looping forever iterating through a linked list.

In particular, at the start of the loop the current list node seems to be in ebx. At the end of the loop, the expected last node (or terminator, if you like; these look a bit like circular lists where the last node is the same as the first, with the address of that node being edi+0C4h) is in eax. The cmp ebx, eax probably never comes out equal, because heap corruption has introduced a cycle in the list that no longer passes through the terminator.
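To make that failure mode concrete, here is a minimal sketch of the same kind of traversal. The struct layout and the name count_nodes are assumptions for illustration, not the real ntdll structures; the single link field stands in for whatever pointer the loop follows with mov ebx,[ebx+4]. Unlike the real loop, this version carries a step limit so that a broken list is reported instead of spinning forever.

```c
#include <stddef.h>

/* Hypothetical list node: one link pointer, standing in for the field
   the disassembled loop follows on each iteration. */
struct node {
    struct node *link;
};

/* Walk a circular list starting after head and stop when the walk
   returns to head, as the disassembled loop does with cmp ebx,eax.
   Returns the number of nodes visited, or -1 if max_steps nodes were
   visited without reaching head again (a sign the ring is broken). */
long count_nodes(const struct node *head, long max_steps)
{
    const struct node *cur = head->link;
    long n = 0;

    while (cur != head) {
        if (n >= max_steps)
            return -1;   /* never got back to head: cycle bypasses it */
        n++;
        cur = cur->link;
    }
    return n;
}
```

With an intact ring head, a, b, head the function returns 2. If a stray write redirects b's link back to a, the walk bounces between a and b and never sees head again; without the step limit it would loop indefinitely, which matches the observed hang.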

I don't think this has anything to do with locks; otherwise we would see some atomic instructions (e.g. lock cmpxchg, xchg, etc.) or calls to other synchronization functions.

asveikau
  • It did turn out to be a heap corruption issue. aix pointed this out almost immediately, but I'll mark this as the answer for actually tracing through the disassembly. – Haw-Bin Jun 01 '11 at 13:27

I ran into the same problem with a file-close function. In my case, I solved it by moving the close call into the body of the calling function instead of keeping it in a function of its own.

I was also suspicious of (1) duplicated file names and (2) Windows scheduling: file I/O not completing before the next task's thread started. Windows scheduling and multithreading happen behind the curtain, so this is hard to verify, but I had a similar issue when I tried to save a lot of data as ASCII in a loop. Saving in binary solved it in that case.

My environment, IDE: Visual Studio 2015, OS: Windows 7, language: C++

Cloud Cho