Several months ago I wrote this question regarding buffer starvation on a DirectShow graph.

The starvation issue was solved by implementing a custom allocator that expands in size when starved. However, this merely mitigated the real problem; given enough time, the number of samples held in the graph becomes excessive and the ever expanding pool creates an out-of-memory situation.

Here are some facts I have managed to gather:

  1. The graph is basically transcoding an MPEG2-TS stream to an MP4 file, as well as extracting audio and video data for some real-time DSP processing.

  2. The stream arrives as a UDP multicast stream carrying 14 different SD programmes.

  3. I am reading the UDP stream using a custom filter derived from the DsNetwork example. Following that example, a media sample (with NO timestamps) is created around each received UDP data block (an 8 KiB block) and passed to Microsoft's MPEG2 Demultiplexer filter, which is configured to filter the programme of interest; a sketch of this push appears after this list. (Should I be timestamping the samples?)

  4. The filter that requires an expandable allocator is the MPEG2 Demultiplexer; in particular, the expansion is required for the samples delivered by the output video pin. The output audio pin works fine with a default allocator; no samples are retained by the audio decoder or the demuxer.

  5. The video samples are decoded by LAV Video Decoder. Swapping the LAV filter for ffdshow has no positive effect: the accumulation is still present. I have found no setting in either LAV or ffdshow (including the sample queue settings) that alleviates the accumulation problem.

  6. The problem is directly related to the quality of the received stream. The more discontinuities detected on the stream (as flagged by the MPEG demuxer output samples), the more samples tend to accumulate. Incidentally, a VLC player running in parallel and consuming the same stream logs the same discontinuities, so they don't seem to be induced by buggy network code on my part.

  7. The lingering samples are not lost; they are eventually processed by the graph. I wrote some watchdog logic to detect the possibility of lost samples, and every sample is eventually properly released and returned to the pool.

  8. The lag is not related to CPU starvation. If I stop delivering samples to the demuxer, the demuxer stops delivering samples to the output pins. I NEED to push new samples into the demuxer for the lingering samples to be properly released and returned to the pool.

  9. I tried removing the clock from the capture graph, as well as from the muxer graphs (bridged by a GDCL bridge filter). This does not fix the problem and can actually block the data flow.

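For reference, the sample push described in point 3 looks roughly like the sketch below. It is a minimal sketch in the spirit of the DsNetwork example, not my actual code; the pin and buffer names are simplified placeholders.

#include <streams.h>  // DirectShow base classes
#include <cstring>

// Wrap one received UDP block (8 KiB) in a media sample and push it
// downstream to the demultiplexer. Note that no timestamps are set.
HRESULT PushBlock(CBaseOutputPin *pPin, const BYTE *pBlock, long cbBlock)
{
   IMediaSample *pSample = nullptr;
   HRESULT hr = pPin->GetDeliveryBuffer(&pSample, nullptr, nullptr, 0);
   if (FAILED(hr))
      return hr;  // allocator starved - no free buffer available

   BYTE *pData = nullptr;
   pSample->GetPointer(&pData);
   memcpy(pData, pBlock, cbBlock);
   pSample->SetActualDataLength(cbBlock);
   // No SetTime() call here - the samples carry no timestamps.

   hr = pPin->Deliver(pSample);  // the demuxer's input pin receives the sample
   pSample->Release();           // our reference; the buffer returns to the
                                 // pool when the last holder releases it
   return hr;
}
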
I have no idea whether the samples are being held by the demultiplexer or by the video decoder. The truth is that I am completely clueless about how to debug, and hopefully fix, this situation, and any pointers or suggestions are more than welcome.

Addendum:

I have some additional information:

  1. The transcoded video is lagging relative to the audio.
  2. The lag time is proportional to the amount of lingering samples.

So I think that at some point in the graph processing, the decoded audio and video sample timestamps get out of sync, and probably the muxer endpoint of the graph is blocking the video decoding thread, waiting for the corresponding audio to arrive.

Any hints on how I can detect the offending filter, or perhaps on how I can "rebase" the syncing?

Addendum 2:

As you can see in the comments on Roman's answer, I had actually found a bug that induced false discontinuities in the stream. Fixing that bug reduced the number of occurrences of the problem, yet it did not fix the root cause!

It turns out that the root of the problem was the Monogram AAC encoder filter (at least the version I managed to get, as the project seems to be unmaintained now).

The encoder computes the output timestamps incrementally, dividing the running count of received samples by the sampling frequency of the input. The filter assumes that the data flow is always continuous and does not even examine the incoming samples for discontinuities! Fixing it was easy once I identified the problem, but this was the hardest problem I have ever had to debug as a developer, because all the evidence pointed at the MPEG2 demuxer: the timestamps drifted between the encoded output audio and video pins, and it was this filter that was running out of pooled samples in the first place. Yet the cause was indirect: the worker thread of the video output pin was being blocked at the end of the graph by the MPEG4 muxer, which was receiving badly out-of-sync audio and video samples and was throttling the video input to try to keep things in sync.
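
For illustration, the fix amounts to something like the following sketch. All names here are hypothetical and simplified; the actual filter code is different.

#include <dshow.h>

// Discontinuity-aware incremental timestamping for an audio encoder.
class DiscontinuityAwareStamper
{
   REFERENCE_TIME _baseTime = 0;    // rebased on every discontinuity
   LONGLONG _samplesDelivered = 0;  // audio samples since the last rebase
   LONG _sampleRate;                // input sampling frequency, in Hz

public:
   explicit DiscontinuityAwareStamper(LONG sampleRate) : _sampleRate(sampleRate) {}

   // Computes the output start time for a batch of incoming audio samples.
   REFERENCE_TIME Stamp(IMediaSample *pIn, LONGLONG samplesInBatch)
   {
      REFERENCE_TIME rtStart = 0, rtStop = 0;
      if (pIn->IsDiscontinuity() == S_OK && SUCCEEDED(pIn->GetTime(&rtStart, &rtStop)))
      {
         // Rebase the incremental clock from the upstream timestamp instead
         // of assuming the audio flow is unbroken.
         _baseTime = rtStart;
         _samplesDelivered = 0;
      }
      // Elapsed samples divided by the sampling frequency, in 100 ns units.
      REFERENCE_TIME rt = _baseTime + (_samplesDelivered * 10000000LL) / _sampleRate;
      _samplesDelivered += samplesInBatch;
      return rt;
   }
};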

Indeed, the notion of filters as "black boxes" must be treated with caution: threads flow along the graph, and a problem in a downstream filter may manifest as a false problem in an upstream filter.

BlueStrat
  • This is a tough one. A debugging suggestion could be to observe the `AddRef` and the `Release` methods, but since you don't have the source code, this won't bring much... – dymanoid Apr 17 '19 at 11:21

2 Answers

First of all, the described behavior sounds like a bug: unintended behavior causing unwanted effects. I agree, however, that attempts to work around the problem require identifying the offender and investigating the observed problem in detail.

Since video lags audio by an amount that correlates with the number of lingering samples, and there is no other side effect (such as lost frames), I agree that the challenge is finding out exactly who holds the media samples.

I can suggest two methods off the top of my head.

Inspection of memory allocators

This method is not so popular, for reasons I omit for brevity; however, it still has a good chance of working. The background is that pin connections involve negotiating a memory allocator. The memory allocator is the private business of the pins, so the controlling application in most cases has no direct control over (and often not even access to) the data flow. More often than not, each pin pair has its own allocator defined; however, sometimes, and not so rarely, multiple pin pairs use the same allocator. Note that it is the output pin of a connection that has the final decision on the allocator to use.

If you happen to be familiar with my DirectShowSpy tool, one of the things it does is enumerate memory allocators.

It can show the memory allocators, which connections share a memory allocator, and a snapshot of the buffer count and free buffer count for each.

For brevity, I omit the situations where this is inaccurate.

Another important note is that this data is only available if you invoke the spy UI from within the process where the DirectShow graph is running, as opposed to accessing the filter graph remotely via the Running Object Table.

This means that you are supposed to do the following:

  1. register spy
  2. have your application running (with the filter graph)
  3. from the controlling thread (typically), IUnknown::QueryInterface for AlaxInfoDirectShowSpy::ISpy on your IGraphBuilder interface pointer
  4. call ISpy::DoPropertyFrameModal to show the UI in question

You can obtain AlaxInfoDirectShowSpy::ISpy via #import of the spy's type library. If the spy is not registered with COM and hence does not hook the OS Filter Graph Manager object, your QueryInterface in step 3 above will fail.

From C# code (as you tagged the question accordingly), you can import DirectShowSpy.dll as a COM reference.

Even though this method is not guaranteed to work, it has a good chance of showing you the offender via visualization of memory allocator state, and it requires only some ten lines of code inserted into your application.
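
Those lines might look roughly like the sketch below. The #import path and the DoPropertyFrameModal argument are assumptions to verify against your copy of DirectShowSpy.

#include <atlbase.h>
#import "DirectShowSpy.dll" raw_interfaces_only  // AlaxInfoDirectShowSpy namespace

void ShowAllocatorUi(IGraphBuilder *pGraphBuilder)
{
   // This only works if the spy is COM-registered and has hooked the Filter
   // Graph Manager; otherwise the QueryInterface fails and pSpy stays null.
   CComQIPtr<AlaxInfoDirectShowSpy::ISpy> pSpy(pGraphBuilder);
   if (pSpy)
      pSpy->DoPropertyFrameModal(0);  // assumed argument: no parent window
}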

Adding a temporary diagnostic filter to trace pin connection communication

Another method, which has a better chance of working overall but requires writing quite some code, is to develop a filter that transparently forwards data from its input pin to its output pin, such as a CTransInPlaceFilter, while logging media sample data somewhere for inspection. You might want to reuse GraphStudioNext's analyzer filter for this particular purpose.

The idea is to attach this filter as early as the demultiplexer output pins and monitor/log the data as it travels downstream. Comparing timestamps on the separate legs as the data is streamed, you should be able to detect the violator. If you already see the lag when monitoring the demultiplexer output pin connections, then the demultiplexer is the offender. If things look fine there, you move the tracing downstream, especially across the decoders, and isolate the offender as you relocate the tracing filter.
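
A minimal sketch of such a tracing filter, built on the CTransInPlaceFilter base class, is below; the CLSID is a placeholder, and a real filter would also need the usual registration plumbing.

#include <streams.h>  // DirectShow base classes
#include <cstdio>

// Placeholder CLSID - generate your own with uuidgen before real use.
static const GUID CLSID_TimestampTrace =
   { 0x12345678, 0x1234, 0x1234, { 0x12, 0x34, 0x12, 0x34, 0x12, 0x34, 0x12, 0x34 } };

class CTimestampTraceFilter : public CTransInPlaceFilter
{
public:
   CTimestampTraceFilter(LPUNKNOWN pUnk, HRESULT *phr)
      : CTransInPlaceFilter(NAME("Timestamp Trace"), pUnk, CLSID_TimestampTrace,
                            phr, false)  // false: we never modify the data
   {
   }

   // Accept any media type: the filter only observes what flows through.
   HRESULT CheckInputType(const CMediaType *) override { return S_OK; }

   // Called for every media sample travelling through the connection.
   HRESULT Transform(IMediaSample *pSample) override
   {
      REFERENCE_TIME rtStart = 0, rtStop = 0;
      const HRESULT hr = pSample->GetTime(&rtStart, &rtStop);
      printf("start %lld, stop %lld, disc %d, GetTime hr 0x%08X\r\n",
             (long long)rtStart, (long long)rtStop,
             pSample->IsDiscontinuity() == S_OK, (unsigned)hr);
      return S_OK;  // pass the sample through unchanged
   }
};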

Possible workarounds

Once the violator is identified, you will have to think of a way to trick it into releasing the media samples it holds, which in turn might be a challenge of its own. Having no other helpful information at this point, I would prepare to somehow drain it on the go, by sending an end-of-stream notification, by flushing, or by using dynamic media type negotiation, in order to eventually force it to drain its internal queue.
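
For example, flushing works through the suspect filter's input pin, as sketched below. Keep in mind that flushing discards queued data, so this is a diagnostic or last-resort move rather than a clean fix.

#include <dshow.h>

// Ask a filter to drop its queued samples by flushing its input pin.
// pInputPin is the suspect filter's connected input pin.
void FlushFilterInput(IPin *pInputPin)
{
   pInputPin->BeginFlush();  // the pin starts rejecting new samples
   pInputPin->EndFlush();    // queued samples are discarded, streaming resumes
}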

Roman R.
  • Hi Roman! I wrote this surely expecting input from you. Thank you very much. I still have not solved the problem, but I now have a clear pathway to follow. I managed to substantially reduce the occurrences by tweaking the buffering at the network adapter level. With default settings and our heavy DSP and transcoding workloads, Windows seems unable to read network data fast enough, increasing packet drop dramatically... – BlueStrat Apr 22 '19 at 15:01
  • 1
    I saw nobody wants to comment, so I had to write an answer to not let you completely down :) – Roman R. Apr 22 '19 at 15:03
  • By increasing buffer sizes, queue counts and the interrupt rate (by default the network card coalesces multiple filled-buffer notifications into a single interrupt to reduce CPU workload), it now takes hours (rather than minutes or seconds) for a single instance of the problem to surface. Thanks again for your valuable input! – BlueStrat Apr 22 '19 at 15:03
  • Hi Roman, applying your advice, I have in fact isolated Microsoft's MPEG-2 demuxer as the offender. Quick question, when the graph is operating normally (without desynchronization), the compressed video packets lead the compressed audio packets by ~ 2-3 seconds. I would have expected them to be much, much closer in time. Is this normal behavior for a muxed MPEG-2 TS stream? – BlueStrat Apr 27 '19 at 16:06
  • The gap is a bit too large, but it's technically possible. The encoding side might have reasons to offset audio, or it's a side effect of the encoding configuration. I suppose you can inspect the TS with some other tool to check this mismatch (to make sure it's not coming from demultiplexing). If this offset is constant and not going up and down, it's likely to be intentional, of course. – Roman R. Apr 29 '19 at 06:13

Finally, I found the source of the problem.

After rewriting the UDP reading code to use Windows' high-performance Registered I/O (RIO), I wanted a metric of how many packets were being dropped, so I implemented a very, very simple MPEG-TS continuity checker. I found something really odd: I was not losing any packets, yet the encoders were still flagging discontinuities. It made no sense at all!

After a thorough review, I found I had a reference-counting problem in the network buffers. I was apparently returning the TS packets to the pool early, while they were still being used by the demuxers. (The network packet was shared among many graphs, and I used reference counting to control its shared lifetime.)

So, in essence, there was a race condition in which the network code could obtain a "free" buffer that was still in use by a demuxer, and the data was being clobbered. My guess is that the demuxers hit critical, unexplainable errors that induced the sync loss.
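
For context, the intended lifetime management looks roughly like this. It is a minimal sketch with hypothetical names; the real buffer structure carries more state.

#include <windows.h>

// A UDP buffer shared by several graphs: it may only return to the free
// pool after every holder has released it. My bug was, in essence, letting
// the buffer reach the pool while a demuxer still held a reference, so the
// network code could reuse and clobber data that was still being parsed.
struct UDPBuffer
{
   BYTE Data[8 * 1024];
   long payloadSizeInBytes = 0;
   volatile LONG _refCount = 0;

   void AddRef() { InterlockedIncrement(&_refCount); }

   void Release()
   {
      if (InterlockedDecrement(&_refCount) == 0)
         ReturnToPool(this);  // hypothetical pool routine
   }

   static void ReturnToPool(UDPBuffer *pBuffer);
};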

Here's the continuity checker code, in case it is useful to someone having trouble with UDP multicast streams.

void MulticastMediaSample::Initialize(MulticastSourceFilter* pFilter, MulticastSourceFilter::UDPBuffer* pBuffer) {
   _props.pbBuffer = pBuffer->Data;
   _props.lActual = pBuffer->payloadSizeInBytes;
   _pBuffer = pBuffer;

   // Network packet should be a multiple of a TS packet length (188 bytes)
   int tsPacketCount = pBuffer->payloadSizeInBytes / 188;
   if( pBuffer->payloadSizeInBytes % 188 != 0 ) {
      printf("Invalid TCP packet, length is not multiple of 188\r\n");
      exit(-8828);
   }
   BYTE* pPacket = pBuffer->Data;
   UINT header;
   for( int i = 0; i < tsPacketCount; i++ ) {
      if( pPacket[0] != 0x47 ) {
         printf("Lost Sync!\r\n");
         exit(-12423);
      }
      UINT pId = (pPacket[1] & 0x1f) << 8 | pPacket[2];  // 13-bit packet ID
      if( pId != 0x1fff ) {  // ignore "filler" (null) packets
         UINT afc = (pPacket[3] & 0x30) >> 4;  // adaptation field control
         BYTE cc = pPacket[3] & 0xf;  // 4-bit continuity counter
         auto it = pFilter->_ccMap.lower_bound(pId);
         if( it != pFilter->_ccMap.end() && !(pFilter->_ccMap.key_comp()(pId, it->first)) ) {
            // PID key exists in map, check continuity
            if( afc != 2 ) {  // packets with afc == 2 carry no payload; cc does not increment
               BYTE expected = (it->second + 1) & 0xf;
               if( cc != expected ) {
                  printf("Continuity check error for pId %d: expected %d, got %d\r\n", pId, expected, cc);
                  SetDiscontinuity(TRUE);
               }
            }
            // update key
            it->second = cc;
         } else {
            // key does not exist, insert first time 
            pFilter->_ccMap.insert(it, std::map<UINT16, BYTE>::value_type(pId, cc));
         }
      }
      pPacket += 188;
   }
#ifdef DEBUG
   ASSERT(pBuffer->payloadSizeInBytes <= sizeof pBuffer->DataCopy);
   memcpy(pBuffer->DataCopy, pBuffer->Data, pBuffer->payloadSizeInBytes);
#endif
   _pBuffer->AddRef();
   ASSERT(_refCnt == 1);
}
BlueStrat