MP4 files are multiplexed: the audio, video, and any other tracks are interleaved in one container, so a lot can change from one part to the next. While the stream is being read (from a file or remotely), the audio could be in a variety of formats and/or be available in multiple languages. Worse still, nothing guarantees that the parts will be in order, or that the next part will even have audio.
It's probably best to avoid corrupting the boundaries and headers of each part, if you can. You could save yourself a lot of headache by using an existing MP4 library to transcode to another format where the audio and video are combined in a more direct way, although corruption artifacts would behave very differently then.
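To make "boundaries and headers" a bit more concrete, here is a rough sketch of how you might locate them, assuming the goal is to change bytes without breaking the container. It relies only on the basic box layout (a 4-byte big-endian size followed by a 4-byte type, with size 1 meaning a 64-bit size follows and size 0 meaning "runs to the end"). The names `iter_boxes` and `mdat_payload_ranges` are made up for this example, and it only walks top-level boxes, so treat it as a starting point rather than a full parser.

```python
import struct


def iter_boxes(data, start=0, end=None):
    """Yield (type, offset, header_size, total_size) for each top-level box.

    Hypothetical helper: walks data[start:end] and stops quietly at the
    first box that is truncated or has an impossible size.
    """
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A 64-bit "largesize" follows the type field.
            if pos + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # The box extends to the end of the data.
            size = end - pos
        if size < header or pos + size > end:
            break  # malformed or truncated box
        yield box_type.decode("latin-1"), pos, header, size
        pos += size


def mdat_payload_ranges(data):
    """Return (start, end) byte ranges inside 'mdat' boxes, headers excluded."""
    return [
        (offset + header, offset + size)
        for box_type, offset, header, size in iter_boxes(data)
        if box_type == "mdat"
    ]


if __name__ == "__main__":
    # Tiny synthetic buffer, not a playable file: one 'ftyp' and one 'mdat' box.
    demo = (
        struct.pack(">I4s", 16, b"ftyp") + b"isom" + b"\x00\x00\x02\x00"
        + struct.pack(">I4s", 12, b"mdat") + b"abcd"
    )
    for box in iter_boxes(demo):
        print(box)
    print(mdat_payload_ranges(demo))
```

Overwriting bytes only inside the ranges this returns, without changing the length, should leave the box headers and the index in `moov` untouched. The `mdat` payload still interleaves audio and video samples, though, so this alone will not tell you which track you are hitting.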
Please update your question with more of your findings; it sounds interesting!