The general idea is to pass your raw images through an encoder and let it produce the video file. The encoder will take care of generating your keyframes and intermediate (P and B) frames, as well as any decoding metadata that needs to be stored. On top of that, running it through an encoding tool such as ffmpeg will also take care of saving the video in a known container format and properly structuring the headers. All of this is complicated and tedious to do by hand, not to mention error prone.
Whether you use ffmpeg or some other encoder is up to you. I suggest ffmpeg because it has all the functionality you need. If you want to do this entirely in code, ffmpeg is open source, so you can wrap the pieces you need in a .NET shell and call them that way. Keep ffmpeg's licensing in mind, though, if you are developing a distributable application.
This should get you started: Making movies from image files using ffmpeg/mencoder
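In case that link goes stale, here's a minimal sketch of the kind of command it describes. The filenames, frame rate, and codec are just placeholders for your own setup:

```
# Turn a numbered image sequence (frame0001.png, frame0002.png, ...)
# into an H.264 video at 25 frames per second.
ffmpeg -framerate 25 -i frame%04d.png -c:v libx264 -pix_fmt yuv420p out.mp4
```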
To add audio, check this: https://stackoverflow.com/questions/1329333/how-can-i-add-audio-mp3-to-a-flv-just-video-with-ffmpeg
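Again just a sketch, assuming you already have the video from the previous step and an audio.mp3 next to it; copying the video stream and re-encoding the audio to AAC is one reasonable choice, not the only one:

```
# Mux an audio track into the existing video without re-encoding the video;
# -shortest stops the output when the shorter of the two inputs ends.
ffmpeg -i out.mp4 -i audio.mp3 -c:v copy -c:a aac -shortest out_with_audio.mp4
```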
Now, if you want to synchronize the audio and video (let's say the image sequence is people talking and the audio is their speech), you have a much more difficult problem on your hands. At that point you need to properly multiplex the audio and video frames based on their durations. ffmpeg probably won't handle this well on its own, since by default it gives every image in the sequence the same display duration, which usually doesn't line up with the audio.
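If you do know how long each image should stay on screen, one workaround (a sketch, with made-up filenames and timings) is ffmpeg's concat demuxer, which lets you assign a per-image duration before muxing in the speech track:

```
# frames.txt - each image gets its own display duration (in seconds)
file 'frame0001.png'
duration 1.25
file 'frame0002.png'
duration 0.80
file 'frame0003.png'
duration 2.10
# repeat the last file so its duration directive is honored
file 'frame0003.png'

# then encode the timed sequence and mux in the audio:
ffmpeg -f concat -safe 0 -i frames.txt -i speech.mp3 -c:v libx264 -pix_fmt yuv420p -c:a aac out.mp4
```

You'd still have to compute those durations yourself (e.g. from your speech timing data); ffmpeg only applies them.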