This is the full command, given an input video file named vinput.mp4 and an audio file named ainput.wav:
ffmpeg -y -i vinput.mp4 -i ainput.wav -filter_complex "\
[1:a]aloop=-1:2e+09[aloop];\
[aloop]aformat=channel_layouts=mono,atrim=end=2[atrim];\
[atrim]adelay=2000[afinal];\
[afinal]afade=t=out:st=3.5:d=0.5[afinal]" \
-shortest -map "[afinal]" -map 0:v output.mp4
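The timing values in this command depend on each other: the fade start equals the delay plus the clip length minus the fade duration. As a sketch, assuming a POSIX shell with awk, a hypothetical build_filter helper can generate the same filtergraph from those three numbers. It uses a separate [adelayed] label for the delayed stream instead of reusing [afinal], which is easier to read.

```shell
#!/bin/sh
# Hypothetical helper: rebuild the filtergraph from three timing values.
#   $1 start  = when the audio should begin, in seconds (2 in the command above)
#   $2 length = how much audio to keep, in seconds (2)
#   $3 fade   = fade-out duration, in seconds (0.5)
build_filter() {
  start=$1 length=$2 fade=$3
  # adelay expects milliseconds (rounded to the nearest integer here).
  delay_ms=$(awk -v s="$start" 'BEGIN { printf "%d", s * 1000 + 0.5 }')
  # afade runs after adelay, so the fade must start at start + length - fade.
  fade_st=$(awk -v s="$start" -v l="$length" -v f="$fade" 'BEGIN { print s + l - f }')
  printf '%s' "[1:a]aloop=-1:2e+09[aloop];[aloop]aformat=channel_layouts=mono,atrim=end=${length}[atrim];[atrim]adelay=${delay_ms}[adelayed];[adelayed]afade=t=out:st=${fade_st}:d=${fade}[afinal]"
}

build_filter 2 2 0.5
```

The result could then be passed as -filter_complex "$(build_filter 2 2 0.5)" in place of the inline string.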
Below is the explanation of each step:
[1:a]aloop=-1:2e+09[aloop]
This loops the audio an infinite number of times (-1), with a loop buffer of up to 2e+09 samples, a very large value (aloop's size option is counted in samples, not frames). I didn't find a way to tell the command to simply repeat the whole audio file.
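To get a feel for how large 2e+09 samples is, the small sketch below converts it to playing time at two common sample rates (the rates are assumptions; use whatever ainput.wav actually has). As a side note, ffmpeg also has an input option, -stream_loop -1 (placed before -i ainput.wav), that loops an input file as a whole; it might be a simpler way to repeat the audio in full, though it is untested with this exact command.

```shell
#!/bin/sh
# aloop's "size" is a sample count, so its duration depends on the sample rate.
# Prints how long a loop buffer of $1 samples lasts at $2 Hz.
loop_capacity() {
  size=$1 rate=$2
  secs=$(( size / rate ))
  printf '%s Hz: %sh %smin\n' "$rate" $(( secs / 3600 )) $(( secs % 3600 / 60 ))
}

for rate in 44100 48000; do
  loop_capacity 2000000000 "$rate"
done
# 44100 Hz: 12h 35min
# 48000 Hz: 11h 34min
```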
[aloop]aformat=channel_layouts=mono,atrim=end=2[atrim]
This takes the looped audio from the previous step ([aloop]), converts it to mono (aformat) and trims it so that it lasts only 2 seconds (atrim).
[atrim]adelay=2000[afinal]
This delays the start of the audio by 2000 milliseconds (2 seconds), so it starts playing 2 seconds into the video.
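Converting to mono likely matters here: adelay takes one delay per channel, separated by |, and channels without a listed delay are not delayed, so on a stereo stream adelay=2000 would shift only the first channel. The hypothetical adelay_arg helper below sketches how the argument would look for a given channel count.

```shell
#!/bin/sh
# Hypothetical helper: repeat a delay (in ms) once per channel, joined by "|",
# which is the per-channel syntax adelay expects.
adelay_arg() {
  ms=$1 channels=$2
  arg=$ms
  i=1
  while [ "$i" -lt "$channels" ]; do
    arg="$arg|$ms"
    i=$((i + 1))
  done
  printf '%s' "$arg"
}

echo "adelay=$(adelay_arg 2000 1)"   # mono, as in the command: adelay=2000
echo "adelay=$(adelay_arg 2000 2)"   # stereo: adelay=2000|2000
```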
[afinal]afade=t=out:st=3.5:d=0.5[afinal]
This fades out the audio: the fade starts at 3.5 seconds (st=3.5) and lasts 0.5 seconds (d=0.5), meaning the sound fades out from second 3.5 to second 4. Since afade comes after adelay, st=3.5 refers to the delayed stream, i.e. 3.5 seconds into the video.
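afade's default curve (tri) is a linear ramp, so the gain over the fade window is easy to model. A rough sketch in awk, assuming that default curve, of the gain applied at a few moments around the fade:

```shell
#!/bin/sh
# Rough model of afade=t=out:st=$2:d=$3 with the default linear ("tri") curve:
# full volume up to st, silence from st + d on, a straight ramp in between.
fade_out_gain() {
  awk -v t="$1" -v st="$2" -v d="$3" 'BEGIN {
    if (t <= st) g = 1
    else if (t >= st + d) g = 0
    else g = 1 - (t - st) / d
    print g
  }'
}

for t in 3 3.5 3.75 4; do
  echo "t=$t gain=$(fade_out_gain "$t" 3.5 0.5)"
done
# t=3 gain=1
# t=3.5 gain=1
# t=3.75 gain=0.5
# t=4 gain=0
```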
-shortest
Since aloop repeats the audio indefinitely, it is a good idea to tell ffmpeg explicitly when to stop encoding. This is done by the -shortest option, which tells ffmpeg to end the output when the shortest output stream ends (in our case, the video).
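Note that after atrim=end=2 and adelay=2000 the filtered audio itself lasts 2 + 2 = 4 seconds, so -shortest ends the output at roughly whichever is shorter, the video or those 4 seconds. A small sketch of that arithmetic, where the video duration is a hypothetical value you would supply:

```shell
#!/bin/sh
# Approximate output length under -shortest:
#   $1 = video duration (hypothetical), $2 = adelay in seconds, $3 = atrim end.
# The "+ 0" forces numeric comparison in awk.
expected_length() {
  awk -v v="$1" -v a="$2" -v c="$3" 'BEGIN {
    audio = a + c
    vid = v + 0
    print (vid < audio ? vid : audio)
  }'
}

expected_length 3 2 2    # video shorter than the audio: 3
expected_length 10 2 2   # video longer: output cut at about 4
```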
-map "[afinal]" -map 0:v
As Samuel Williams kindly pointed out in his answer, this option tells ffmpeg which streams to include in the generated output; in this case, we want both the final audio ([afinal]) and the source video stream (0:v).
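By default, mapped streams are still re-encoded. Because 0:v does not pass through the filtergraph, it can be stream-copied instead with -c:v copy, which avoids re-encoding the video. The sketch below shows that variant; the run wrapper and DRY_RUN variable are hypothetical conveniences that print the command instead of executing it.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN is set, otherwise run it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

FILTER='[1:a]aloop=-1:2e+09[aloop];[aloop]aformat=channel_layouts=mono,atrim=end=2[atrim];[atrim]adelay=2000[afinal];[afinal]afade=t=out:st=3.5:d=0.5[afinal]'

DRY_RUN=1   # unset this to actually run ffmpeg
# Same command as above, plus -c:v copy so the video stream is copied as-is.
run ffmpeg -y -i vinput.mp4 -i ainput.wav -filter_complex "$FILTER" \
  -shortest -map "[afinal]" -map 0:v -c:v copy output.mp4
```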