Step 1: just ignore the freq overlap for now. Transform your audio (time domain) using an FFT, which gives you the data in the freq domain: an array of freq bins, each with an amplitude and a phase shift. Then feed this data into an inverse FFT to once again have your data as audio (time domain). A nice way to confirm your code is working: your audio out should match your audio in.
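To make the round trip concrete, here is a minimal Python sketch. It uses a small hand-written radix-2 FFT so it runs with only the standard library; in real code you would call whatever FFT library you already use. The test signal, `sample_rate`, and the names `fft`, `ifft`, `audio_in`, and `audio_out` are all assumptions for illustration.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled). len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def ifft(spectrum):
    """Inverse FFT: same transform with the opposite sign, scaled by 1/n."""
    n = len(spectrum)
    return [value / n for value in fft(spectrum, inverse=True)]

# Hypothetical test signal: two sine tones, 1024 samples at 8 kHz.
sample_rate = 8000
n = 1024
audio_in = [
    0.6 * math.sin(2 * math.pi * 440 * t / sample_rate)
    + 0.3 * math.sin(2 * math.pi * 1000 * t / sample_rate)
    for t in range(n)
]

spectrum = fft(audio_in)                      # time domain -> freq domain
audio_out = [v.real for v in ifft(spectrum)]  # freq domain -> time domain

# If the round trip works, the only difference is floating-point rounding.
max_error = max(abs(a - b) for a, b in zip(audio_in, audio_out))
print("largest sample difference:", max_error)
```

Comparing with a small tolerance rather than exact equality is deliberate, since the transform introduces tiny rounding errors.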
Step 2: once the above code is working OK, enhance it to set to 0 the magnitude (amplitude) of each freq bin (freq domain) which has no overlap. It's as easy as that.
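A rough sketch of step 2 in Python, assuming the overlap can be described as a single frequency band. The band limits `keep_lo_hz` and `keep_hi_hz`, the two test tones, and the slow direct `dft` helper (a stand-in for your real FFT call, it produces the same bins) are hypothetical choices for illustration.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct O(n^2) DFT, a slow stand-in for a real FFT call (same bins)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out

sample_rate = 8000
n = 256                                  # bin spacing = 8000 / 256 = 31.25 Hz
keep_lo_hz, keep_hi_hz = 1500.0, 2500.0  # hypothetical overlap band

def tone(freq_hz, amplitude):
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n)]

wanted = tone(2000, 0.5)    # falls inside the band, should survive
unwanted = tone(500, 0.8)   # falls outside the band, should be removed
audio_in = [a + b for a, b in zip(wanted, unwanted)]

spectrum = dft(audio_in)
for k in range(n):
    # Bins above n/2 mirror the lower bins, so fold them back to a frequency.
    freq_hz = min(k, n - k) * sample_rate / n
    if not (keep_lo_hz <= freq_hz <= keep_hi_hz):
        spectrum[k] = 0j    # magnitude 0 for every bin with no overlap

audio_out = [v.real for v in dft(spectrum, inverse=True)]

# What is left should be the 2000 Hz tone on its own.
residual = max(abs(a - b) for a, b in zip(audio_out, wanted))
print("difference from the 2000 Hz tone alone:", residual)
```

Zeroing a bin together with its mirror keeps the spectrum symmetric, so the inverse transform comes back as real audio. The tones here sit exactly on bin centers; real audio will leak across neighboring bins, so expect a less clean cut.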
During step 2 your data is in the frequency domain (after the audio has been sent into an FFT call), which is typically an array of complex numbers. Here is some pseudocode to parse this array:
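As one possible rendering in Python, the sketch below walks the complex array and pulls out the frequency, magnitude, and phase of each bin. The single 1000 Hz test tone, the `sample_rate`, and the direct DFT used to produce `spectrum` are assumptions; substitute the output of your own FFT call.

```python
import cmath
import math

sample_rate = 8000
n = 64                      # bin spacing = 8000 / 64 = 125 Hz
audio = [0.5 * math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(n)]

# Stand-in for an FFT call: a direct DFT yields the same complex array.
spectrum = [
    sum(audio[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
    for k in range(n)
]

bins = []
for k in range(n // 2 + 1):          # for real audio, bins 0..n/2 are the unique ones
    value = spectrum[k]
    freq_hz = k * sample_rate / n    # center frequency of this bin
    # Scale so a sine of amplitude A reads as A; DC and Nyquist are not doubled.
    scale = 1.0 if k in (0, n // 2) else 2.0
    magnitude = scale * abs(value) / n
    phase = cmath.phase(value)       # phase shift in radians
    bins.append((freq_hz, magnitude, phase))

peak = max(bins, key=lambda b: b[1])
print("loudest bin: %.1f Hz, magnitude %.3f, phase %.3f rad" % peak)
```

The scaling by `2 / n` is only one common convention; many FFT libraries return unscaled values, so check what yours does before comparing magnitudes.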
You will hit the challenge of wanting to use as few audio samples per FFT as possible to get the greatest degree of temporal specificity (if you use too many audio samples your audio will sound like mush). However, if you use too few audio samples, the granularity of your freq bins will be coarse: the spacing between bins is the sample rate divided by the number of samples, so fewer samples means a greater increment between each frequency in your freq domain.
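To put numbers on that tradeoff, here is a small Python sketch. The 44100 Hz sample rate and the candidate sample counts are assumptions, and `fft_tradeoff` is a hypothetical helper name.

```python
def fft_tradeoff(sample_rate, num_samples):
    """Return (spacing between freq bins in Hz, time span of the samples in ms)."""
    bin_spacing_hz = sample_rate / num_samples
    window_ms = 1000.0 * num_samples / sample_rate
    return bin_spacing_hz, window_ms

sample_rate = 44100  # assumed sample rate
table = [(n, *fft_tradeoff(sample_rate, n)) for n in (256, 1024, 4096, 16384)]

for n, spacing, window in table:
    print(f"{n:6d} samples -> bins every {spacing:7.2f} Hz, "
          f"window spans {window:7.2f} ms")
```

Each time you multiply the sample count by four, the bins get four times finer but the stretch of audio smeared into one FFT gets four times longer, which is the mush described above.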