I have been researching how to separate the music from an ad so that I am left with only the spoken words. I have come across several approaches using librosa and pyaudio that suggest applying a high-pass or low-pass filter. I tried this, but the music was still audible in the ad.
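For anyone who wants to reproduce the filtering idea, here is a minimal sketch of a crude FFT-based band-pass filter written with NumPy only. It is not the librosa or pyaudio code referred to above. The function names, the 300-3400 Hz "speech band", and the synthetic test tones are all assumptions made for illustration. It shows one likely reason the music survives: any part of the music that falls inside the speech band passes through the filter untouched.

```python
import numpy as np

def bandpass_fft(signal, sr, low_hz=300.0, high_hz=3400.0):
    """Zero every frequency bin outside [low_hz, high_hz] (assumed speech band)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return np.fft.irfft(spectrum * mask, n=len(signal))

def band_energy(signal, sr, low_hz, high_hz):
    """Sum of spectral power between low_hz and high_hz."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(power[(freqs >= low_hz) & (freqs <= high_hz)].sum())

# Synthetic stand-ins (hypothetical, not real ad audio):
sr = 16000
t = np.arange(sr) / sr                      # one second of samples
bass = np.sin(2 * np.pi * 80 * t)           # music content below the band
voice_like = np.sin(2 * np.pi * 1000 * t)   # tone standing in for speech
melody = np.sin(2 * np.pi * 1500 * t)       # music content inside the band
mix = bass + voice_like + melody

filtered = bandpass_fft(mix, sr)

# The bass is removed, but the in-band melody keeps its full energy.
print("bass energy after filter:  ", band_energy(filtered, sr, 0, 200))
print("melody energy before/after:",
      band_energy(mix, sr, 1400, 1600),
      band_energy(filtered, sr, 1400, 1600))
```

With real audio you would load the samples first (for example with `librosa.load`) and pass them in place of `mix`. The point of the sketch is only that a frequency filter cannot tell music from speech when both occupy the same frequencies.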
Another approach I would like to dig into is speaker diarization, but I do not yet know how to tackle the problem with it. There are some deep learning architectures available, but they probably can't differentiate between music and non-music.
Does anyone have a better idea for this?
Cheers, Andi