I'm searching for information on the best way to recognize duplicate segments of speech in an audio file.
Let's say somebody is recording himself reading a text aloud. Sometimes he will stumble on a sentence, stop, and start again from the beginning of that sentence. He may also do two or three takes of the same part in order to keep the best one in the final edit.
So my question is: what is the best way to detect that those segments are the same, or are variations on the same text?
What I'm thinking of is running speech-to-text and then doing some text comparison on the result. I would be able to identify strings that are very close to each other and then tag the corresponding audio segments as being "the same".
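To make the idea concrete, here is a rough sketch of the text-comparison step only, using Python's standard difflib. It assumes the speech-to-text tool already gives me segments with start/end timestamps; the sample transcript, the function names, and the 0.7 threshold are all made up for illustration, and the threshold would certainly need tuning.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical transcript: (start_sec, end_sec, text) for each segment,
# as a speech-to-text tool with timestamps might produce it.
segments = [
    (0.0, 3.2, "the quick brown fox jumps over"),                # aborted take
    (3.9, 8.1, "the quick brown fox jumps over the lazy dog"),   # full take
    (8.5, 11.0, "it then runs into the forest"),                 # different sentence
    (11.4, 15.8, "the quick brown fox jumped over the lazy dog"),  # reworded take
]

def normalize(text):
    """Lowercase and split into words so the comparison is word-based."""
    return text.lower().split()

def similarity(a, b):
    """Similarity ratio in [0, 1] between two transcribed strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def find_duplicates(segments, threshold=0.7):
    """Return (i, j, score) for every pair of segments scoring >= threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(segments), 2):
        score = similarity(a[2], b[2])
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs

if __name__ == "__main__":
    for i, j, score in find_duplicates(segments):
        print(f"segments {i} and {j} look like takes of the same text ({score})")
```

With this toy data the unrelated sentence is never paired with anything, but a short false start compared against a reworded full take can fall below the threshold, so comparing only the overlapping prefix might be a better fit for the "choke and restart" case.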
But I was wondering whether there is a way to do this directly on the audio file. I've heard of audio fingerprinting, but I'm not sure it would work here, because the person may not pronounce the two sentences exactly the same way (adding silences or even slightly changing some words).
Has anybody already done something similar, or used those tools, and can give me feedback on their possibilities and limitations?