
I have two audio files in which the same sentence is read (sung, like a song) by two different people, so the files have different lengths. They are vocals only, with no instruments.

A1: Audio File 1
A2: Audio File 2
Sample sentence : "Lorem ipsum dolor sit amet, ..."

(figure: structure of the sample audio files)

I know the time at which every word starts and ends in A1, and I need to automatically find the time at which every word starts and ends in A2. (Any language, preferably Python or C#.)

The times are saved in XML, so I can split the A1 file by word. How can I find the sound of a word in another audio file, where the word has a different duration and is spoken by a different voice?

– Kadir Şahbaz

4 Answers


From what I read, it seems you want Dynamic Time Warping (DTW). I'll leave the full explanation to Wikipedia, but it is commonly used to match speech patterns while staying robust to differences in speaking speed and pronunciation.

Sadly, I am better versed in C, Java, and Python, so I will suggest Python libraries:

  1. fastdtw
  2. pydtw
  3. mlpy
  4. rpy2

With rpy2 you can call R's DTW implementation from your Python code. Sadly, I couldn't find any good tutorials for this, but there are good examples if you choose to use R directly.
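As a rough illustration, here is a minimal sketch of the DTW idea using fastdtw, under the assumption that you extract MFCC features with librosa (file names, the sample rate, and the XML timing values are placeholders). It aligns the two full recordings and then maps a known word boundary from A1 onto A2's timeline via the warping path:

    # minimal sketch: align A1 and A2 with fastdtw over MFCC frames,
    # then map a known A1 word time onto A2 (assumes librosa is installed)
    import librosa
    from scipy.spatial.distance import euclidean
    from fastdtw import fastdtw

    sr, hop = 16000, 512                    # hop is librosa's default hop length
    y1, _ = librosa.load("a1.wav", sr=sr)
    y2, _ = librosa.load("a2.wav", sr=sr)

    # MFCC frames are a compact, reasonably speaker-robust representation
    m1 = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=13).T   # shape: (frames, 13)
    m2 = librosa.feature.mfcc(y=y2, sr=sr, n_mfcc=13).T

    _, path = fastdtw(m1, m2, dist=euclidean)  # path: list of (i, j) frame pairs

    def a1_time_to_a2(t):
        """Map a time in seconds in A1 onto A2 via the warping path."""
        i = int(t * sr / hop)                          # A1 frame index
        return min(path, key=lambda p: abs(p[0] - i))[1] * hop / sr

    print(a1_time_to_a2(1.25))   # e.g. a word boundary read from the XML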

Please let me know if that doesn't help. Cheers!

– Haris Nadeem

My approach would be to record the dB volume at a constant interval (such as every 100 milliseconds) and store the volumes in a list or array. I found a way of doing this in Java here: Decibel values at specific points in wav file; it is possible in other languages too. Meanwhile, keep track of the maximum volume:

max_volume = 0
for current_volume in volumes:   # one dB reading per 100 ms window
    if current_volume > max_volume:
        max_volume = current_volume

Then divide the maximum volume by an editable threshold; in my example I chose 7. Say the maximum volume is 21: 21 / 7 = 3 dB. Let's call this measure X.

Pick a second threshold, such as 1, and multiply it by X. Whenever the volume rises above this new value (1 × X), we consider that the start of a word; when it drops below it, we consider that the end of a word.
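A rough sketch of this thresholding idea in Python, assuming 16-bit mono PCM WAV input (the file name and the 100 ms window are illustrative):

    # sketch: per-window dB volume, then word start/end by threshold
    import math
    import wave
    import numpy as np

    def volumes_db(path, window_ms=100):
        """dB volume of each window, assuming 16-bit mono PCM."""
        with wave.open(path, "rb") as w:
            sr = w.getframerate()
            samples = np.frombuffer(w.readframes(w.getnframes()),
                                    dtype=np.int16).astype(float)
        step = int(sr * window_ms / 1000)
        vols = []
        for i in range(0, len(samples), step):
            rms = math.sqrt(np.mean(samples[i:i + step] ** 2))
            vols.append(20 * math.log10(rms) if rms > 0 else 0.0)
        return vols

    vols = volumes_db("a1.wav")
    x = max(vols) / 7        # first, editable threshold (7 as in the example)
    cutoff = 1 * x           # second threshold times X
    in_word = False
    for i, v in enumerate(vols):
        if v > cutoff and not in_word:
            in_word = True
            print("word starts at %.1f s" % (i * 0.1))
        elif v <= cutoff and in_word:
            in_word = False
            print("word ends at %.1f s" % (i * 0.1))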

(figure: visual explanation of the volume thresholds)

– Martin Meli
• These are arbitrary words. Normally, most words don't have any silence between them, so the words are hard to distinguish from each other using dB alone. – Kadir Şahbaz Mar 30 '18 at 20:20

Without knowing how sophisticated your understanding of the problem space is, it isn't easy to know whether to point you in a direction or to explain why this problem is non-trivial. I'd suggest starting with something like https://cloud.google.com/speech/: convert the speech blocks to text and then perform a similarity comparison on the text.

If you really want to do the processing yourself, you could look at some spectrographic analysis: take the waveform data, perform an FFT to get the frequency distribution over time, and look for marker patterns that align your samples (a minimal sketch follows at the end of this answer). With only single-word comparisons between different speakers, you are probably not going to be able to apply any kind of neural network unless you can train it on the two speakers' entire speech sets and then use the network to compare the individual word chunks.

It's been a few years since I did any of this, so maybe it's easier these days, but my recollection is that although this sounds conceptually simple, it may prove more difficult than you realise. Dynamic Time Warping looks like the most promising suggestion.
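As a rough illustration of the spectrographic route, a minimal sketch assuming scipy and a mono 16-bit WAV (the file name and the "dominant frequency" marker are placeholders for whatever pattern you choose to match on):

    # sketch: spectrogram of one file, with a crude per-slice marker
    from scipy.io import wavfile
    from scipy import signal
    import numpy as np

    sr, samples = wavfile.read("a1.wav")           # mono PCM assumed
    f, t, Sxx = signal.spectrogram(samples.astype(float), fs=sr)

    # Sxx is power per (frequency, time) bin; one crude "marker pattern"
    # is the dominant frequency in each time slice
    dominant = f[np.argmax(Sxx, axis=0)]
    print(dominant[:10])   # frequency contour to compare between the files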

– Peter Scott
• I have considered that, but the words are not in English and the recording is not speech, so converting to text doesn't seem like an option for my problem. It needs, I think, fingerprinting of the sound parts and searching for them in the other audio. The other problem is that the readers are different. DTW may help solve the problem; I'll try it. – Kadir Şahbaz Mar 31 '18 at 10:21

The secret sauce of the approach below: pointA - pointB is zero when both points have the same value. We leverage this to identify the file offset at which the difference between the two raw audio curves is minimal; it will only be close to zero in a relative sense, since the two recordings differ at least slightly.

The approach: open both files and pull out the raw audio curve (the sequence of sample integers) of each. Define two variables, bestSum and currentSum, and initialize bestSum to some arbitrarily high value such as MAX_INT_VALUE. Iterate across both files simultaneously, and at each position take the integer value of the current raw audio sample in file A and in file B, accumulating the absolute difference |A - B| into currentSum (the absolute value matters, otherwise positive and negative differences cancel out); continue until you reach the end of one file. Wrap this in an outer loop that repeats the whole comparison while introducing a time offset into one file, resetting currentSum each time; after each inner loop, if currentSum < bestSum, update bestSum and store the current offset. Your common audio is at the offset that produced the minimum total sum, i.e. the offset at which you encountered bestSum.

Do not start coding until you have gained the intuition that the above makes perfect sense.

I highly encourage you to plot the curve of the raw audio of one file, to confirm you are accessing this sequence of integers, before attempting the above algorithm.

It helps to visualize the above by viewing each input source as a curve: keep one curve steady and slide the other audio curve left or right until you see the curve shapes match, or come very close to matching.
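A minimal sketch of this sliding-offset comparison, assuming both files are 16-bit mono PCM at the same sample rate (file names are placeholders, and a real implementation would step the offset in coarser increments for speed):

    # sketch: slide the A1 word clip along A2, tracking the offset with
    # the minimum sum of absolute differences
    import wave
    import numpy as np

    def read_samples(path):
        with wave.open(path, "rb") as w:
            return np.frombuffer(w.readframes(w.getnframes()),
                                 dtype=np.int16).astype(np.int64)

    a = read_samples("word_a1.wav")   # the word clip cut from A1
    b = read_samples("a2.wav")        # the full A2 recording

    best_sum, best_offset = None, 0
    for offset in range(len(b) - len(a)):   # slide the short curve along the long one
        current_sum = np.abs(a - b[offset:offset + len(a)]).sum()
        if best_sum is None or current_sum < best_sum:
            best_sum, best_offset = current_sum, offset

    print("best match starts at sample", best_offset)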

– Scott Stensland