I'm working on a language-learning app (cards with foreign words and their pronunciation). I can already use voice-over, but I would like to be able to import an audio CD with native pronunciation. The problem is that the audio file for a given section is not split into individual words. Is there any way to detect the gaps between them?

I managed to import the songs from the iPod library into the app folder, so I can use AVFoundation, etc. I think it is possible by processing individual samples, but I'm not sure how to do this. Any help would be appreciated.

Petr Holub

2 Answers


I finally managed to accomplish this by processing individual audio samples. There are good answers to other questions that really help in understanding how to get all the audio information you need: "AVAudioPlayer - Metering" and "Reading audio samples via AVAssetReader".

You have to:

  1. Compute the absolute value of each sample (the float amplitude value)
  2. Ignore the noise (set a tolerance that suits your audio file)
  3. Iterate through the samples and save the positions of audible signals

Be aware that the samples represent a wave that passes through zero, so a single sample below the tolerance does not mean silence: you need to look a few samples ahead to check whether the signal continues. The same applies to noise, which may occasionally peak above your tolerance.
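The steps above could be sketched roughly as follows. This is only an illustration in Python rather than the code actually used in the app; the function name `find_audible_segments` and the `threshold` and `min_gap` defaults are assumptions that would need tuning for a real recording.

```python
def find_audible_segments(samples, threshold=0.02, min_gap=4410):
    """Return (start, end) index pairs of audible regions.

    samples   -- sequence of floats in [-1.0, 1.0]
    threshold -- amplitudes at or below this count as noise (assumed value)
    min_gap   -- quiet run, in samples, needed to end a segment (assumed value)
    """
    segments = []
    start = None      # index where the current segment began
    last_loud = None  # index of the most recent sample above the threshold
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_gap:
            # Quiet for long enough: close the segment after the last loud sample.
            segments.append((start, last_loud + 1))
            start = None
    if start is not None:
        # The signal was still audible (or only briefly quiet) at the end.
        segments.append((start, last_loud + 1))
    return segments
```

Waiting for `min_gap` quiet samples before closing a segment is what handles the zero crossings inside a word. Isolated noise spikes are not filtered here; dropping segments shorter than some minimum length would be one option.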

Petr Holub

For each sound sample, s = samp[k], do:

fac = 0.01
tot = (1.-fac) * tot  +  fac * (s*s) 

This technique is a very basic form of low-pass filter; it gives you a smoothed and therefore more realistic measure of the signal energy than a single squared sample does.
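As a runnable illustration, the same update can be wrapped in a small Python function. The name `smoothed_energy` is made up for this sketch, and the samples are assumed to be floats already read from the audio file.

```python
def smoothed_energy(samples, fac=0.01):
    """Exponentially smoothed energy, one value per input sample."""
    tot = 0.0
    out = []
    for s in samples:
        # Decay the old total slightly and blend in the new squared sample.
        tot = (1.0 - fac) * tot + fac * (s * s)
        out.append(tot)
    return out
```

A smaller `fac` smooths more but reacts more slowly to the start and end of a word, so the value probably needs some experimentation.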

Another lightweight technique would be to box-integrate the last 1000 squared samples, which (divided by the window length) is also known as a running average.

ring = float[1000]   (all zeros)
tot = 0
p = 0

And then for each sample:

tot -= ring[p]
ring[p] = s*s
tot += ring[p]
p = (p+1) % 1000
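A runnable Python version of the ring buffer might look like this. It is a sketch only; `running_mean_energy` and the `window` parameter are names introduced here for illustration.

```python
def running_mean_energy(samples, window=1000):
    """Mean of the last `window` squared samples, one value per input sample."""
    ring = [0.0] * window  # squared samples currently inside the window
    tot = 0.0              # running sum of everything in `ring`
    p = 0                  # slot holding the oldest value
    out = []
    for s in samples:
        tot -= ring[p]     # drop the oldest squared sample from the sum
        ring[p] = s * s
        tot += ring[p]     # add the newest one
        # Parentheses matter: % binds tighter than + in most languages.
        p = (p + 1) % window
        out.append(tot / window)
    return out
```

For example, `running_mean_energy([1.0, 1.0, 0.0, 0.0], window=2)` returns `[0.5, 1.0, 0.5, 0.0]`. Over very long inputs the repeated add and subtract can accumulate a little floating-point error, which is usually harmless for gap detection.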

Another thing to look into would be a leaky integrator (the first formula above is a simple form of one).

You could also get away with processing only one in every 10 samples, for example. Assuming CD audio sampled at 44.1 kHz, that leaves an effective rate of 4410 Hz, which would still catch frequencies up to 2205 Hz.
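The arithmetic behind that figure can be checked with a few lines of Python. The 44100 Hz rate is an assumption (standard CD audio), and `decimated_nyquist` is a hypothetical helper name.

```python
def decimated_nyquist(sample_rate_hz, step):
    """Highest representable frequency after keeping every step-th sample."""
    return sample_rate_hz / step / 2


# Keeping one sample in ten is a simple slice in Python.
samples = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
every_tenth = samples[::10]  # keeps indices 0 and 10

print(decimated_nyquist(44100, 10))  # 2205.0
```

Skipping samples without filtering first lets higher frequencies alias into the result, which is probably acceptable when the only goal is a rough energy estimate.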

P i