26

I have 15 audio tapes, one of which I believe contains an old recording of my grandmother and myself talking. A quick attempt to find the right place didn't turn it up. I don't want to listen to 20 hours of tape to find it. The location may not be at the start of one of the tapes. Most of the content seems to fall into three categories -- in order of total length, longest first: silence, speech radio, and music.

I plan to convert all of the tapes to digital format, and then look again for the recording. The obvious way is to play them all in the background while I'm doing other things. That's far too straightforward for me, so: Are there any open source libraries, or other code, that would allow me to find, in order of increasing sophistication and usefulness:

  1. Non-silent regions
  2. Regions containing human speech
  3. Regions containing my own speech (and that of my grandmother)

My preference is for Python, Java, or C.

Failing answers, hints about search terms would be appreciated since I know nothing about the field.

I understand that I could easily spend more than 20 hours on this.

Anil_M
Croad Langshan

8 Answers

14

What will probably save you the most time is speaker diarization. This works by annotating the recording with speaker IDs, which you can then manually map to real people with very little effort. Error rates are typically about 10-15% of recording length, which sounds awful, but this includes detecting too many speakers and mapping two IDs to the same person, which isn't that hard to mend.

One good tool is the SHoUT toolkit (C++), even though it's a bit picky about input format. See the author's usage notes for this tool. It outputs voice/speech activity detection metadata AND speaker diarization, meaning you get your 1st and 2nd points (VAD/SAD) and a bit extra, since it annotates when the same speaker is active in a recording.

The other useful tool is LIUM spkdiarization (Java), which basically does the same, except that I haven't yet put in enough effort to figure out how to get VAD metadata. It features a nice, ready-to-use downloadable package.

With a little bit of compiling, this should work in under an hour.
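
To illustrate the manual mapping step, here is a minimal sketch. It assumes the diarization output has already been parsed into (start, end, speaker ID) tuples in seconds; that tuple format, the function names, and the example data are all hypothetical and are not the native output of SHoUT or LIUM.

```python
from collections import defaultdict


def speaking_time(segments, id_to_person):
    """Total seconds per person after mapping diarization IDs to names.

    segments: iterable of (start_s, end_s, speaker_id) tuples (assumed format).
    id_to_person: dict built by hand after listening to a sample of each ID.
    IDs missing from the mapping are reported under their raw ID.
    """
    totals = defaultdict(float)
    for start, end, speaker_id in segments:
        person = id_to_person.get(speaker_id, speaker_id)
        totals[person] += end - start
    return dict(totals)


def regions_for(segments, id_to_person, person):
    """(start_s, end_s) spans attributed to one person, in time order."""
    return sorted(
        (start, end)
        for start, end, speaker_id in segments
        if id_to_person.get(speaker_id, speaker_id) == person
    )


# Made-up example data: S0 and S2 turned out to be the same voice.
segments = [(0.0, 12.5, "S0"), (12.5, 20.0, "S1"), (20.0, 31.0, "S2"), (40.0, 42.0, "S3")]
id_to_person = {"S0": "grandmother", "S2": "grandmother", "S1": "me"}

print(speaking_time(segments, id_to_person))
print(regions_for(segments, id_to_person, "grandmother"))
```

Mapping two IDs to one name is how the "too many speakers" error gets mended: the duplicates simply collapse into the same person.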

Community
hruske
5

You could also try pyAudioAnalysis to:

  1. Silence removal:

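    from pyAudioAnalysis import audioBasicIO as aIO
    from pyAudioAnalysis import audioSegmentation as aS

    # Read the sampling rate (Fs) and the samples (x) from the WAV file.
    [Fs, x] = aIO.readAudioFile("data/recording1.wav")
    # 20 ms short-term window and step; newer pyAudioAnalysis releases rename
    # these calls to read_audio_file / silence_removal (snake_case arguments).
    segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow = 1.0, Weight = 0.3, plot = True)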

segments contains the endpoints of the non-silence segments.

  2. Classification: Speech vs music discrimination: pyAudioAnalysis also includes pretrained classifiers, which can be used to classify unknown segments as either speech or music.
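
Once the silence-removal step has produced `segments`, a short listening list can be built from it. This is a minimal sketch that assumes `segments` is a list of `[start, end]` pairs in seconds; the helper names and the `min_length` cutoff are hypothetical and not part of pyAudioAnalysis.

```python
def to_timestamp(seconds):
    """Format a time offset in seconds as h:mm:ss (fractions are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def listening_list(segments, min_length=5.0):
    """Keep segments of at least min_length seconds, as 'start - end' strings.

    segments: list of [start_s, end_s] pairs (assumed format).
    """
    return [
        f"{to_timestamp(start)} - {to_timestamp(end)}"
        for start, end in segments
        if end - start >= min_length
    ]


# Made-up endpoints: a short click and one long non-silent stretch.
example_segments = [[0.5, 2.0], [65.0, 3725.4]]
for line in listening_list(example_segments):
    print(line)
```

Dropping very short segments keeps tape clicks and hiss bursts out of the list you have to check by ear.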
  • File "/Library/Python/2.7/site-packages/pyAudioAnalysis/audioFeatureExtraction.py", line 572, in stFeatureExtraction curFV[2] = stEnergyEntropy(x) # short-term entropy of energy File "/Library/Python/2.7/site-packages/pyAudioAnalysis/audioFeatureExtraction.py", line 48, in stEnergyEntropy subWindows = frame.reshape(subWinLength, numOfShortBlocks, order='F').copy() ValueError: cannot reshape array of size 1920 into shape (96,10) – baswaraj Apr 09 '17 at 12:22
  • unable to extract features – baswaraj Apr 09 '17 at 12:23
5

The best option would be to find an open source module that does voice recognition or speaker identification (not speech recognition). Speaker identification identifies a particular speaker, whereas speech recognition converts spoken audio to text. There may be open source speaker identification packages; try searching a site such as SourceForge.net for "speaker identification" or "voice AND biometrics". Since I have not used one myself, I can't recommend anything.

If you can't find anything but you are interested in rolling your own, there are plenty of open source FFT libraries for any popular language. The technique would be:

  • Get a typical recording of you talking normally and your grandmother talking normally in digital form, with as little background noise as possible
  • Take the FFT of every second or so of audio in these reference recordings
  • From the array of FFT profiles you have created, filter out any below a certain average energy threshold, since they are most likely noise
  • Build a master FFT profile by averaging the remaining FFT snapshots
  • Then repeat the FFT sampling technique above on the digitized target audio (the 20 hours of stuff)
  • Flag any areas in the target audio files where the FFT snapshot at a given time index is similar to your master FFT profile for you and your grandmother talking. You will need to play with the similarity setting so that you don't get too many false positives. Also note that you may have to limit your FFT frequency bin comparison to only those frequency bins in your master FFT profile that have energy. Otherwise, if the target audio of you and your grandmother talking contains significant background noise, it will throw off your similarity function.
  • Crank out a list of time indices for manual inspection
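
The steps above can be sketched in Python as follows. This is a dependency-free illustration under several assumptions: a small recursive radix-2 FFT stands in for a real FFT library, cosine similarity over all bins is an assumed choice of similarity function (the bin-limiting refinement is left out), and every function name, frame size, and threshold is hypothetical. The test signals are synthetic tones, not real speech.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even, odd = fft(values[0::2]), fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectrum(frame):
    """Magnitudes of the non-negative-frequency bins of one frame."""
    return [abs(b) for b in fft(frame)[: len(frame) // 2]]


def energy(frame):
    """Mean squared amplitude of a frame."""
    return sum(s * s for s in frame) / len(frame)


def frames(signal, size):
    """Split a signal into consecutive non-overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def build_profile(reference, size, min_energy):
    """Average the spectra of all reference frames above the energy threshold."""
    spectra = [spectrum(f) for f in frames(reference, size) if energy(f) >= min_energy]
    if not spectra:
        raise ValueError("no reference frame is above the energy threshold")
    return [sum(column) / len(spectra) for column in zip(*spectra)]


def similarity(a, b):
    """Cosine similarity between two magnitude spectra (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_matches(profile, target, size, min_energy, threshold):
    """Indices of target frames that are loud enough and resemble the profile."""
    return [
        index
        for index, frame in enumerate(frames(target, size))
        if energy(frame) >= min_energy
        and similarity(profile, spectrum(frame)) >= threshold
    ]


def tone(bins, n_frames, size, amp=1.0):
    """Synthetic test signal: a sum of sines centred on the given FFT bins."""
    return [
        amp * sum(math.sin(2 * math.pi * b * i / size) for b in bins)
        for i in range(n_frames * size)
    ]


SIZE = 256
profile = build_profile(tone([4, 8, 12], 4, SIZE), SIZE, min_energy=0.01)
target = (
    [0.0] * (4 * SIZE)                     # silence
    + tone([20, 33], 4, SIZE)              # stand-in for unrelated content
    + tone([4, 8, 12], 4, SIZE, amp=0.5)   # quieter copy of the reference "voice"
)
print(find_matches(profile, target, SIZE, min_energy=0.01, threshold=0.9))
```

Multiplying a frame index by the frame length in seconds gives the time index to inspect. For real recordings you would likely want a proper FFT routine such as `numpy.fft.rfft`, longer windowed frames, and some experimentation with the threshold.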

Note, the number of hours to complete this project could easily exceed the 20 hours of listening to the recordings manually. But it will be a lot more fun than grinding through 20 hours of audio and you can use the software you build again in the future.

Of course, if the audio is not sensitive from a privacy viewpoint, you could outsource the audio auditioning task to something like Amazon's Mechanical Turk.

Robert Oschler
3

If you are familiar with Java, you could try feeding the audio files through Minim and calculating some FFT spectra. Silence could be detected by defining a minimum level for the amplitude of the samples (to rule out noise). To separate speech from music, the FFT spectrum of a time window can be used. Speech uses some very distinct frequency bands called formants, especially for vowels, whereas music is more evenly distributed across the frequency spectrum.

You probably won't get a 100% separation of the speech/music blocks, but it should be good enough to tag the files and only listen to the interesting parts.
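
The silence-detection step described above can be sketched with a per-window RMS threshold. The sketch is in plain Python rather than Java/Minim, and it assumes the samples are already loaded as floats in the range -1 to 1; the function name, window length, threshold, and merge gap are all hypothetical starting values.

```python
import math


def non_silent_regions(samples, rate, window_s=0.5, threshold=0.02, min_gap_s=1.0):
    """Return (start_s, end_s) spans whose windowed RMS is at least `threshold`.

    samples: sequence of floats in [-1, 1] (assumed to be mono).
    rate: sampling rate in Hz.
    Loud windows separated by at most min_gap_s of quiet are merged into one span.
    """
    window = max(1, int(rate * window_s))
    regions = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms < threshold:
            continue  # treat this window as silence or tape noise
        t0, t1 = start / rate, (start + len(chunk)) / rate
        if regions and t0 - regions[-1][1] <= min_gap_s:
            regions[-1] = (regions[-1][0], t1)  # extend the previous span
        else:
            regions.append((t0, t1))
    return regions


# Synthetic check: 2 s silence, 3 s tone, 5 s silence, 1 s tone at 1 kHz sampling.
RATE = 1000
beep = lambda seconds: [0.5 * math.sin(2 * math.pi * 50 * i / RATE) for i in range(seconds * RATE)]
signal = [0.0] * (2 * RATE) + beep(3) + [0.0] * (5 * RATE) + beep(1)
print(non_silent_regions(signal, RATE))
```

The threshold needs tuning per tape, since hiss on old cassettes can sit well above digital silence.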

http://code.compartmental.net/tools/minim/

http://en.wikipedia.org/wiki/Formant

Nikolaus Gradwohl
2

Two ideas:

  • Look in the "speech recognition" field, for example CMUSphinx
  • Audacity has a "Truncate silence" tool that might be useful.
Anders Lindahl
2

I wrote a blog article a while ago about using Windows speech recognition. It includes a basic tutorial on converting audio files to text in C#. You can check it out here.

mrtsherman
  • It looks like Wordpress garbled my code blocks at some point. I will try and fix them up this weekend. Rereading it though, if you want to roll your own speech processor I think this is a great place to start. – mrtsherman Apr 22 '11 at 18:46
  • Perhaps oddly, the approach of using speech recognition hadn't occurred to me, so thanks for triggering that thought even though your answer suggests using software that isn't open source (going on a quick search, there doesn't seem to be a mono implementation of System.Speech). – Croad Langshan Apr 22 '11 at 19:25
  • That's too bad. I wish I had an open source alternative for you! – mrtsherman Apr 22 '11 at 19:48
  • To add to the MS System.Speech thread: it is free, even though it isn't open source. Windows Vista and 7 include a free recognition engine that can be programmed through .NET's System.Speech or with the C++ SAPI API. These engines include a dictation grammar which could be used to transcribe this audio. See http://stackoverflow.com/questions/5467827/good-speech-recognition-api/5473407#5473407 for a short example and remember that you can call the SetInputToWaveFile method to read from audio files rather than the microphone. – Michael Levy Apr 22 '11 at 20:14
0

I'd start here,

http://alize.univ-avignon.fr/

http://www-lium.univ-lemans.fr/diarization/doku.php/quick_start

Code::Blocks is good for gcc

Kickaha
-2

Try Audacity: view the track as a spectrogram (logf) and train your eyes(!) to recognize speech. You will need to tune the time scale and FFT window.

eldarerathis
  • This is not as bad a suggestion if you have a huge monitor, loads of screen space, and a human willing to stare at a screen. – Kickaha Nov 15 '17 at 09:16