Identifying segments when a person is speaking?

Question

Does anyone know a (preferably C# .Net) library that would allow me to locate, in voice recordings, those segments in which a specific person is speaking?

score 21 · Accepted Answer · edited Jun 09 '13 at 13:33

It's possible with the toolkit SHoUT: http://shout-toolkit.sourceforge.net/index.html

It's written in C++ and tested for Linux, but it should also run under Windows or OSX.

The toolkit was a by-product of my PhD research on automatic speech recognition (ASR). Using it for ASR itself is perhaps not that straightforward, but for Speech Activity Detection (SAD) and diarization (finding all speech of one specific person) it is quite easy to use. Here is an example:

1. Create a headerless PCM audio file: 16 kHz, 16-bit, little-endian, mono. I use ffmpeg to create the raw files:

    ffmpeg -i [INPUT_FILE] -vn -acodec pcm_s16le -ar 16000 -ac 1 -f s16le [RAW_FILE]

   Prefix the headerless data with the little-endian encoded file size (4 bytes). Be sure the file has the .raw extension, as shout_cluster detects the file type based on the extension.

2. Perform speech/non-speech segmentation:

    ./shout_segment -a [RAW_FILE] -ams [SHOUT_SAD_MODEL] -mo [SAD_OUTPUT]

   The output file will provide you with segments in which someone is speaking (labeled "SPEECH"), in which there is sound that is not speech ("SOUND"), or in which there is silence ("SILENCE"). Of course, because it is all done automatically, the system might make mistakes.

3. Perform diarization:

    ./shout_cluster -a [RAW_FILE] -mo [DIARIZATION_OUTPUT] -mi [SAD_OUTPUT]

   Using the output of shout_segment, it will try to determine how many speakers were active in the recording, label each speaker ("SPK01", "SPK02", etc.) and then find all speech segments of each of the speakers.
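
The ffmpeg command writes only the raw samples, so the 4-byte size prefix has to be added separately. A minimal Python sketch of one way to do that follows. It assumes that "file size" means the number of bytes of PCM data, not counting the prefix itself; the answer does not spell this out, so check it against the toolkit if the tools reject the file. The function name and file names are hypothetical.

```python
import struct
from pathlib import Path


def prefix_with_size(headerless_path, output_path):
    """Write a 4-byte little-endian length followed by the raw PCM data.

    Returns the number of PCM bytes that were copied.
    """
    pcm = Path(headerless_path).read_bytes()
    # Assumption: the prefix holds the byte count of the PCM data only,
    # excluding the 4 prefix bytes themselves.
    header = struct.pack("<I", len(pcm))  # "<I" = little-endian unsigned 32-bit
    Path(output_path).write_bytes(header + pcm)
    return len(pcm)
```

For example, `prefix_with_size("speech.pcm", "speech.raw")` would read the ffmpeg output from `speech.pcm` and write the prefixed file as `speech.raw`, which keeps the .raw extension that shout_cluster expects.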

I hope this will help!

Thank you for your answer, Marijn, and for your listing of steps! Is this language independent, i.e. can it work in Hebrew, Japanese and so on? (Surprisingly, these "probably chosen for this example because they are so exotic" languages are exactly the languages needed :) – Avi, Nov 28 '11 at 11:40
I just used these commands and the terminal froze on the second command. Ubuntu 17.10. Maybe that's because I did nothing about this instruction: "Prefix the headerless data with little endian encoded file size (4 bytes)". Is this the case? How do I do that? – Roman, Nov 27 '17 at 15:21

score 2 · Answer 2 · Muhammad Ahmad Mujtaba · 2016-12-11T17:08:00.367

While the above answer is accurate, I have an update about an installation issue that occurred for me on Linux while installing SHoUT: undefined reference to pthread_join. The solution I found was to open configure-make.sh from the SHoUT installation zip and modify the line

CXXFLAGS="-O3 -funroll-loops -mfpmath=sse -msse -msse2" LDFLAGS="-lpthread" ../configure

to

CXXFLAGS="-O3 -funroll-loops -mfpmath=sse -msse -msse2" LDFLAGS="-pthread" ../configure

NOTE: -lpthread is changed to -pthread on Linux systems.

OS: Linux Mint 18; SHoUT version: release-2010-version-0-3
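
If you prefer not to edit the script by hand, the same change can be sketched in Python. This assumes configure-make.sh contains the LDFLAGS setting exactly as quoted above; the function names are hypothetical, and the script is left untouched if the text is not found.

```python
from pathlib import Path

OLD = 'LDFLAGS="-lpthread"'
NEW = 'LDFLAGS="-pthread"'


def patch_ldflags(script_text):
    """Return the script text with -lpthread replaced by -pthread in LDFLAGS."""
    return script_text.replace(OLD, NEW)


def patch_file(path):
    """Patch the script in place; return True if anything changed."""
    script = Path(path)
    original = script.read_text()
    patched = patch_ldflags(original)
    if patched != original:
        script.write_text(patched)
    return patched != original
```

Calling `patch_file("configure-make.sh")` from the unpacked SHoUT directory would apply the change once; a second call returns False because nothing is left to replace.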

edited Dec 11 '16 at 17:08

answered Dec 11 '16 at 16:43


Thanks for your answer, man! Did you eventually manage to successfully recognize something? – Roman Nov 27 '17 at 15:24
With SHoUT, no. I switched to Python, as it had better support for audio analysis. – Muhammad Ahmad Mujtaba Nov 29 '17 at 12:43
