How to automatically transcribe a Skype meeting, correctly attributed to each participant?

Question

Assuming each participant agrees to the recording and transcription of the Skype call, is there a way to transcribe the meeting (live, offline, or both) so that it produces a text transcript in which each utterance is correctly attributed to its speaker? The transcript could then be fed to any variety of search or NLP algorithms.

The top 3 Google search hits for "automatically transcribe Skype" point to apps that make manual transcription easier:

(1) http://www.dummies.com/how-to/content/how-to-convert-skype-audio-to-text-with-transcribe.html

(2) http://ask.metafilter.com/231400/How-to-record-and-transcribe-Skype-conversation

(3) https://www.ttetranscripts.com/blog/how-to-record-and-transcribe-your-skype-conversations

While it would be trivial to record the audio and send it to a speech-to-text engine, I doubt the result would be very high quality, because the best results usually come from speaker-dependent models (otherwise we wouldn't have to spend time training Dragon NaturallySpeaking).

But before we can choose speaker-dependent transcription models, we need to know which segment of the audio belongs to which speaker. There are two ways this could be solved:

1. There is an easy way to retrieve all the audio that came from each participant, e.g. you record the audio from each speaker's microphone separately during the call, so no segmentation is needed.
2. If the first option is infeasible or prohibitive in some way, we have to use a speaker diarization algorithm, which segments the audio into N clusters/speakers (most algorithms can be told how many speakers are in the audio, but some can work this out on their own). For a real-time transcript as the call goes on, I imagine we'd need some fancy real-time speaker diarization algorithm.
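
To make the first option concrete, here is a minimal sketch. It assumes you already have one time-aligned track per participant, each a list of float samples in [-1, 1]. It labels every frame with the loudest participant and collapses the labels into timed segments. The function names, frame length, and energy threshold are all invented for this example. None of it is a Skype API, and real voice activity detection is considerably more robust than a fixed RMS threshold.

```python
from math import sqrt


def frame_rms(samples, frame_len):
    """Split samples into fixed-size frames and return the RMS energy of each."""
    return [
        sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def active_speaker_per_frame(tracks, frame_len=160, threshold=0.05):
    """tracks: dict mapping speaker name -> list of float samples.

    Returns one label per frame: the loudest speaker, or None when every
    track is below the (hypothetical) silence threshold.
    """
    energies = {name: frame_rms(s, frame_len) for name, s in tracks.items()}
    n_frames = min(len(e) for e in energies.values())
    labels = []
    for i in range(n_frames):
        name, level = max(
            ((n, e[i]) for n, e in energies.items()), key=lambda pair: pair[1]
        )
        labels.append(name if level >= threshold else None)
    return labels


def frames_to_segments(labels, frame_len, sample_rate):
    """Collapse runs of identical labels into (speaker, start_s, end_s) tuples."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] is not None:  # skip silent runs
                segments.append(
                    (labels[start],
                     start * frame_len / sample_rate,
                     i * frame_len / sample_rate)
                )
            start = i
    return segments
```

Each resulting segment could then be cut out of that participant's track and sent to their own speaker-dependent recognizer. Picking only the loudest track per frame ignores overlapping speech, which a real system would have to handle.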

In any case, once the segmentation is solved, each participant's trained speaker model is applied to their portions of the audio. At the end of the day, everyone gets a nice conversation transcript, and later on we can do fancy things like topic analysis, or maybe Big Brother wants to sift through everyone's project meetings without having to listen to hours of audio.
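
Once each participant's audio has been recognized separately, producing the attributed transcript is mostly bookkeeping. The sketch below assumes (hypothetically) that each per-speaker recognizer returns a list of `(start, end, text)` tuples, with times in seconds from the start of the call. The `Utterance` class and both function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from the start of the call
    end: float
    text: str


def merge_transcripts(per_speaker):
    """per_speaker: dict mapping speaker -> list of (start, end, text) tuples.

    Returns utterances in chronological order, with consecutive utterances
    by the same speaker joined into a single turn.
    """
    utterances = [
        Utterance(speaker, start, end, text)
        for speaker, segments in per_speaker.items()
        for start, end, text in segments
    ]
    utterances.sort(key=lambda u: (u.start, u.speaker))

    merged = []
    for u in utterances:
        if merged and merged[-1].speaker == u.speaker:
            last = merged[-1]
            merged[-1] = Utterance(
                last.speaker, last.start, max(last.end, u.end),
                last.text + " " + u.text,
            )
        else:
            merged.append(u)
    return merged


def format_transcript(utterances):
    """Render one '[MM:SS] Speaker: text' line per turn."""
    lines = []
    for u in utterances:
        minutes, seconds = divmod(int(u.start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {u.speaker}: {u.text}")
    return "\n".join(lines)
```

The plain-text output of `format_transcript` is the kind of thing that could then be indexed for search or passed to topic analysis.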

My question is, what would be a way to implement this in practice?

If you had access to just one side of a two-way call, you could take the microphone recording and the full (two-way) audio recording and subtract one from the other. Skype for Business apparently hasn't added transcription as a feature yet: https://www.skypefeedback.com/forums/299913-generally-available/suggestions/11151633-voice-to-text-transcription – pds, Feb 21 '18 at 05:47
Welcome to Stack Overflow. I'd encourage you to ask this question in the discussions section instead: https://stackoverflow.com/collectives/nlp/beta/discussions. As it stands, the question would most probably be flagged as "asking for tool/fix recommendation". – alvas, Aug 28 '23 at 11:54
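
The subtraction idea from the first comment can be sketched as follows. It assumes that the microphone recording and the two-way recording are sample-aligned and that the local voice appears in the mix scaled by a single constant gain. Neither assumption is guaranteed for a real call, where codecs, echo cancellation, and latency get in the way. Both function names are hypothetical.

```python
def estimate_gain(mixed, local):
    """Least-squares gain g that minimises sum((mixed - g * local) ** 2)."""
    n = min(len(mixed), len(local))
    num = sum(mixed[i] * local[i] for i in range(n))
    den = sum(local[i] * local[i] for i in range(n))
    return num / den if den else 0.0


def subtract_local(mixed, local, gain=None):
    """Estimate the remote party's audio as mixed - gain * local."""
    if gain is None:
        gain = estimate_gain(mixed, local)
    n = min(len(mixed), len(local))
    return [mixed[i] - gain * local[i] for i in range(n)]
```

The gain estimate is only exact when the remote speech is uncorrelated with the local speech, so on real audio treat the result as a rough separation at best.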


0 Answers