I'm working on a Discord bot, and I would like the bot to listen in a Discord voice channel and process voice commands.

I'm using an open-source speech-to-text Java library called Sphinx (https://cmusphinx.github.io/). I'm receiving audio data from the Discord server via the JDA library (https://github.com/DV8FromTheWorld/JDA).

This class (https://github.com/DV8FromTheWorld/JDA/blob/master/src/main/java/net/dv8tion/jda/core/audio/AudioReceiveHandler.java#L65) is used for receiving audio. The method handleCombinedAudio(CombinedAudio audio) is called every 20 ms, and a byte[] of the audio data can be retrieved with audio.getAudioData(double volume).

The voice recognition software requires an InputStream over a byte array to recognize the data. I have a method that concatenates the incoming byte arrays into 3-second chunks of sound, each of which is processed by the voice recognition software. The problem I've run into is a mismatch between the audio formats.
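For illustration, a minimal sketch of that buffering step. It assumes the packets arrive in Discord's 48 kHz, 16-bit stereo format (192,000 bytes per second, so 3,840 bytes per 20 ms packet); the class name `ChunkBuffer` and its `append` method are hypothetical and not part of JDA or Sphinx.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

// Hypothetical helper: collects 20 ms packets until 3 seconds of audio are buffered.
final class ChunkBuffer {
    // 48,000 frames/s * 2 channels * 2 bytes per sample = 192,000 bytes per second
    static final int BYTES_PER_SECOND = 48_000 * 2 * 2;
    static final int CHUNK_BYTES = 3 * BYTES_PER_SECOND;

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream(CHUNK_BYTES);

    /**
     * Appends one packet. Returns a stream over the buffered audio once at
     * least 3 seconds have accumulated, otherwise returns null.
     */
    public InputStream append(byte[] packet) {
        buffer.write(packet, 0, packet.length);
        if (buffer.size() < CHUNK_BYTES) {
            return null;
        }
        byte[] chunk = buffer.toByteArray();
        buffer.reset(); // start collecting the next chunk
        return new ByteArrayInputStream(chunk);
    }
}
```

The returned stream still holds audio in the source format, so it would need converting before being handed to the recognizer.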

Sphinx requires: RIFF (little-endian) WAVE audio, Microsoft PCM, 16-bit, mono, 16000 Hz.

Discord returns: 48 kHz, 16-bit, stereo, signed, big-endian PCM.

How do I convert the received byte[] array from Discord into the proper format for Sphinx?

Any ideas would be greatly appreciated. Please be specific in answers.

widavies
  • You'd have to do sample rate conversion. You could find a library which does it, or you might be able to write something yourself since the numbers work out well (16 is a factor of 48). For something like simple voice recognition you might be able to get away with just deleting 2/3 of the samples. I'm not sure. The correct way would be to run it through a low-pass filter before truncating. I explained the PCM sample format [here](https://stackoverflow.com/q/26824663/2891664) if you aren't familiar with it and want to write something yourself. – Radiodef Jun 27 '17 at 04:41
  • Does a conversion between `BigEndian` and `LittleEndian` need to take place? @Radiodef – widavies Jun 27 '17 at 13:10
  • Yeah, but that's fairly simple to do. Just reorder the byte array by swapping the two bytes of each 16-bit sample, e.g. swapping `bytes[i]` with `bytes[i+1]`. – Radiodef Jun 27 '17 at 13:12
  • Alright thanks! I think I got it figured out – widavies Jun 27 '17 at 14:04
  • I performed this same Discord -> Sphinx audio conversion in Clojure here from this thread. It may be useful to people in the future: https://github.com/Olical/snowball/blob/4672b24df120cad0f285ebd9882a9564bb9e823a/src/snowball/audio.clj#L16-L30 – Olical Aug 27 '18 at 22:40
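
The conversion discussed in the comments above could be sketched in Java roughly as follows. This is a rough illustration, not a tested production resampler: it averages the left and right channels to get mono, averages each group of three frames as a crude low-pass filter before dropping from 48 kHz to 16 kHz, and writes each sample low byte first. The class name `PcmConverter` is hypothetical. The output is raw PCM without a RIFF/WAVE header, so a header may still need to be added depending on how the recognizer is fed.

```java
// Hypothetical helper: 48 kHz 16-bit stereo big-endian PCM
// -> 16 kHz 16-bit mono little-endian PCM (raw samples, no WAV header).
final class PcmConverter {
    private static final int BYTES_PER_FRAME = 4; // 2 channels * 2 bytes
    private static final int DECIMATION = 3;      // 48000 / 16000

    public static byte[] convert(byte[] in) {
        int frames = in.length / BYTES_PER_FRAME;
        int outSamples = frames / DECIMATION;
        byte[] out = new byte[outSamples * 2];

        for (int o = 0; o < outSamples; o++) {
            int sum = 0;
            for (int k = 0; k < DECIMATION; k++) {
                int p = (o * DECIMATION + k) * BYTES_PER_FRAME;
                // Big-endian input: high byte first. The cast to short restores the sign.
                int left = (short) (((in[p] & 0xFF) << 8) | (in[p + 1] & 0xFF));
                int right = (short) (((in[p + 2] & 0xFF) << 8) | (in[p + 3] & 0xFF));
                sum += left + right;
            }
            // Average over 3 frames * 2 channels: mono mix plus a simple box filter.
            int mono = sum / (DECIMATION * 2);
            // Little-endian output: low byte first.
            out[2 * o] = (byte) (mono & 0xFF);
            out[2 * o + 1] = (byte) ((mono >> 8) & 0xFF);
        }
        return out;
    }
}
```

Each 3,840-byte packet from Discord (20 ms) would shrink to 640 bytes with this approach. A proper low-pass filter would give cleaner results than the three-frame average, as noted in the first comment.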
