
I'm trying to use TensorFlow.js speech recognition in offline mode. Online mode using the microphone works fine, but for offline mode I can't find any reliable library for converting a wav/mp3 file into a spectrogram matching the required specs: fftSize: 1024, columnTruncateLength: 232, numFramesPerSpectrogram: 43.

All the libraries I tried, like spectrogram.js, don't have those conversion options, while the TensorFlow.js speech-commands docs clearly specify the following parameters for the spectrogram tensor:

const mic = await tf.data.microphone({
  fftSize: 1024,
  columnTruncateLength: 232,
  numFramesPerSpectrogram: 43,
  sampleRateHz: 44100,
  includeSpectrogram: true,
  includeWaveform: true
});

I get the error Error: tensor4d() requires shape to be provided when values are a flat array in the following:

await recognizer.ensureModelLoaded();
var audiocaptcha = await response.buffer();
fs.writeFile("./afterverify.mp3", audiocaptcha, function (err) {
    if (err) {}
});
var bufferNewSamples = new Float32Array(audiocaptcha);

const buffersliced = bufferNewSamples.slice(0, bufferNewSamples.length - (bufferNewSamples.length % 9976));
const xtensor = tf.tensor(bufferNewSamples).reshape([-1, ...recognizer.modelInputShape().slice(1)]);

After slicing and converting to a tensor, I got this output:

output.scores
[ Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ],
  Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ],
  Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ],
  Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ],
  Float32Array [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0 ] ]
score for word '_background_noise_' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word '_unknown_' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word 'down' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word 'eight' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word 'five' = 0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
score for word 'four' = undefined
score for word 'go' = undefined
score for word 'left' = undefined
score for word 'nine' = undefined
score for word 'no' = undefined
score for word 'one' = undefined
score for word 'right' = undefined
score for word 'seven' = undefined
score for word 'six' = undefined
score for word 'stop' = undefined
score for word 'three' = undefined
score for word 'two' = undefined
score for word 'up' = undefined
score for word 'yes' = undefined
score for word 'zero' = undefined

1 Answer


The only requirement when working with offline recognition is to have an input tensor of shape [null, 43, 232, 1].
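Before reshaping, it is worth checking that the flat array even fits that shape. A minimal sketch, assuming the 43-frame by 232-bin layout quoted above (43 * 232 = 9976 values per example):

```javascript
// Sketch: verify a flat spectrogram array can be reshaped to [-1, 43, 232, 1].
// The 43-frame / 232-bin layout is an assumption taken from the specs above.
const FRAMES = 43;
const BINS = 232;

function fitsModelShape(data) {
  return data.length > 0 && data.length % (FRAMES * BINS) === 0;
}

console.log(fitsModelShape(new Float32Array(9976)));  // true  (one full example)
console.log(fitsModelShape(new Float32Array(10000))); // false (not a multiple)
```

If this returns false, `reshape([-1, 43, 232, 1])` will throw, which is what the error in the question reports.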

1 - Read the wav file and get the array of data

var Spectrogram = require('spectrogram');

var spectro = Spectrogram(document.getElementById('canvas'), {
  audio: {
    enable: false
  }
});

var audioContext = new AudioContext();

function readWavFile() {
  return new Promise(resolve => {
    var request = new XMLHttpRequest();
    request.open('GET', 'audio.mp3', true);
    request.responseType = 'arraybuffer';

    request.onload = function() {
      audioContext.decodeAudioData(request.response, function(buffer) {
        resolve(buffer);
      });
    };
    request.send();
  });
}

const buffer = await readWavFile()

The same thing can be done without the third-party library. Two options are possible:

  • Read the file using <input type="file">. In that case, this answer shows how to get the typed array.

  • Serve the wav file and read it with an HTTP request

var req = new XMLHttpRequest();
req.open("GET", "file.wav", true);
req.responseType = "arraybuffer";

req.onload = function () {
  var arrayBuffer = req.response;
  if (arrayBuffer) {
    var byteArray = new Float32Array(arrayBuffer);
  }
};

req.send(null);

2 - Convert the buffer to a typed array

const data = new Float32Array(buffer)
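Note that when `buffer` is an AudioBuffer returned by `decodeAudioData` (as in step 1), the PCM samples are exposed per channel rather than by wrapping the object itself. A sketch, with a hypothetical stand-in object so it runs outside the browser:

```javascript
// Sketch (assumption): extract mono samples from a decoded AudioBuffer.
// AudioBuffer exposes time-domain PCM data via getChannelData(channel),
// which already returns a Float32Array.
function monoSamples(audioBuffer) {
  return audioBuffer.getChannelData(0);
}

// Minimal stand-in for an AudioBuffer, for illustration only:
const fakeBuffer = { getChannelData: () => new Float32Array([0.1, -0.2, 0.3]) };
console.log(monoSamples(fakeBuffer).length); // 3
```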

3- convert the array to a tensor using the shape of the speech recognition model

const x = tf.tensor(data).reshape([-1, ...recognizer.modelInputShape().slice(1)]);

If the latter command fails, the data does not have the shape needed by the model. Either slice the tensor down to an appropriate length, or make the recording respect the fftSize and other parameters.
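The slicing mentioned above can be done on the typed array before creating the tensor. A sketch, again assuming 43 * 232 = 9976 values per example:

```javascript
// Sketch: trim a flat buffer so it reshapes cleanly to [-1, 43, 232, 1].
// 43 frames * 232 frequency bins = 9976 values per example (assumption based
// on the model shape quoted above).
const FRAME_COUNT = 43;
const BIN_COUNT = 232;
const VALUES_PER_EXAMPLE = FRAME_COUNT * BIN_COUNT; // 9976

function trimToExamples(data) {
  // Drop the trailing remainder that would not fill a whole example.
  const usable = data.length - (data.length % VALUES_PER_EXAMPLE);
  return data.slice(0, usable);
}

const trimmed = trimToExamples(new Float32Array(20000));
console.log(trimmed.length); // 19952, i.e. two full examples
```

Keep in mind that trimming only fixes the shape; if the underlying data is not a spectrogram computed with the matching fftSize, the model output will still be meaningless.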

edkeveked
  • the arraybuffer contains the signal in the time domain, not the spectrogram @edkeveked – MAS Sep 30 '20 at 13:02
  • I think that what the OP meant was to get the tensor data from the signal wav file – edkeveked Sep 30 '20 at 13:54
  • yes, this is exactly what your answer addresses, but it only partially solves his problem. The question is indicated by his title: "How to convert wav file to spectrogram for tensorflowjs with columnTruncateLength: 232 and numFramesPerSpectrogram: 43?" – MAS Oct 01 '20 at 05:50
  • Why is the answer incomplete? If it does not solve your issue, maybe you can consider opening a new thread with your question – edkeveked Oct 01 '20 at 07:38