
I'm doing some research on how to compare sound files (WAV). Basically, I want to compare stored sound files (WAV) with sound from a microphone. So in the end I would like to pre-store some voice commands of my own, and then, when my app is running, compare the pre-stored files with input from the microphone.

My thought was to allow some margin when comparing, because saying something twice in a row in exactly the same way would be difficult, I guess.

So after some googling I see that Python has a module named wave with a Wave_read object. That object has a function named readframes(n):

Reads and returns at most n frames of audio, as a string of bytes.

What do these bytes contain? I'm thinking of looping through the wave files one frame at a time, comparing them frame by frame.
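The frame-by-frame reading described above can be sketched like this. The in-memory file and its sample values are invented purely for illustration, so the example is self-contained:

```python
import io
import wave

# Build a tiny 8-bit mono WAV in memory; the sample values are arbitrary.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)                          # mono
    w.setsampwidth(1)                          # 8-bit -> 1 byte per frame
    w.setframerate(8000)
    w.writeframes(bytes([128, 200, 55, 128]))  # four frames

# Read it back one frame at a time.
buf.seek(0)
with wave.open(buf, "rb") as r:
    nframes = r.getnframes()                   # 4
    frame = r.readframes(1)                    # raw bytes of the first frame
print(nframes, frame)                          # 4 b'\x80'
```

Each call to readframes(1) here returns exactly one byte, because the file is 8-bit mono; for other formats a frame is more bytes, as the answers below discuss.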

MarianD
Jason94
    The bytes contain PCM data. Are you trying to do voice recognition? It sounds like you're in way over your head. You should research this topic. – JoshD Oct 18 '10 at 07:01
  • Ah, damn it then :) Thanks for the replies. You could call it voice recognition, but the way I thought about it was a simple file compare, which would be much simpler. In my case it would only be a matter of making the same sound, not analysing and trying to interpret words – Jason94 Oct 18 '10 at 07:10
    That's still voice recognition. Even a minor inflection or speed difference in your voice is going to give wildly different audio data so you can't just compare it frame by frame. – Soviut Oct 18 '10 at 07:11
  • Hmm... that was a bummer. Is there a python lib that does what I want then? – Jason94 Oct 18 '10 at 07:33
  • No, but there are other libraries which have Python bindings. http://pypi.python.org/pypi/speech/0.5.2 if you are on Windows. If you are not: http://en.wikipedia.org/wiki/Speech_recognition_in_Linux – Lennart Regebro Apr 03 '11 at 08:05

4 Answers


An audio frame, or sample, contains amplitude (loudness) information at that particular point in time. To produce sound, tens of thousands of frames are played in sequence each second; the variation between them produces the frequencies we hear.

In the case of CD-quality audio or uncompressed WAV audio, there are 44,100 frames/samples per second. Each of those frames contains 16 bits of resolution, allowing for a fairly precise representation of the sound levels. And because CD audio is stereo, there is actually twice as much information: 16 bits for the left channel, 16 bits for the right.
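The figures above pin down the raw data rate you would be looping over, which is worth working out before attempting a frame-by-frame comparison:

```python
# CD-quality audio data rate, from the figures above.
frames_per_second = 44_100
bytes_per_sample = 2      # 16-bit resolution
channels = 2              # stereo
data_rate = frames_per_second * bytes_per_sample * channels
print(data_rate)          # 176400 bytes per second
```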

When you use the sound module in Python to get a frame, it will be returned as a series of hexadecimal characters:

  • One character for an 8-bit mono signal.
  • Two characters for 8-bit stereo.
  • Two characters for 16-bit mono.
  • Four characters for 16-bit stereo.

In order to convert and compare these values, you'll first have to use the Python wave module's functions to check the bit depth and number of channels. Otherwise, you'll be comparing mismatched quality settings.
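A sketch of that format check, again with in-memory files built only for illustration: refuse to compare two WAVs whose channel count or sample width differ, and note how the frame size matches the list above.

```python
import io
import wave

def make_wav(nchannels, sampwidth, frames):
    """Build a small WAV in memory and return it opened for reading."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(nchannels)
        w.setsampwidth(sampwidth)
        w.setframerate(44100)
        w.writeframes(frames)
    buf.seek(0)
    return wave.open(buf, "rb")

a = make_wav(2, 2, b"\x00\x01\x00\x01")  # one 16-bit stereo frame
b = make_wav(1, 1, b"\x80")              # one 8-bit mono frame

# Only compare frame data when the quality settings match.
same_format = (a.getnchannels() == b.getnchannels()
               and a.getsampwidth() == b.getsampwidth())
print(same_format)                       # False: mismatched settings

frame = a.readframes(1)
print(len(frame))                        # 4 bytes: 2 channels x 2 bytes each
```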

Soviut
    75 frames per second? Don't you mean 44100? – corvuscorax Apr 03 '11 at 07:34
  • Yes, I originally had that (see edits) but it has been modified on me. I'm going to change it back unless whoever is editing can explain their interpretation of a frame of audio. – Soviut Apr 05 '11 at 01:17
    it might be some confusion stemming from the fact that Red Book CD players read 75 sectors from the disc per second, but that should be irrelevant for the purposes of this discussion – corvuscorax Apr 05 '11 at 09:38
  • I think python's most common "sound module" is `wave`: http://docs.python.org/2.7/library/wave.html but see http://docs.python.org/2.7/library/mm.html – n611x007 Apr 01 '13 at 07:02
    "One character for an 8-bit mono signal" => One hexadecimal character = 8 bits? Surely you need two hexadecimal characters for a signal with 8-bit resolution. – user2316667 May 19 '14 at 19:19
  • @user2316667 As I recall, audio frames are stored as bytes, they're only represented as hex by the sound modules in Python. – Soviut May 19 '14 at 22:56
  • @user2316667 "1 character" in the `char` sense, not one printed character. – Andy V Jul 01 '14 at 15:32
  • So *Frames* is another term for Samples? – Echo Dec 07 '22 at 02:46

I believe the accepted description to be slightly incorrect.

A frame appears to be somewhat like stride in graphics formats. For interleaved stereo @ 16 bits/sample, the frame size is 2*sizeof(short)=4 bytes. For non-interleaved stereo @ 16 bits/sample, the samples of the left channel are all one after another, so the frame size is just sizeof(short).
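The interleaved layout described above can be demonstrated with the struct module; the sample values are arbitrary:

```python
import struct

# One interleaved 16-bit stereo frame is 2 * sizeof(short) = 4 bytes:
# the left sample followed by the right sample.
raw = struct.pack("<hh", -1000, 2000)    # pack one frame (little-endian)
left, right = struct.unpack("<hh", raw)  # split it back into its samples
print(len(raw), left, right)             # 4 -1000 2000
```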

bobobobo

A simple byte-by-byte comparison has almost no chance of a successful match, even with some tolerance thrown in. Voice-pattern recognition is a very complex and subtle problem that is still the subject of much research.

Marcelo Cantos
    To add to this answer... the problem has a lot to do with how we typically represent audio digitally vs. how we perceive sound. We hear frequencies and their interaction. We don't directly perceive each rise and fall of a wave. Yet, when we capture audio digitally as PCM, we're just recording pressure level measurements thousands of times per second. We hear in the frequency domain, but PCM audio is in the time domain. To even begin to start to compare, we first need to run a Fourier transform to get our digital audio into the frequency domain. – Brad May 30 '16 at 00:39

The first thing you should do is a Fourier transform, to convert the data into its frequencies. This is rather complex, however. I wouldn't use voice recognition libraries here, as it sounds like you don't record only voices. You would then try different time shifts (in case the sounds are not exactly aligned) and use the one that gives you the best similarity, where you have to define a similarity function. Oh, and you should normalize both signals (same maximum loudness).
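The normalize-then-compare-spectra pipeline can be sketched in pure Python. This is a toy: the naive DFT below is far too slow for real audio (you would use numpy's FFT), the cosine similarity is just one possible similarity function, and the time-shift search is omitted. The two synthetic "recordings" differ only in loudness:

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive DFT; returns the magnitude of each frequency bin."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(signal)))
            for k in range(n // 2)]

def normalize(signal):
    """Scale so the loudest sample has magnitude 1."""
    peak = max(abs(x) for x in signal) or 1.0
    return [x / peak for x in signal]

def similarity(a, b):
    """Cosine similarity of two magnitude spectra (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Two "recordings" of the same tone, the second at half the loudness.
s1 = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
s2 = [0.5 * x for x in s1]

spec1 = dft_magnitudes(normalize(s1))
spec2 = dft_magnitudes(normalize(s2))
sim = similarity(spec1, spec2)
print(round(sim, 3))  # 1.0 after normalization
```

Without the normalize step the raw samples differ everywhere, which is exactly why a byte-for-byte compare fails even for "the same" sound.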

Konrad Höffner