
I'm trying to get the pitch from the microphone input. First I decompose the signal from the time domain to the frequency domain with an FFT, applying a Hamming window to the signal beforehand. That gives me the complex FFT results. I then pass those results to a harmonic product spectrum (HPS) step, where the spectrum gets downsampled and the downsampled spectra are multiplied together, giving a value as a complex number. What should I do next to get the fundamental frequency?

    public float[] HarmonicProductSpectrum(Complex[] data)
    {
        Complex[] hps2 = Downsample(data, 2);
        Complex[] hps3 = Downsample(data, 3);
        Complex[] hps4 = Downsample(data, 4);
        Complex[] hps5 = Downsample(data, 5);
        float[] array = new float[hps5.Length];

        for (int i = 0; i < array.Length; i++)
        {
            checked
            {
                array[i] = data[i].X * hps2[i].X * hps3[i].X * hps4[i].X * hps5[i].X;
            }
        }
        return array;
    }

    public Complex[] Downsample(Complex[] data, int n)
    {
        Complex[] array = new Complex[Convert.ToInt32(Math.Ceiling(data.Length * 1.0 / n))];
        for (int i = 0; i < array.Length; i++)
        {
            array[i].X = data[i * n].X;
        }
        return array;
    } 

I have tried to get the magnitude using,

    magnitude[i] = (float)Math.Sqrt(array[i] * array[i] + (data[i].Y * data[i].Y));  

inside the for loop in the HarmonicProductSpectrum method. Then I tried to get the maximum bin using,

        float max_mag = float.MinValue;
        float max_index = -1;

        for (int i = 0; i < array.Length / 2; i++)
            if (magnitude[i] > max_mag)
            {
                max_mag = magnitude[i];
                max_index = i;
            }

and then I tried to get the frequency using,

    var frequency = max_index * 44100 / 1024;

But I'm getting garbage values like 1248.926, 1205.859, and 2454.785 for the A4 note (440 Hz), and those values don't look like harmonics of A4.

Any help would be greatly appreciated.

Giggity
  • Can you put the audio file you're using somewhere online? I'd like to take a look at this, replicating your algorithm in a high-level language like Python, since most likely you have an algorithmic problem, but I'd like to make sure I'm using the same "A4 note" as you (I don't know anything about music). – Ahmed Fasih Aug 30 '16 at 15:13
  • Double check your microphone's sample rate in the windows settings, it could be e.g., `48,000` and not the `44,100` you have on the last line, which turns your 1249 number into 1360, which is only 8% off of a harmonic instead of 17%.. *shrug*, doesn't seem right but it's something to check. – Quantic Aug 30 '16 at 15:18
  • @AhmedFasih [A4 note](https://ufile.io/0d4f4) Thanks Ahmed :) – Giggity Aug 30 '16 at 15:25
  • @Quantic Yeah, it's 44100. I know the FFT is correct because when I get the frequency of the max bin using the magnitude, it gives 437.5 Hz and 445.312 Hz for the A4 note, which is 440 Hz. I wanted to perform HPS to get a better frequency resolution – Giggity Aug 30 '16 at 15:31
  • Quote: "Octave errors are common (detection is sometimes an octave too high). To correct, apply this rule: if the second peak amplitude below initially chosen pitch is approximately 1/2 of the chosen pitch AND the ratio of amplitudes is above a threshold (e.g., 0.2 for 5 harmonics), THEN select the lower octave peak as the pitch for the current frame". – Hans Passant Aug 30 '16 at 16:19
  • You really nailed that 440 Hz note! (The 220 Hz harmonic is about as tall as the 440.) Here’s some of my experimentation (in Matlab…) with the data using short-time Fourier transform: https://gist.github.com/fasiha/957035272009eb1c9eb370936a6af2eb – Ahmed Fasih Aug 30 '16 at 17:20
  • @AhmedFasih haha thanks.. I can filter out the harmonics with a pass filter, right? What should I do? My biggest problem is how to get the f0 from the complex value the HPS gives. I know I did something wrong when I'm taking the magnitude. – Giggity Aug 30 '16 at 17:41
  • Oh. If you have a complex number and you want to get its magnitude, you multiply it by its conjugate, i.e., `magnitude(a + j*b) = sqrt((a + j*b) * (a - j*b)) = sqrt((a*a + b*b))`. (You seem to just be multiplying the complex number by itself which gives `(a + j*b) * (a + j*b) = a*a - b*b +j * 2*a*b`, a complex number.) This seems to be what you need: [Complex.Conjugate](https://msdn.microsoft.com/en-us/library/system.numerics.complex.conjugate(v=vs.110).aspx) – Ahmed Fasih Aug 30 '16 at 17:49

2 Answers


To get a pitch estimate, you have to divide your summed bin frequency estimate by the downsampling ratio used for that sum.

Added: You should also sum the magnitudes (abs()), not take the magnitude of the complex sum.

But the harmonic product spectrum algorithm (HPS), especially when using only integer downsampling ratios, doesn't usually provide better pitch estimation resolution. Instead, it provides a more robust rough pitch estimate (less likely to be fooled by a harmonic) than a single bare FFT magnitude peak, for overtone-rich timbres that have weak or missing fundamental spectral content.

If you know how to downsample a spectrum by fractional ratios (using interpolation, etc.), you can try finer grained downsampling to get a better pitch estimate out of HPS. Or you can use an HPS result to inform you of a narrower frequency range in which to search using another pitch or frequency estimation method.
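
As an illustration (a sketch in the question's C#, not code from this answer), a fractional-ratio downsample of a magnitude spectrum could look like this; the method name and the assumption of a real-valued magnitude array are mine:

    // Sketch: downsample a magnitude spectrum by a fractional ratio
    // (e.g., 2.5) by linearly interpolating between neighboring bins.
    public static float[] DownsampleFractional(float[] mag, double ratio)
    {
        int outLength = (int)(mag.Length / ratio);
        float[] result = new float[outLength];
        for (int i = 0; i < outLength; i++)
        {
            double pos = i * ratio;                    // fractional source index
            int lower = (int)pos;
            int upper = Math.Min(lower + 1, mag.Length - 1);
            double frac = pos - lower;
            // weight the two surrounding bins by their distance from pos
            result[i] = (float)((1.0 - frac) * mag[lower] + frac * mag[upper]);
        }
        return result;
    }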

hotpaw2
  • Can you give me some sample code to convert the HPS result, which is complex, to a frequency (float)? Through audio pass filters, I can filter out the harmonics, right? – Giggity Aug 30 '16 at 17:25
  • Your magnitude() function looks correct. If you filter out what you think is the harmonics frequency band, you may end up with nothing left for some timbres and pitches. – hotpaw2 Aug 30 '16 at 18:33
  • Okay, let's say I have 437.5 Hz and 445.312 Hz. How do I interpolate those two values? It seems like I should multiply each by some value (let's say x and y), like this: (437.50*x + 445.312*y) / 2, dividing by the number of values, in this case two. Am I correct? What should the x and y values be? – Giggity Aug 30 '16 at 19:53
  • Look up parabolic interpolation and Sinc kernel interpolation (bandlimited reconstruction). – hotpaw2 Aug 30 '16 at 20:02
  • Okay, I'll try them – Giggity Aug 30 '16 at 20:12

I implemented harmonic product spectrum in Python to make sure your data and algorithm were working nicely.

Here’s what I see when applying harmonic product spectrum to the full dataset, Hamming-windowed, with 5 downsample–multiply stages:

(Figure: HPS of the full dataset)

This is just the bottom kilohertz, but the spectrum is pretty much dead above 1 kHz.

If I chunk up the long audio clip into 8192-sample chunks (with 4096-sample, 50% overlap), Hamming-window each chunk, and run HPS on it, this is the matrix of HPS results: in effect, a movie of the HPS spectrum over the entire dataset. The fundamental frequency seems to be quite stable.

(Figure: HPS per chunk, 0–500 Hz)
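
For reference, that chunking scheme could be sketched in the question's C# roughly as follows; `samples` and `RunFftAndHps` are illustrative placeholders, not part of my Python code:

    const int ChunkSize = 8192;
    const int Hop = 4096; // 50% overlap

    for (int start = 0; start + ChunkSize <= samples.Length; start += Hop)
    {
        float[] chunk = new float[ChunkSize];
        for (int i = 0; i < ChunkSize; i++)
        {
            // Hamming window: 0.54 - 0.46*cos(2*pi*n/(N-1))
            double w = 0.54 - 0.46 * Math.Cos(2 * Math.PI * i / (ChunkSize - 1));
            chunk[i] = (float)(samples[start + i] * w);
        }
        // FFT the windowed chunk, then run HPS on it; RunFftAndHps is a
        // placeholder for your existing FFT + HarmonicProductSpectrum calls.
        float[] hpsColumn = RunFftAndHps(chunk);
    }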

The full source code is linked at the end of this answer; there's a lot of code that helps chunk the data and visualize the output of HPS running on the chunks, but the core HPS function, starting at `def hps(`…, is short. But it has a couple of tricks in it.

Given the strange frequencies that you’re finding the peak at, it could be that you’re operating on the full spectrum, from 0 to 44.1 kHz? You want to keep only the “positive” frequencies, i.e., from 0 to 22.05 kHz, and apply the HPS algorithm (downsample–multiply) on that.
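
A minimal C# sketch of keeping only the positive frequencies, assuming `fft` holds the full complex FFT of a real-valued frame:

    // Keep only the positive-frequency bins: for a length-N FFT of real
    // input, these are bins 0 .. N/2 (0 Hz up to the Nyquist frequency).
    Complex[] positive = new Complex[fft.Length / 2 + 1];
    Array.Copy(fft, positive, positive.Length);
    // Run HarmonicProductSpectrum(positive) rather than on the full array.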

But assuming you start out with a positive-frequency-only spectrum and take its magnitude properly, it looks like you should get reasonable results. Try saving out the output of your HarmonicProductSpectrum to see if it’s anything like the above.

Again, the full source code is at https://gist.github.com/fasiha/957035272009eb1c9eb370936a6af2eb. (There I try out another couple of spectral estimators, Welch’s method from SciPy and my port of the Blackman-Tukey spectral estimator. I’m not sure if you are set on implementing HPS or if you would consider other pitch estimators, so I’m leaving the Welch/Blackman-Tukey results there.)


Original: I wrote this as a comment but had to keep revising it because it was confusing, so here it is as a mini-answer.

Based on my brief reading of this intro to HPS, I don’t think you’re taking the magnitudes correctly after you find the four decimated responses.

You want:

    // pseudocode: Conjugate as in System.Numerics.Complex; ".X" takes the real part
    array[i] = sqrt(data[i] * Complex.Conjugate(data[i]) *
                    hps2[i] * Complex.Conjugate(hps2[i]) *
                    hps3[i] * Complex.Conjugate(hps3[i]) *
                    hps4[i] * Complex.Conjugate(hps4[i]) *
                    hps5[i] * Complex.Conjugate(hps5[i])).X;

This uses the sqrt(x * Complex.Conjugate(x)) trick to find x’s magnitude, and then multiplies all 5 magnitudes.

(Actually, it moves the sqrt outside the product, so you only do one sqrt; that saves some time but gives the same result. So maybe that’s another trick.)

Final trick: it takes that result’s real part because sometimes due to float accuracy issues, a tiny imaginary component, like 1e-15, survives.

After you do this, array should contain just real floats, and you can apply the max-bin-finding.


If there’s no Conjugate method, then the old-fashioned way should work:

    // squared magnitude of a complex number: a*a + b*b
    public float mag2(Complex c) { return c.X * c.X + c.Y * c.Y; }

    // in HarmonicProductSpectrum
    array[i] = (float)Math.Sqrt(mag2(data[i]) * mag2(hps2[i]) * mag2(hps3[i]) * mag2(hps4[i]) * mag2(hps5[i]));

There are algebraic flaws with the two approaches you suggested in the comments below, but the above should be correct. I’m not sure what C# does when you assign a Complex to a float (maybe it uses the real component? I’d have thought that’d be a compiler error), but with the above code you’re doing the right thing with the complex data, and only assigning a float to array[i].

Ahmed Fasih
  • The problem is, Ahmed, I'm not using System.Numerics.Complex here. I'm using [NAudio.Dsp.Complex](https://github.com/SjB/NAudio/blob/master/NAudio/Dsp/Complex.cs) and it doesn't have a Conjugate function. How do I do this manually? What if I do `array[i] = data[i].X * hps2[i].X * hps3[i].X * hps4[i].X * hps5[i].X;` and then on the next line `magnitude[i] = sqrt(array[i]*array[i]);`? And don't you need the imaginary part for the magnitude? In FFT, to get the magnitude from the result we do `magnitude[i] = sqrt(result[i].X*result[i].X + result[i].Y*result[i].Y);` – Giggity Aug 30 '16 at 18:54
  • Oh sorry, I forgot to mention: with `magnitude[i] = sqrt(result[i].X*result[i].X + result[i].Y*result[i].Y);` in FFT I'm taking the conjugate manually; it's like `sqrt((real + imaginary*i) * (real - imaginary*i)) = sqrt(real^2 + imaginary^2)` because i^2 = -1. What's happening to the imaginary part in HPS when calculating the magnitude? – Giggity Aug 30 '16 at 19:34
  • See my addendum, it should do what you need. Both of your suggestions I think have algebraic issues. – Ahmed Fasih Aug 30 '16 at 19:47
  • I tried what you've said but I'm getting 2411.719 and 2454.785. How many times should I downsample, and does it really impact the end value? – Giggity Aug 31 '16 at 12:38
  • @Giggity see my edits for a visualization of the algorithm using Python. Sorry, I don’t know anything about C#, but the algorithm with your WAVE file definitely seems to work. – Ahmed Fasih Sep 01 '16 at 20:46
  • @Giggity added a lot more documentation to my Python source code (and updated plots). I’m getting nice results with this, hopefully you can find out what’s up with your C# implementation. – Ahmed Fasih Sep 02 '16 at 02:07
  • Hey, it's working now. But there's a small issue: I'm getting 437.5 Hz and 445.312 Hz for the A4 (440 Hz) note. Interpolation can solve this, right? I need a little help implementing interpolation. If I'm going to interpolate, I should take the index of the max magnitude bin, right? Then check whether the previous value is the same as the current value; if not, then what should I do? Should I do linear interpolation or cosine interpolation? – Giggity Sep 02 '16 at 08:40
  • @Giggity just zero-pad your original audio data before the FFT (that's what `Nfft` does in my `hps` function). I.e., if your data segment has 8K samples, create a new array 16K big, fill it with zeros, and copy your 8K data samples to the beginning of this big array. Take the FFT of this zero-padded array. This interpolates the spectrum in the Fourier domain much better than time-domain linear/cosine interpolation. – Ahmed Fasih Sep 02 '16 at 10:59
  • @Giggity The more zero-padding, the more interpolation. It looks like your current frequency spacing is ~7.8 Hz. If you zero-pad by 8x (i.e., zero-pad an 8K segment to be 64K long) before the FFT, your frequency spacing will drop to ~0.98 Hz, and your spectral peak should have at least 8 points describing it. If you want *even more* interpolation, then use linear/cubic interpolation on the peak (that’s what we do in radar: zero-pad to get 8x or 16x Fourier/sinc-interpolation, then linearly interpolate the rest of the way). – Ahmed Fasih Sep 02 '16 at 11:02
  • @Giggity the reason you don't want to use linear/spline interpolation on the original spectrum is that you might have just 2–3 points describing your peak, and interpolating those will be really noisy. Zero-padding the original audio, and letting the FFT perform sinc-interpolation on the spectrum, will give you "optimal" interpolation, at the cost of memory to store the bigger time-domain input & CPU to compute the longer FFT. – Ahmed Fasih Sep 02 '16 at 11:07
  • @Giggity http://www.dspguide.com/ch9/1.htm “the length of the input signal does not have to be the same as the length of the DFT. For example, a 256 point signal could be padded with zeros to make it 2048 points long. Taking a 2048 point DFT produces a frequency spectrum with 1025 samples. The added zeros don't change the shape of the spectrum, they only provide more samples in the frequency domain.” – Ahmed Fasih Sep 02 '16 at 11:09
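
A minimal C# sketch of the zero-padding approach described in the comments above; `GetWindowedChunk` is a hypothetical placeholder for however you produce a windowed chunk of audio:

    // Zero-pad an 8K windowed chunk to 64K (8x) before the FFT, so the
    // frequency bin spacing shrinks by a factor of 8.
    float[] chunk = GetWindowedChunk();           // hypothetical: 8192 samples
    float[] padded = new float[chunk.Length * 8]; // C# arrays are zero-initialized
    Array.Copy(chunk, padded, chunk.Length);
    // FFT `padded` instead of `chunk`; the new bin spacing is
    // sampleRate / padded.Length instead of sampleRate / chunk.Length.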