
Since a music file's speed and pitch are coupled together, speeding the music up also raises the pitch, and conversely, slowing it down also lowers the pitch.

However, I read that granular synthesis can decouple speed from pitch, so I'm currently trying hard to implement granular synthesis.

First of all, I think I have succeeded in implementing double speed and half speed while keeping the pitch the same. The code is as follows:

※ The grain size is 2000. It means that I use about 0.045 s of sound as one grain (2000 samples / 44100 samples per second ≈ 0.045 s ≈ 45 ms).

// get music
const $fileInput = document.createElement('input');
$fileInput.setAttribute('type', 'file');
document.body.appendChild($fileInput);

$fileInput.addEventListener('change', async (e) => {
  const music = await $fileInput.files[0].arrayBuffer();
  const actx = new (window.AudioContext || window.webkitAudioContext)({ latencyHint: 'playback', sampleRate: 44100 });
  const audioData = await actx.decodeAudioData(music);
  const original = audioData.getChannelData(0);
  const arr = [];
  const grainSize = 2000;

  // Choose either the double-speed code or the half-speed code below,
  // copy and paste audio processing code here
});
// double speed
// ex: [0,1,2,3, 4,5,6,7, 8] => [0,1, 4,5, 8] discard 2 items out of 4 items
for (let i = 0; i < original.length; i += grainSize) {
  if (original[i + (grainSize / 2) - 1] !== undefined) {
    for (let j = 0; j < grainSize / 2; j++) {
      arr.push(original[i + j]);
    }
  } else {
    for (let j = i; j < original.length; j++) {
      arr.push(original[j]); // copy the remaining tail samples (was pushing the index j by mistake)
    }
  }
}
// half speed
// ex: [0,1, 2,3, 4] => [0,1,0,0, 2,3,0,0, 4,0,0] add 'two' zeros after every 'two' items
for (let i = 0; i < original.length; i += grainSize) {
  if (original[i + grainSize - 1] !== undefined) {
    for (let j = 0; j < grainSize; j++) {
      arr.push(original[i + j]);
    }
  } else {
    for (let j = i; j < original.length; j++) {
      arr.push(original[j]);
    }
  }

  for (let j = 0; j < grainSize; j++) {
    arr.push(0);
  }
}
// play sound
const f32Arr = Float32Array.from(arr);
const audioBuffer = new AudioBuffer({ length: arr.length, numberOfChannels: 1, sampleRate: actx.sampleRate });
  
audioBuffer.copyToChannel(f32Arr, 0);

const absn = new AudioBufferSourceNode(actx, { buffer: audioBuffer });
  
absn.connect(actx.destination);
absn.start();

But the problem is, I have no idea how to implement a pitch shifter (that is, different pitch, same speed).

As I understand it, same speed means the same AudioBuffer size, so the only variable in my hands is the grain size. But I seriously don't know what I should do. It would be greatly appreciated if you could share some of your knowledge. Thank you very much!


To Phil Freihofner

Hello, thank you for the kind explanation. I tried your method. As far as I understand, it works as follows:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] // input data (10 samples)
→ [0, 2, 4, 6, 8] // double speed, 1 octave high (sampling interval: 2)
→ [0, 0, 2, 2, 4, 4, 6, 6, 8, 8] // change duration

The result sounds one octave higher with the same duration (successful pitch shifting). However, I don't know what to do if I sample the input data with a sampling interval of 1.5. What I mean is, I have no idea how to make [0, 1, 3, 4, 6, 7, 9] the same length as the input data.

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] // input data
// from
[0, 1, 3, 4, 6, 7, 9] // sampling interval: 1.5
// to
[?, ?, ?, ?, ?, ?, ?, ?, ?, ?]

Meanwhile, I learned that pitch shifting can also be achieved another way, which as far as I understand works as follows:

  • [1] make each granule start at the same position as in the original source
  • [2] play each granule at a different speed.

In addition, I found that I can achieve pitch shifting and time stretching if I transform the input data as follows:

input data = [0, 1, 2, 3, 4, 5, 6, 7]
grain size = 4

<pitch shifting>

in case of p = 2
result = [0, 2, 0, 2, 4, 6, 4, 6] // sounds like one octave high
// If I remember correctly, [0, 2, 0, 0, 4, 6, 0, 0] also works
// (and it is a better fit for the definition above ([1] and [2])),
// but the sound was not good (stuttering).
// I found that [0, 2, 0, 2, ...] sounds better.

in case of p = 1.5
result = [0, 1, 3, 0, 4, 5, 7, 4]

in case of p = 0.5
result = [0, 0, 1, 1, 4, 4, 5, 5] // sounds like one octave low
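The pitch-shifting transform above can be sketched in JavaScript. This is a minimal sketch; `pitchShift` is my own name for it, and a real implementation would also window and overlap the grains to reduce the stuttering mentioned above:

```javascript
// For each grain, read samples at speed p, wrapping around within the grain,
// so every grain starts at its original position ([1]) but plays at a
// different speed ([2]).
function pitchShift(original, grainSize, p) {
  const out = new Array(original.length);
  for (let g = 0; g < original.length; g += grainSize) {
    for (let j = 0; j < grainSize && g + j < original.length; j++) {
      const src = g + Math.floor((j * p) % grainSize); // wrap inside the grain
      out[g + j] = original[src] !== undefined ? original[src] : 0;
    }
  }
  return out;
}

// pitchShift([0, 1, 2, 3, 4, 5, 6, 7], 4, 2)   → [0, 2, 0, 2, 4, 6, 4, 6]
// pitchShift([0, 1, 2, 3, 4, 5, 6, 7], 4, 1.5) → [0, 1, 3, 0, 4, 5, 7, 4]
// pitchShift([0, 1, 2, 3, 4, 5, 6, 7], 4, 0.5) → [0, 0, 1, 1, 4, 4, 5, 5]
```

Note that the output has the same length as the input, matching the worked examples above.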

<time stretching>

in case of speed = 2
result = [0, 1, 4, 5]

in case of speed = 1.2
result = [0, 1, 2, 4, 5, 6]

in case of speed = 0.5
result = [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7]
// If I remember correctly, [0, 1, 2, 3, 0, 0, 0, 0, ...] also works,
// but the sound was not good (stuttering).

in case of speed = 0.75
result = [0, 1, 2, 3, 0, 4, 5, 6, 7, 4]
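The time-stretching transform above can be sketched similarly (again a minimal sketch with a name of my own, `timeStretch`):

```javascript
// For each grain, output floor(grainSize / speed) samples, truncating the
// grain when speed > 1 and repeating (wrapping) within it when speed < 1.
function timeStretch(original, grainSize, speed) {
  const out = [];
  const outPerGrain = Math.floor(grainSize / speed);
  for (let g = 0; g < original.length; g += grainSize) {
    for (let j = 0; j < outPerGrain; j++) {
      const src = g + (j % grainSize); // wrap inside the grain
      out.push(original[src] !== undefined ? original[src] : 0);
    }
  }
  return out;
}

// timeStretch([0, 1, 2, 3, 4, 5, 6, 7], 4, 2)    → [0, 1, 4, 5]
// timeStretch([0, 1, 2, 3, 4, 5, 6, 7], 4, 0.75) → [0, 1, 2, 3, 0, 4, 5, 6, 7, 4]
```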

Anyway, thank you for the answer.

Doohyeon Won

1 Answer

I've not read your code close enough to comment on it specifically, but I can comment on the general theory of how pitch shifting is accomplished.

The granules are usually given volume envelopes, with a fade-in and fade-out. I've seen the Hann function (Hanning Window) mentioned as a possibility. Also, the granules are overlapped, with the windowing creating a cross-fade, in effect.

Let's say a granule is 2000 frames with the windowing applied. If you make a granule every 1000 frames and play them back, overlapping, at the same spacing (every 1000 frames), you should hear the equivalent of the original sound.

Varying the playback distance between the overlapping granules is how the different time lengths of the sound are accomplished. For example, instead of playing a granule every 1000 frames, use 900 or 1100.
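This windowed overlap-add can be sketched as follows (a minimal sketch; `hannWindow` and `overlapAdd` are names of my own, and I use the periodic form of the Hann window, whose half-overlapping copies sum to exactly 1):

```javascript
// Periodic Hann window: copies overlapped at a hop of n/2 sum to exactly 1.
function hannWindow(n) {
  const w = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    w[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / n));
  }
  return w;
}

// Cut a grain every analysisHop frames, window it, and sum the grains back
// every synthesisHop frames. Equal hops reproduce the input (away from the
// edges); synthesisHop > analysisHop stretches time, smaller shrinks it.
function overlapAdd(signal, grainSize, analysisHop, synthesisHop) {
  const w = hannWindow(grainSize);
  const numGrains = Math.floor((signal.length - grainSize) / analysisHop) + 1;
  const out = new Float32Array((numGrains - 1) * synthesisHop + grainSize);
  for (let g = 0; g < numGrains; g++) {
    for (let i = 0; i < grainSize; i++) {
      out[g * synthesisHop + i] += signal[g * analysisHop + i] * w[i];
    }
  }
  return out;
}
```

With grainSize = 2000 and both hops = 1000, this plays back the original; changing synthesisHop to 900 or 1100 changes the duration as described above.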

I'm pretty sure there are factors to take into consideration concerning the size and shape of the windowing and the range of possible intervals between the granules, but I am not up on them. My simple experiments with this have been with Java and mostly work, but with some artificiality creeping into the playback.

I think consulting the Signal Processing Stack Exchange site would be a good bet for getting more info on the specifics.

EDIT: I just realized that I misread your question! You were asking about changing the pitch while retaining the length of time over which the sounds play. I don't know if this is the "best" way, but I'd consider a plan of doing this in two steps. First, change the sound to the desired pitch. Then, alter the duration of the new sound to be that of the original sound.

The first step can be done with linear interpolation. I tried to explain how to do this in a previous question. For the second step, we break the transformed wave into granules.

However, I just noticed that Spektre has an additional answer on that post that does exactly what you ask, using an FFT. This is probably a better way, but I haven't tried implementing it myself.

EDIT 2, in response to the question added to the OP:

Given PCM data for 10 frames as follows: [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] (I'm using signed floats ranging from -1 to 1. You may have to scale this to convert to your format.)

To change the pitch of the playback to 1.5x (which also changes the length), we get the following data: [0, 0.15, 0.3, 0.45, 0.6, 0.75, 0.9]

The 0.15 is a value that is halfway between points 0.1 and 0.2, arrived at by linear interpolation. If the speed were 1.25x, the data points would be as follows: [0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, ?? (depends on what follows 0.9)].

The linear interpolation calculation for index 1 in the resampled sequence is as follows: pitchShiftedAudioData[1] = originalPCM[1] * (1 - 0.25) + originalPCM[2] * 0.25;

In other words, since we land at a point that is 0.25 of the way between originalPCM[1] and originalPCM[2], the above calculates what that value would be if the data progressed linearly from index 1 to index 2.
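This resampling step can be sketched as follows (a minimal sketch; `resampleLinear` is a name of my own, not from the answer):

```javascript
// Read the signal at a fractional stride p, linearly interpolating between
// neighboring samples. p > 1 raises the pitch and shortens the data.
function resampleLinear(signal, p) {
  const out = [];
  for (let pos = 0; pos <= signal.length - 1; pos += p) {
    const i = Math.floor(pos);
    const frac = pos - i;
    const next = i + 1 < signal.length ? signal[i + 1] : signal[i];
    out.push(signal[i] * (1 - frac) + next * frac);
  }
  return out;
}

// resampleLinear([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 1.5)
// → [0, 0.15, 0.3, 0.45, 0.6, 0.75, 0.9] (up to floating-point rounding)
```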

After doing all this, there would still remain additional steps to form the pitch-shifted data into granules. One has to use a windowing function for each granule. If the window were only 10 frames long (far too short, but it illustrates the idea), a possible window might be the following: [0.01, 0.15, 0.5, 0.85, 1, 1, 0.85, 0.5, 0.15, 0.01]. (In actuality, it should follow the Hann function.)

This is applied to the data from different starting points to create the granules, where N is the starting index of a granule in the signal array:

[ signal[N] * window[0], signal[N+1] * window[1], signal[N+2] * window[2], ..., signal[N+9] * window[9] ]

To create the new signal, the resulting granules are placed sequentially, overlapping, and summed. The relative placement of the granules (how close together or far apart) determines the timing. This is my naive understanding of a brute-force way to accomplish time-shifting, and I've had some OK, not great, results.

I hope this clarifies what I was attempting to describe somewhat!

If you aren't able to follow, please consider unchecking this as the answer. Others who can provide easier-to-understand information or corrections may then participate.

Time-shifting is pretty advanced, IMHO, so expect some complicated calculations and coding (unless someone has a tool to recommend).

Phil Freihofner