Possible to reconstruct audio only with spectrogram image?

Question

I'm creating spectrograms with librosa and saving them as images. I then intend to modify the images directly (e.g. add random noise) and reconstruct the audio from the modified image.

Some research led me to examples of similar processes (see here or here), but nothing quite like what I'm trying to do, which is to take a PNG/JPG image of a spectrogram and convert it back to a usable audio file.

Here's the full code I'm using to generate the spec images:

import librosa
from librosa import display
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas

# Load the example clip and compute its STFT with a Hann window
filename = librosa.util.example_audio_file()
y, sr = librosa.load(filename)
window_size = 1024
window = np.hanning(window_size)
stft = librosa.core.spectrum.stft(y, n_fft=window_size, hop_length=512, window=window)
# Keep only the magnitude, normalized by the window sum (the phase is discarded here)
out = 2 * np.abs(stft) / np.sum(window)

# Borderless figure so the image contains only the spectrogram
fig = plt.Figure()
canvas = FigureCanvas(fig)
ax = fig.add_subplot(111)
fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
ax.axis('tight')
ax.axis('off')

# Plot in dB relative to the peak, on a log frequency axis, and save as PNG
p = librosa.display.specshow(librosa.amplitude_to_db(out, ref=np.max), ax=ax, y_axis='log', x_axis='time')
fig.savefig('spectrogram.png')

This produces the following image: spectrogram.png

But functions like librosa.istft or librosa.griffinlim expect the output of librosa.core.spectrum.stft, and I haven't been able to reverse that entire process coming from just the image file. Assuming I had this picture, is there any way to build the audio back again (even if it's lossy)? What kind of other information would be necessary, and how could I do it?

Thanks in advance.
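
For reference, the code above shows what the image alone does not carry: the dB range (amplitude_to_db with ref=np.max and its default top_db of 80 yields values between -80 and 0 dB), the colormap, the log frequency axis, the hop length, and the sample rate. The following is a minimal sketch of the pixel-to-magnitude step only. It assumes a simplified case that differs from the image saved above: a grayscale image with a linear frequency axis, where pixel values 0..255 map linearly to -80..0 dB. The function name pixels_to_magnitude is hypothetical.

```python
import numpy as np

def pixels_to_magnitude(pixels, top_db=80.0):
    """Map grayscale pixel values (0..255) back to relative STFT magnitudes.

    Assumes (not guaranteed by the original image): linear frequency axis,
    0 -> -top_db dB, 255 -> 0 dB.
    """
    pixels = np.asarray(pixels, dtype=np.float64)
    # Image row 0 is the top of the picture (highest frequency),
    # while STFT row 0 is the lowest frequency, so flip vertically.
    pixels = pixels[::-1, :]
    # Assumed linear mapping from pixel value to decibels.
    db = (pixels / 255.0 - 1.0) * top_db
    # Invert 20 * log10(amplitude). The peak becomes 1.0 because the
    # absolute scale was dropped by ref=np.max.
    return 10.0 ** (db / 20.0)

# Tiny 2x2 "image" to show the mapping.
demo = np.array([[0, 255], [255, 0]], dtype=np.uint8)
mag = pixels_to_magnitude(demo)
```

In a real attempt the resulting array would likely need resampling to n_fft / 2 + 1 rows (513 for n_fft=1024) before being passed to librosa.griffinlim with hop_length=512. Griffin-Lim only estimates the phase, so the result would be lossy at best.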

It is theoretically possible, as the Fourier transform is reversible, but it would require images with much greater resolution. The width in pixels limits the time resolution of the audio, and the height limits the frequency resolution. You might also end up with the wrong playback speed if you don't know the original audio length. In essence, your picture seems too small. Otherwise, you'd just need to add all of the different frequencies together with the right amplitudes, taken from the pixel colors, and it _might_ work. — Błażej Michalik, Oct 14 '20 at 03:30
That's very interesting Blazej, thank you for the answer. Know of any resource I could read about adding the frequencies as you mentioned? — V Begha, Oct 15 '20 at 00:22
You might want to look into [audio generation with pyaudio](https://stackoverflow.com/questions/9770073/sound-generation-synthesis-with-python) — Błażej Michalik, Oct 15 '20 at 09:59
See https://stackoverflow.com/questions/56931834/creating-wave-data-from-fft-data/57323359#57323359 — Jon Nordby, Nov 29 '20 at 19:38
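
The idea from the comments, adding sinusoids whose amplitudes come from the pixels, could be sketched roughly as below. This is an illustration under stated assumptions, not a tested reconstruction method: the input is already a magnitude array (rows are linearly spaced frequencies from 0 to sr / 2, columns are time frames), each column is held for hop samples, and all phase information is invented. The name additive_synth is hypothetical. The defaults mirror the question's hop_length=512 and librosa.load's default sample rate of 22050 Hz.

```python
import numpy as np

def additive_synth(mag, sr=22050, hop=512):
    """Sum one sinusoid per frequency row, scaled by that row's magnitudes."""
    mag = np.asarray(mag, dtype=np.float64)
    n_bins, n_frames = mag.shape
    # Assumed: rows are linearly spaced from 0 Hz up to the Nyquist frequency.
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    t = np.arange(n_frames * hop) / sr
    y = np.zeros(n_frames * hop)
    for k in range(n_bins):
        # Hold each frame's amplitude for `hop` samples (stepwise envelope).
        envelope = np.repeat(mag[k], hop)
        y += envelope * np.sin(2.0 * np.pi * freqs[k] * t)
    # Normalize to [-1, 1] so the result can be written to an audio file.
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y

# Demo: a single active row at 2000 Hz (5 rows spanning 0..4000 Hz at sr=8000).
demo_mag = np.zeros((5, 4))
demo_mag[2, :] = 1.0
y = additive_synth(demo_mag, sr=8000, hop=100)
```

Because the amplitude envelope changes in steps and the phases are arbitrary, output from a real spectrogram would probably contain audible artifacts. An iterative phase estimate such as librosa.griffinlim is the more common route.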
