The German mathematician David Hilbert described what we now call the Hilbert curve, which defines how to trace across every pixel in an image, visiting each one exactly once without skipping any, starting in one corner and ending in another. This is how you can transform a 2D image into a 1D representation.
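As a sketch of that traversal, here is the standard iterative conversion from a distance along the curve to (x, y) pixel coordinates, for a square image whose side `n` is a power of two (the function name `d2xy` and the parameter names are just conventions, not anything from this post):

```python
def d2xy(n, d):
    """Map distance d along the Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two. Walking d from 0 to n*n - 1 visits every
    cell exactly once, each step moving to an adjacent cell.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # rotate the quadrant so sub-curves join up end to end
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Iterating `d2xy(n, d)` for `d` in `range(n * n)` gives the pixel order for flattening the image onto the 1D line.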
On this resulting line you have recorded the light intensity measurement of each pixel ... now, to synthesize audio from this image, map the spectrum of human hearing from lowest to highest frequency onto this line ... at each pixel's point on the line, introduce a sine-wave oscillator generating a constant single tone at a frequency determined by its position going left to right along your line ... synthesize audio from all of these oscillators simultaneously ... this is the sound of the image
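A minimal sketch of that additive synthesis, assuming intensities already flattened onto the 1D line and normalized to 0..1 (the frequency range, sample rate, and logarithmic pitch spacing here are my own illustrative choices, not from the post):

```python
import math

def image_to_audio_sample(intensities, t, sample_rate=8000,
                          f_lo=200.0, f_hi=2000.0):
    """One output sample: the sum of sine oscillators, one per pixel.

    Oscillator i sits at a frequency spaced along [f_lo, f_hi] by its
    position on the 1D line; its amplitude is that pixel's intensity.
    """
    n = len(intensities)
    total = 0.0
    for i, amp in enumerate(intensities):
        # log spacing roughly matches perceived pitch (an assumption)
        f = f_lo * (f_hi / f_lo) ** (i / max(n - 1, 1))
        total += amp * math.sin(2 * math.pi * f * t / sample_rate)
    return total / n  # normalize so the sum stays within [-1, 1]
```

Calling this for `t = 0, 1, 2, ...` yields the audio stream, one sample per call.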
The beautiful aspect is that it goes equally well in reverse ... from arbitrary audio to its image representation ... it's not lost on me that streaming audio can get transformed into a video stream ... doing this without losing information, if you wish to go round robin (audio -> video -> audio -> ...), gets tricky: the simple method collapses the RGBA of each pixel into a single light intensity value, so to truly avoid losing information you really need a multidimensional structure instead of a simple 1D line
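One way to sketch the reverse direction is a frequency analysis of an audio window, mapping each bin's magnitude back to a pixel intensity on the Hilbert line (a naive DFT here purely for illustration; a real implementation would use an FFT, and the bin-to-pixel mapping is my assumption):

```python
import cmath

def audio_to_intensities(samples, n_pixels):
    """Naive DFT of an audio window: bin k's magnitude becomes the
    light intensity of pixel k on the 1D Hilbert line (which the
    curve then places back into 2D)."""
    N = len(samples)
    mags = []
    for k in range(n_pixels):
        acc = sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / N)
                  for t in range(N))
        mags.append(abs(acc) / N)
    peak = max(mags) or 1.0
    return [m / peak for m in mags]  # normalize intensities to 0..1
```

A pure tone lands its energy in one bin, lighting up one pixel; note the normalization step is exactly where round-robin information loss can creep in.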
Here is a video explaining this Hilbert Curve
It's a nice self-contained idea which is computationally simple enough to perform in real time for a decent image resolution at a respectable audio sample rate and bit depth ... I am actively working on this question and will report back my findings with a code repo
Regarding implementation details, there are libraries to reveal the RGBA pixel values of a supplied image ... likewise, short of writing your own Hilbert curve algorithm, there are libraries for that too ... once all of your frequency oscillators are humming away, you will need to sample all of them simultaneously to synthesize each output sample as the aggregate curve height at that point in time ... then, to actually hear the audio output, I would use the Web Audio API, feeding its event-loop memory buffer with your aggregate-curve audio ... not a trivial approach, yet very doable
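To show the aggregation step end to end without a browser, here is a sketch that sums the oscillators per sample and writes the result to a WAV file with the standard-library `wave` module (a stand-in for the Web Audio buffer feed the post suggests; the frequency range and sample rate are again my own illustrative choices):

```python
import math
import struct
import wave

def synthesize_to_wav(intensities, path, duration=0.5, sample_rate=8000,
                      f_lo=200.0, f_hi=2000.0):
    """Sum one sine oscillator per pixel intensity for each output
    sample, then write the aggregate as 16-bit mono PCM."""
    n = len(intensities)
    freqs = [f_lo * (f_hi / f_lo) ** (i / max(n - 1, 1)) for i in range(n)]
    frames = bytearray()
    for t in range(int(duration * sample_rate)):
        # aggregate curve height at this instant, normalized to [-1, 1]
        s = sum(a * math.sin(2 * math.pi * f * t / sample_rate)
                for a, f in zip(intensities, freqs)) / n
        frames += struct.pack('<h', int(s * 32767))
    with wave.open(path, 'wb') as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(bytes(frames))
```

In a browser you would instead fill an `AudioBuffer` with the same aggregate samples and play it through an `AudioBufferSourceNode`.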
The human brain is plastic enough to be trained to hear what others see, such that the blind could use this to cast away their walking canes !!! what joy ...