How to Correlate Two Audio Events (Detect if they are Similar) in Python

Question

For my project I have to detect whether two audio files are similar and when the first audio file is contained in the second. I tried to use librosa and numpy.correlate, but I don't know if I'm doing it the right way. How can I detect whether one audio file is contained in another?

import librosa
import numpy
long_audio_series, long_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\long_file.mp3")
short_audio_series, short_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\short_file.mka")

for long_stream_id, long_stream in enumerate(long_audio_series):
    for short_stream_id, short_stream in enumerate(short_audio_series):
        print(numpy.correlate(long_stream, short_stream))

What kind of audio are these events? How long is a typical event? — Jon Nordby, Aug 02 '19 at 08:39
@jonnor The long audio is 30 minutes and the short audio is 1:30 minutes — Jerry Palmiotto, Aug 02 '19 at 09:51

Hendrik · Answer 1 · 2019-08-02T14:38:22.327

Simply comparing the audio signals long_audio_series and short_audio_series probably won't work. What I'd recommend doing is audio fingerprinting, to be more precise, essentially a poor man's version of what Shazam does. There is of course the patent and the paper, but you might want to start with this very readable description. Here's the central image, the constellation map (CM), from that article:

If you don't want to scale to very many songs, you can skip the whole hashing part and concentrate on peak finding.

So what you need to do is:

Create a power spectrogram (easy with librosa.core.stft).
Find local peaks in all your files (can be done with scipy.ndimage.maximum_filter) to create CMs, i.e., 2D images containing only the peaks. The resulting CM is typically binary, i.e. containing 0 for no peak and 1 for a peak.
Slide your query CM (based on short_audio_series) over each of your database CMs (based on long_audio_series). For each time step, count how many "stars" (i.e. 1s) align and store the count along with the slide offset (essentially the position of the short audio in the long audio).
Pick the max count and return the corresponding short audio and position in the long audio. You will have to convert frame numbers back to seconds.
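
The first two steps above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the spectrogram is computed with plain numpy framing and `np.fft.rfft` instead of librosa, so that the output is already laid out as (time frame, frequency bin). The helper names, the neighborhood size and the `min_power_ratio` cutoff are illustrative assumptions, not values from the article.

```python
import numpy as np
from scipy.ndimage import maximum_filter


def power_spectrogram(y, n_fft=512, hop_length=256):
    """Return a (frames, bins) power spectrogram using a Hann window."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([y[i * hop_length:i * hop_length + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def constellation_map(spec, neighborhood=(15, 15), min_power_ratio=1e-3):
    """Binary map: 1 where a bin is the maximum of its neighborhood and loud enough."""
    # A point is a local peak if it equals the maximum of its neighborhood.
    local_max = maximum_filter(spec, size=neighborhood) == spec
    # Assumed cutoff: ignore "peaks" in near-silent regions.
    loud = spec > min_power_ratio * spec.max()
    return (local_max & loud).astype(np.uint8)


# Synthetic test signal: one second of a 1 kHz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 1000 * t)
cm = constellation_map(power_spectrogram(y))
print(cm.shape, int(cm.sum()))
```

If you compute the spectrogram with librosa.stft instead, note that it returns an array shaped (frequency bin, time frame), so you would transpose it before sliding.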

Example for the "slide" (untested sample code):

import numpy as np

scores = {}
cm_short = ...  # 2d constellation map for the short audio
cm_long = ...   # 2d constellation map for the long audio
# we assume that dim 0 is the time frame
# and dim 1 is the frequency bin
# both CMs contain only 0 or 1
frames_short = cm_short.shape[0]
frames_long = cm_long.shape[0]
# +1 so that the last possible offset is included
for offset in range(frames_long - frames_short + 1):
    cm_long_excerpt = cm_long[offset:offset+frames_short]
    score = np.sum(np.multiply(cm_long_excerpt, cm_short))
    scores[offset] = score
# TODO: find the highest score in "scores" and
# convert its offset back to seconds
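
One possible way to finish the TODO is sketched below. The function name is hypothetical, and the `hop_length` and `sample_rate` values are assumptions: they must match whatever you used when computing the spectrogram (512 and 22050 are the librosa defaults).

```python
def best_match_seconds(scores, hop_length, sample_rate):
    """Return (offset_seconds, score) for the highest-scoring frame offset."""
    best_offset = max(scores, key=scores.get)
    # One frame step corresponds to hop_length samples.
    return best_offset * hop_length / sample_rate, scores[best_offset]


# Made-up scores in the shape produced by the sliding loop:
# {frame_offset: number_of_aligned_peaks}
scores = {0: 3, 1: 5, 2: 41, 3: 4}
seconds, score = best_match_seconds(scores, hop_length=512, sample_rate=22050)
print(f"best match at {seconds:.3f}s with {score} aligned peaks")
```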

Now, if your database is large, this will lead to way too many comparisons and you will also have to implement the hashing scheme, which is also described in the article I linked to above.

Note that the described procedure only matches identical recordings, but allows for noise and slight distortion. If that is not what you want, please define similarity a little better, because that could be all kinds of things (drum patterns, chord sequence, instrumentation, ...). A classic, DSP-based way to find similarities for these features is the following: Extract the appropriate feature for short frames (e.g. 256 samples) and then compute the similarity. E.g., if harmonic content is of interest to you, you could extract chroma vectors and then calculate a distance between chroma vectors, e.g., cosine distance. When you compute the similarity of each frame in your database signal with every frame in your query signal you end up with something similar to a self similarity matrix (SSM) or recurrence matrix (RM). Diagonal lines in the SSM/RM usually indicate similar sections.
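
The frame-by-frame comparison described above can be sketched like this. It is only an illustration: random vectors stand in for real 12-bin chroma features (which you would have to extract yourself, e.g. with librosa), and scoring each diagonal by its mean is one simple choice among many.

```python
import numpy as np


def cosine_similarity_matrix(query, database, eps=1e-12):
    """query: (n_q, d), database: (n_db, d) -> (n_q, n_db) cosine similarities."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + eps)
    d = database / (np.linalg.norm(database, axis=1, keepdims=True) + eps)
    return q @ d.T


rng = np.random.default_rng(0)
database = rng.random((50, 12))   # stand-in for 50 frames of 12-bin chroma
query = database[20:30]           # the query is an excerpt starting at frame 20
sim = cosine_similarity_matrix(query, database)

# A matching section shows up as a diagonal line; score each diagonal by its mean.
n_q, n_db = sim.shape
diag_scores = [np.mean(sim[np.arange(n_q), np.arange(n_q) + k])
               for k in range(n_db - n_q + 1)]
best_start = int(np.argmax(diag_scores))
print("query best matches database frames starting at", best_start)
```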

Usually the problem is formulated as "querying a database of audio documents with a sample". If you only have one *long* file, then that's your database. Your short file is your query. — Hendrik, Aug 02 '19 at 13:36
How can I slide my CM to match it against the query? Sorry, I am a beginner in audio processing. — Jerry Palmiotto, Aug 02 '19 at 13:40
Create a CM for your long document and for your short document. Using numpy slicing, create an excerpt from the long document that is as long as your short document. Then simply `np.multiply` the two images and `np.sum` the result. That's your count. Now, to *slide*, choose a different excerpt from the long CM, shifted by one frame, and so on. — Hendrik, Aug 02 '19 at 13:45
Last question: how can I create a CM of peaks with only two audio files? — Jerry Palmiotto, Aug 02 '19 at 20:03
Each audio file must be converted to a "constellation map" (CM)—you know, it's just a metaphor. It's really just peaks in the spectrogram. — Hendrik, Aug 02 '19 at 20:06

Alejandro Garcia · Answer 2 · 2022-10-24T13:21:58.457

I guess you only need to find an offset, but either way, here is how to first measure the similarity and then find the offset of the short file within the long file.

Measuring Similarity

First you need to decode both files to PCM and make sure they share a specific sample rate, which you can choose beforehand (e.g. 16 kHz). You'll need to resample any file that has a different sample rate. A high sample rate is not required, since you need a fuzzy comparison anyway, but too low a sample rate will lose too much detail.

You can use the following commands to decode to PCM (as written they keep the source sample rate; add ffmpeg's -ar option to resample):

ffmpeg -i audio1.mkv -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -c:a pcm_s24le output2.wav
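
If you prefer to drive the conversion from Python and also force a common sample rate, a sketch could look like the following. The function names are hypothetical, and downmixing to mono with `-ac 1` is an assumption on my part (it keeps the later comparison simple); `-ar` sets the output sample rate and `-y` overwrites an existing output file.

```python
import subprocess


def build_ffmpeg_cmd(src, dst, sample_rate=16000):
    """Argument list for decoding `src` to mono 24-bit PCM WAV at `sample_rate` Hz."""
    return ["ffmpeg", "-y", "-i", src,
            "-ac", "1",                 # assumption: downmix to mono
            "-ar", str(sample_rate),    # resample to the common rate
            "-c:a", "pcm_s24le", dst]


def convert(src, dst, sample_rate=16000):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, sample_rate), check=True)


print(" ".join(build_ffmpeg_cmd("audio1.mkv", "output1.wav")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.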

Below is Python code that produces a similarity score between 0 and 1 for two audio files. It works by generating fingerprints from the audio files and comparing those fingerprints using cross-correlation.

It requires Chromaprint and FFmpeg to be installed. It also doesn't work for short audio files; if this is a problem, you can always reduce the speed of the audio like in this guide, but be aware that this adds a little noise.

# correlation.py
import subprocess
import numpy
# seconds to sample audio file for
sample_time = 500
# number of points to scan cross correlation over
span = 150
# step size (in points) of cross correlation
step = 1
# minimum number of points that must overlap in cross correlation
# exception is raised if this cannot be met
min_overlap = 20
# report match when cross correlation has a peak exceeding threshold
threshold = 0.5
# calculate fingerprint
def calculate_fingerprints(filename):
    fpcalc_out = subprocess.getoutput('fpcalc -raw -length %i %s' % (sample_time, filename))
    fingerprint_index = fpcalc_out.find('FINGERPRINT=') + 12
    # convert fingerprint to list of integers
    fingerprints = list(map(int, fpcalc_out[fingerprint_index:].split(',')))
    return fingerprints

# returns correlation between lists
def correlation(listx, listy):
    if len(listx) == 0 or len(listy) == 0:
        # Error checking in main program should prevent us from ever being
        # able to get here.     
        raise Exception('Empty lists cannot be correlated.')    
    if len(listx) > len(listy):     
        listx = listx[:len(listy)]  
    elif len(listx) < len(listy):       
        listy = listy[:len(listx)]      

    covariance = 0  
    for i in range(len(listx)):     
        covariance += 32 - bin(listx[i] ^ listy[i]).count("1")  
    covariance = covariance / float(len(listx))     
    return covariance/32  

# return cross correlation, with listy offset from listx
def cross_correlation(listx, listy, offset):
    if offset > 0:
        listx = listx[offset:]
        listy = listy[:len(listx)]
    elif offset < 0:
        offset = -offset
        listy = listy[offset:]
        listx = listx[:len(listy)]
    if min(len(listx), len(listy)) < min_overlap:
        # Overlap too small to be meaningful: score it as 0.0 (not None)
        # so that max_index can still compare the values in Python 3.
        return 0.0
    return correlation(listx, listy)

# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
    if span > min(len(listx), len(listy)):
        # Error checking in main program should prevent us from ever being
        # able to get here.
        raise Exception('span > sample size: %i > %i\n' % (span, min(len(listx), len(listy))) + 'Reduce span, reduce crop or increase sample_time.')

    corr_xy = []
    for offset in numpy.arange(-span, span + 1, step):
        corr_xy.append(cross_correlation(listx, listy, offset))
    return corr_xy

# return index of maximum value in list
def max_index(listx):
    max_index = 0   
    max_value = listx[0]    
    for i, value in enumerate(listx):       
        if value > max_value:           
            max_value = value           
            max_index = i   
    return max_index  

def get_max_corr(corr, source, target):
    max_corr_index = max_index(corr)
    max_corr_offset = -span + max_corr_index * step
    print("max_corr_index = ", max_corr_index, "max_corr_offset = ", max_corr_offset)
    # report matches
    if corr[max_corr_index] > threshold:
        print('%s and %s match with correlation of %.4f at offset %i' % (source, target, corr[max_corr_index], max_corr_offset))
    # return the offset so that callers can use it
    return max_corr_offset

def correlate(source, target):
    fingerprint_source = calculate_fingerprints(source)
    fingerprint_target = calculate_fingerprints(target)
    corr = compare(fingerprint_source, fingerprint_target, span, step)
    return get_max_corr(corr, source, target)

if __name__ == "__main__":
    import sys
    # usage: python correlation.py SOURCE_FILE TARGET_FILE
    SOURCE_FILE, TARGET_FILE = sys.argv[1], sys.argv[2]
    correlate(SOURCE_FILE, TARGET_FILE)

Code converted into python 3 from: https://shivama205.medium.com/audio-signals-comparison-23e431ed2207
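
To see what the "correlation" in that script measures, here is a small standalone sketch of the same idea: each fingerprint entry is treated as a 32-bit integer, and two entries are compared by counting the bits on which they agree. The function names and the sample values are made up for illustration, and the sketch assumes non-negative integers.

```python
def bit_similarity(a, b):
    """Fraction of the 32 bits on which two fingerprint integers agree."""
    differing_bits = bin(a ^ b).count("1")   # XOR marks the bits that differ
    return (32 - differing_bits) / 32


def mean_bit_similarity(xs, ys):
    """Average bit similarity over the overlapping part of two lists."""
    n = min(len(xs), len(ys))
    return sum(bit_similarity(x, y) for x, y in zip(xs[:n], ys[:n])) / n


a = [0b1111, 0b1010, 0xFFFFFFFF]
b = [0b1111, 0b1000, 0x00000000]
print(mean_bit_similarity(a, b))
```

Identical entries score 1.0 and fully inverted entries score 0.0, so unrelated fingerprints tend to land around 0.5, which is why the script uses a threshold of 0.5.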

Finding offset

As before, you need to decode both files to PCM and make sure they share a specific sample rate.

Again, you can use the following commands for that:

ffmpeg -i audio1.mkv -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -c:a pcm_s24le output2.wav

Then you can use the following code. It loads both files at the same sample rate, cross-correlates them (using an FFT-based method), finds the peak, and returns the corresponding offset in seconds.

Note that the code as shown does not explicitly peak-normalize the PCM data (i.e. find the maximum sample value and rescale all samples so that the sample with the largest amplitude uses the entire dynamic range of the data format). Depending on your case, you may want to add that step, which would require changing the code below a little.

import argparse

import librosa
import numpy as np
from scipy import signal


def find_offset(within_file, find_file, window):
    y_within, sr_within = librosa.load(within_file, sr=None)
    y_find, _ = librosa.load(find_file, sr=sr_within)

    c = signal.correlate(y_within, y_find[:sr_within*window], mode='valid', method='fft')
    peak = np.argmax(c)
    offset = round(peak / sr_within, 2)

    return offset


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--find-offset-of', metavar='audio file', type=str, help='Find the offset of file')
    parser.add_argument('--within', metavar='audio file', type=str, help='Within file')
    parser.add_argument('--window', metavar='seconds', type=int, default=10, help='Only use first n seconds of a target audio')
    args = parser.parse_args()
    offset = find_offset(args.within, args.find_offset_of, args.window)
    print(f"Offset: {offset}s" )


if __name__ == '__main__':
    main()

Source and further explanation: https://dev.to/hiisi13/find-an-audio-within-another-audio-in-10-lines-of-python-1866
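
The peak normalization mentioned above could be sketched as follows. This is an illustration on synthetic noise, not part of the linked article: `peak_normalize` is a hypothetical helper, and `np.correlate` is used in place of scipy's FFT-based `signal.correlate` only to keep the example small.

```python
import numpy as np


def peak_normalize(y, eps=1e-12):
    """Rescale so the largest absolute sample is 1.0 (silence is returned unchanged)."""
    peak = np.max(np.abs(y))
    return y / peak if peak > eps else y


rng = np.random.default_rng(1)
long_sig = rng.standard_normal(4000)
short_sig = 0.05 * long_sig[1200:1700]   # a much quieter copy of an excerpt

c_raw = np.correlate(long_sig, short_sig, mode="valid")
c_norm = np.correlate(peak_normalize(long_sig), peak_normalize(short_sig), mode="valid")
print(int(np.argmax(c_raw)), int(np.argmax(c_norm)))
```

Rescaling a signal only scales the correlation values, so the position of the peak does not move. Normalization matters mainly when you want to compare peak heights across different files.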

Depending on your case, you would then need to combine these two pieces of code: maybe you only want to find the offset when the audio is similar, or the other way around.
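
One way such a combination might look is sketched below. To stay self-contained it does not use Chromaprint: as a stand-in similarity score it takes the cosine similarity between the short signal and the best-matching window of the long signal, which is a different measure from the fingerprint correlation above. The function name, the threshold and the synthetic signals are all assumptions.

```python
import numpy as np


def find_offset_if_similar(long_sig, short_sig, sample_rate, threshold=0.5):
    """Return (offset_seconds or None, score); score is the normalized correlation at the peak."""
    c = np.correlate(long_sig, short_sig, mode="valid")
    peak = int(np.argmax(c))
    window = long_sig[peak:peak + len(short_sig)]
    denom = np.linalg.norm(window) * np.linalg.norm(short_sig)
    score = float(c[peak] / denom) if denom > 0 else 0.0
    # Only report an offset when the best alignment is similar enough.
    offset = peak / sample_rate if score >= threshold else None
    return offset, score


rng = np.random.default_rng(2)
sr = 1000
long_sig = rng.standard_normal(8000)
noisy_excerpt = long_sig[2500:3000] + 0.1 * rng.standard_normal(500)
unrelated = rng.standard_normal(500)

print(find_offset_if_similar(long_sig, noisy_excerpt, sr))
print(find_offset_if_similar(long_sig, unrelated, sr))
```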
