I guess you only need to find an offset, but either way, there's how to first find the similarity and then how to find the offset from the short file into the long file
Measuring Similarity
First you need to decode them into PCM and ensure it has specific sample rate, which you can choose beforehand (e.g. 16KHz). You'll need to resample songs that have different sample rate. High sample rate is not required since you need a fuzzy comparison anyway, but too low sample rate will lose too much details.
You can use the following code for that:
ffmpeg -i audio1.mkv -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -c:a pcm_s24le output2.wav
And below there's a code to get a number from 0 to 100 for the similarity from two audio files using python, it works by generating fingerprints from audio files and comparing them based out of them using cross correlation
It requires Chromaprint and FFMPEG installed, also it doesn't work for short audio files, if this is a problem, you can always reduce the speed of the audio like in this guide, be aware this is going to add a little noise.
# correlation.py
import subprocess
import numpy
# seconds to sample audio file for
sample_time = 500# number of points to scan cross correlation over
span = 150# step size (in points) of cross correlation
step = 1# minimum number of points that must overlap in cross correlation
# exception is raised if this cannot be met
min_overlap = 20# report match when cross correlation has a peak exceeding threshold
threshold = 0.5
# calculate fingerprint
def calculate_fingerprints(filename):
fpcalc_out = subprocess.getoutput('fpcalc -raw -length %i %s' % (sample_time, filename))
fingerprint_index = fpcalc_out.find('FINGERPRINT=') + 12
# convert fingerprint to list of integers
fingerprints = list(map(int, fpcalc_out[fingerprint_index:].split(',')))
return fingerprints
# returns correlation between lists
def correlation(listx, listy):
if len(listx) == 0 or len(listy) == 0:
# Error checking in main program should prevent us from ever being
# able to get here.
raise Exception('Empty lists cannot be correlated.')
if len(listx) > len(listy):
listx = listx[:len(listy)]
elif len(listx) < len(listy):
listy = listy[:len(listx)]
covariance = 0
for i in range(len(listx)):
covariance += 32 - bin(listx[i] ^ listy[i]).count("1")
covariance = covariance / float(len(listx))
return covariance/32
# return cross correlation, with listy offset from listx
def cross_correlation(listx, listy, offset):
if offset > 0:
listx = listx[offset:]
listy = listy[:len(listx)]
elif offset < 0:
offset = -offset
listy = listy[offset:]
listx = listx[:len(listy)]
if min(len(listx), len(listy)) < min_overlap:
# Error checking in main program should prevent us from ever being
# able to get here.
return
#raise Exception('Overlap too small: %i' % min(len(listx), len(listy)))
return correlation(listx, listy)
# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
if span > min(len(listx), len(listy)):
# Error checking in main program should prevent us from ever being
# able to get here.
raise Exception('span >= sample size: %i >= %i\n' % (span, min(len(listx), len(listy))) + 'Reduce span, reduce crop or increase sample_time.')
corr_xy = []
for offset in numpy.arange(-span, span + 1, step):
corr_xy.append(cross_correlation(listx, listy, offset))
return corr_xy
# return index of maximum value in list
def max_index(listx):
max_index = 0
max_value = listx[0]
for i, value in enumerate(listx):
if value > max_value:
max_value = value
max_index = i
return max_index
def get_max_corr(corr, source, target):
max_corr_index = max_index(corr)
max_corr_offset = -span + max_corr_index * step
print("max_corr_index = ", max_corr_index, "max_corr_offset = ", max_corr_offset)
# report matches
if corr[max_corr_index] > threshold:
print(('%s and %s match with correlation of %.4f at offset %i' % (source, target, corr[max_corr_index], max_corr_offset)))
def correlate(source, target):
fingerprint_source = calculate_fingerprints(source)
fingerprint_target = calculate_fingerprints(target)
corr = compare(fingerprint_source, fingerprint_target, span, step)
max_corr_offset = get_max_corr(corr, source, target)
if __name__ == "__main__":
correlate(SOURCE_FILE, TARGET_FILE)
Code converted into python 3 from: https://shivama205.medium.com/audio-signals-comparison-23e431ed2207
Finding offset
Like earlier you need to decode them into PCM and ensure it has specific sample rate.
Again you can use the following code for that:
ffmpeg -i audio1.mkv -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -c:a pcm_s24le output2.wav
Then you can use the following code, it normalizes PCM data (i.e. find maximum sample value and rescale all samples so that sample with largest amplitude uses entire dynamic range of data format) and then converts it to spectrum domain (FFT) and finds a peak using cross correlation to finally return the offset in seconds
Depending of your case, you may want to avoid normalizing PCM data, which then you would need change a litte the code below
import argparse
import librosa
import numpy as np
from scipy import signal
def find_offset(within_file, find_file, window):
y_within, sr_within = librosa.load(within_file, sr=None)
y_find, _ = librosa.load(find_file, sr=sr_within)
c = signal.correlate(y_within, y_find[:sr_within*window], mode='valid', method='fft')
peak = np.argmax(c)
offset = round(peak / sr_within, 2)
return offset
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--find-offset-of', metavar='audio file', type=str, help='Find the offset of file')
parser.add_argument('--within', metavar='audio file', type=str, help='Within file')
parser.add_argument('--window', metavar='seconds', type=int, default=10, help='Only use first n seconds of a target audio')
args = parser.parse_args()
offset = find_offset(args.within, args.find_offset_of, args.window)
print(f"Offset: {offset}s" )
if __name__ == '__main__':
main()
Source and further explanation: https://dev.to/hiisi13/find-an-audio-within-another-audio-in-10-lines-of-python-1866
Then you would need depending of your case to combine these two piece of code, maybe you only want to find the offset in cases where the audio is similar, or the other way around.