5

I want to programmatically create a SHA1 checksum of audio files (MP3, Ogg Vorbis, FLAC). The requirement is that the checksum should remain stable even if the header (e.g. ID3) changes.
Note: The audio files don't have CRCs.

This is what I have tried so far:

1) Reading + Hashing all MPEG frames using Perl and MPEG::Audio::Frame

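use Digest::SHA1;
use MPEG::Audio::Frame;

# FH is a filehandle already opened on the MP3 file (in binmode)
my $sha1 = Digest::SHA1->new;
while (my $frame = MPEG::Audio::Frame->read(\*FH)) {
    $sha1->add($frame->content());
}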

2) Decoding + Hashing all MPEG frames using Python and libmad (pymad)

import hashlib
import mad

mf = mad.MadFile(path)
sha1 = hashlib.sha1()

# Hash the decoded PCM buffers until the decoder reports end of file
while True:
    buf = mf.read()
    if buf is None:
        break
    sha1.update(buf)

3) Using mp3cat

> mp3cat - - < file.mp3 | sha1sum

However, none of these methods produced a stable checksum: in some cases the checksum changed after retagging the file with Picard.

Are there any libraries that already provide what I want?
I don't care about the programming language…

Update: I debugged the case a bit further. The libmad checksum inconsistency seems to occur when libmad reports decoding errors such as "Huffman data overrun (0x0238)". As this happens with many of the MP3 files, I'm not sure whether it really indicates a broken file…

Benedikt Waldvogel
  • FWIW: [a feature request to add safe copy option to picard](http://tickets.musicbrainz.org/browse/PICARD-450) which should include some sort of checksum. – derekv Jan 04 '16 at 22:17

6 Answers

3

If you are looking for stable hashes of the actual music, you might want to look at libOFA. Your current methods will give you different results because the formats can have embedded tags. Also, if you want two different files containing the same song to return the same hash, you need to take things like bitrate and sample frequency into account.

libOFA on the other hand can give you a stable hash that can be used between formats and different encodings. Might be what you want?

Tobias R
2

I needed tools to quickly check whether my MP3/OGG library is still valid. For MP3 I found mp3md5.py (http://snipplr.com/view/4025/mp3-checksum-in-id3-tag/), which does the job. I found no simple tool for Ogg Vorbis, so I wrote a little bash script to do this for me. Both tools should tolerate modifications of the comment/ID3 tag.

#!/bin/bash

# This bash script appends an MD5SUM to the vorbiscomment and/or verifies it if it exists
# Later modification of the vorbis comment does not alter the MD5SUM
# Julian M.K.

FILE="$1"

if [[ ! -f "$FILE" || ! -r "$FILE" || ! -w "$FILE" ]] ; then
  echo "File $FILE does not exist or is not readable or writable"
  exit 1
fi

# Note: the value hashed below is the "Total data length" string reported by
# ogginfo, not the audio data itself.
OLDCRC=$(vorbiscomment "$FILE" | grep '^CRC=' | cut -d "=" -f 2)
NEWCRC=$(ogginfo "$FILE" | grep "Total data length:" | cut -d ":" -f 2 | md5sum | cut -d " " -f 1)

if [[ "$OLDCRC" == "" ]] ; then
  echo "ADDED $FILE  $NEWCRC"
  vorbiscomment -a -t "CRC=$NEWCRC" "$FILE"
  # rewrite CRC to get proper data length, I don't know why this is necessary
  NEWCRC=$(ogginfo "$FILE" | grep "Total data length:" | cut -d ":" -f 2 | md5sum | cut -d " " -f 1)
  vorbiscomment -w -t "CRC=$NEWCRC" "$FILE"
elif [[ "$OLDCRC" == "$NEWCRC" ]]  ; then
  echo "VERIFIED $FILE"
else
  echo "FAILURE $FILE -- $OLDCRC - $NEWCRC"
fi
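
For comparison, here is a rough Python sketch that hashes the Vorbis audio packets themselves rather than a length string. It is not part of the script above. It assumes a plain, non-multiplexed Ogg Vorbis file, does not verify page CRCs, and relies on the Vorbis rule that header packets (identification, comment, setup) have an odd first byte while audio packets have the lowest bit cleared. The function names are my own.

```python
import hashlib


def iter_ogg_packets(data: bytes):
    """Yield reassembled packets from a single-stream Ogg file (no CRC check)."""
    pos = 0
    pending = bytearray()
    while pos + 27 <= len(data) and data[pos:pos + 4] == b"OggS":
        nsegs = data[pos + 26]                      # number of lacing values
        lacing = data[pos + 27:pos + 27 + nsegs]    # segment table
        body = pos + 27 + nsegs                     # start of page payload
        for seg_len in lacing:
            pending += data[body:body + seg_len]
            body += seg_len
            if seg_len < 255:                       # a value < 255 ends a packet
                yield bytes(pending)
                pending = bytearray()
        pos = body                                  # next page header


def vorbis_audio_sha1(data: bytes) -> str:
    """SHA1 over audio packets only; header packets are skipped."""
    h = hashlib.sha1()
    for packet in iter_ogg_packets(data):
        if packet and packet[0] & 1 == 0:           # even first byte: audio packet
            h.update(packet)
    return h.hexdigest()
```

Because the comment header is skipped and page headers (with their sequence numbers and CRCs) are never hashed, editing the Vorbis comment should not change the result, at least under the assumptions stated above.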
Chris
  • Am I wrong or does your code only calculate a checksum of the string that describes the file size instead of the actual music data? What do I miss? – Chris Jan 12 '16 at 21:06
1

There is an easy, stable way to do it: make a copy of the file, remove all the tags from it (e.g. using mutagen.id3), and take the hash of the resulting file.

The only disadvantage of this method is its performance.
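
A minimal sketch of the same idea without copying the file, using only the Python standard library instead of mutagen: skip a leading ID3v2 tag and a trailing ID3v1 tag in memory and hash what is left. The function names are made up for this example, and it deliberately ignores other metadata such as APEv2 or Lyrics3 tags, so treat it as an illustration rather than a complete solution.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)

    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe size
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)   # 7 useful bits per byte
        start = 10 + size                     # the size includes any padding
        if data[5] & 0x10:                    # ID3v2.4 footer flag
            start += 10

    # ID3v1: fixed 128-byte trailer that begins with "TAG"
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def audio_sha1(path: str) -> str:
    """SHA1 of the file contents with ID3v1/ID3v2 tags skipped."""
    with open(path, "rb") as f:
        return hashlib.sha1(strip_id3(f.read())).hexdigest()
```

This reads the whole file into memory, which is fine for typical MP3 sizes but is not a streaming implementation.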

  • 1
    I think the assumption you have made is not accurate. Two audio files may have the same music and no tags, but not match byte-by-byte. A simple reason may be different amounts of padding. Another may be that blocks simply have been freed but not cleared. Also note that tags are not the only kind of metadata in audio files. – brainchild Jun 06 '21 at 10:23
1

Update many years later:

See my answer here to a very similar question. It turns out that ffmpeg actually supports doing checksums of the individual streams. To get the md5 hash of only the audio stream:

ffmpeg -i "$filename" -map 0:a -codec copy -f md5 "$filename.md5"

There is also support for other hash formats with the generic -f hash format, or for doing it per frame with -f framemd5.
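
If you want SHA1 from a script, something along these lines might work as a thin Python wrapper around the generic hash muxer. This is a sketch under assumptions: ffmpeg must be on the PATH, and as far as I know the hash muxer calls SHA-1 `SHA160` (check `ffmpeg -h muxer=hash` on your build). The helper names are my own.

```python
import subprocess


def ffmpeg_hash_cmd(path: str, algo: str = "SHA160") -> list:
    """Build an ffmpeg command that hashes only the copied audio stream."""
    return ["ffmpeg", "-v", "error", "-i", path,
            "-map", "0:a", "-codec", "copy",
            "-f", "hash", "-hash", algo, "-"]


def parse_hash_line(line: str) -> tuple:
    """Split ffmpeg's 'ALGO=hexdigest' output line into (algo, digest)."""
    algo, _, digest = line.strip().partition("=")
    return algo, digest


def audio_stream_hash(path: str, algo: str = "SHA160") -> str:
    """Run ffmpeg and return the hex digest of the audio stream."""
    result = subprocess.run(ffmpeg_hash_cmd(path, algo),
                            check=True, capture_output=True, text=True)
    return parse_hash_line(result.stdout)[1]
```

Calling `audio_stream_hash("song.mp3")` before and after retagging should then give the same digest, provided the tagger leaves the audio stream alone.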


I'm trying to do the same thing. I used MD5 instead of SHA1. I started to export audio checksums using mp3tag (www.mp3tag.de/en/); then made a Perl script similar to yours to do the same thing. Then I removed all tags from my test file, and the audio checksum remained the same.

This is the script:

use MPEG::Audio::Frame;
use Digest::MD5 qw(md5_hex);
use strict;

my $file = 'E:\Music\MP3\Russensoul\01 - 5nizza , Soldat (Russensoul - Russensoul).mp3';
my $mp3tag_audio_md5 = lc '2EDFBD62995A46A45CEEC08C1F303486';

my $md5 = Digest::MD5->new;

open(FILE, '<', $file) or die "Cannot open $file : $!\n";
binmode FILE;

while(my $frame = MPEG::Audio::Frame->read(\*FILE)){
    $md5->add($frame->asbin);
}

print '$md5->hexdigest  : ', $md5->hexdigest, "\n",
      'mp3tag_audio_md5 : ', $mp3tag_audio_md5,  "\n",
      ;

Is it possible that whatever you use to modify your tags sometimes also modifies mp3 headers?

mivk
0

Bene, if I were you (and I am in the process of working on something very similar to what you want to do), I would hash the MP3 data block. Extract it to raw data first and write it out to disk, so you know what you are dealing with. Then modify the ID3 tag and hash your data again. If the hash changes, compare your two sets of raw data and find out WHERE it changed. Chances are you are overstepping a boundary somewhere. If I recall correctly, each MP3 frame starts with a sync pattern such as FF FB (0xFF followed by a byte whose top three bits are set).

I'm interested in your findings, as I'm still writing all my code to deal with the fingerprints, etc., and haven't gotten to the actual hashing yet.
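
To make the "find out WHERE it changed" step concrete, here is a small Python sketch that compares two raw dumps and reports the first differing offset, plus a helper to check whether an offset looks like a frame sync. The helper names are hypothetical, and the sync test only checks the 11 sync bits, so it can match false positives inside audio data.

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))   # one dump is a prefix of the other
    return None


def is_frame_sync(data: bytes, pos: int) -> bool:
    """True if the two bytes at pos carry the 11-bit MPEG audio sync pattern."""
    return (pos + 1 < len(data)
            and data[pos] == 0xFF
            and (data[pos + 1] & 0xE0) == 0xE0)
```

If the first difference sits at offset 0 or at the very end, the extraction is probably including tag bytes or dropping a frame at a boundary.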

GEOCHET
LarryF
0

Just buttoning up mivk's ffmpeg-based answer into a shell script:

#!/bin/bash

# Compute the MD5 of the audio stream of an MP3 file, ignoring ID3 tags.

# The problem with comparing MP3 files is that a simple change to the ID3 tags
# in one file will cause the two files to have differing MD5 sums.  This script
# avoids that problem by taking the MD5 of only the audio stream, ignoring the
# tags.

# Note that by virtue of using ffmpeg, this script happens to also work for any
# other audio file format supported by ffmpeg (not just MP3's).

set -e

stdoutf=$( mktemp mp3md5.XXXXXX )
stderrf=$( mktemp mp3md5.XXXXXX )

set +e
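# -map 0:a restricts the hash to the audio stream (e.g. embedded cover art is excluded)
ffmpeg -i "$1" -map 0:a -c:a copy -f md5 - >"$stdoutf" 2>"$stderrf"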
ret=$?
set -e

if test "$ret" -ne 0 ; then
    cat "$stderrf"
else
    sed 's/MD5=//' "$stdoutf"
fi

rm -f "$stdoutf" "$stderrf"
exit $ret