
For anyone not familiar with Verizon's SongID program, it is a free application downloadable through Verizon's VCast network. It listens to a song for 10 seconds at any point during the song, then sends this data to some all-knowing algorithmic beast that chews it up and sends you back all the ID3 tags (artist, album, song, etc.).

The first two parts and last part are straightforward, but what goes on during the processing after the recorded sound is sent?

I figure it must take the sound file (what format?), parse it (how? with what?) for some key identifiers (what are these? regular attributes of waveforms? phase shift, amplitude, etc.?), and check it against a database.

Everything I find online about how this works is something generic like what I typed above.

From audiotag.info:

This service is based on a sophisticated audio recognition algorithm combining advanced audio fingerprinting technology and a large songs' database. When you upload an audio file, it is being analyzed by an audio engine. During the analysis its audio “fingerprint” is extracted and identified by comparing it to the music database. At the completion of this recognition process, information about songs with their matching probabilities are displayed on screen.

Machavity
CheeseConQueso

1 Answer


All of these services work by taking a "fingerprint" from the sampled audio data on the client side, sending it to a server and comparing it against a fingerprint database.

One of the developers of Shazam has written an extremely informative white paper on how the technology works. This should give you all of the information that you need.
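To make the fingerprint-and-lookup idea concrete, here is a minimal sketch in Python. It is not Shazam's or Verizon's actual code, only a toy version of the general technique: find the dominant frequency in each short frame, hash pairs of nearby peaks together with their time gap, and let database hits vote on a consistent time offset. The frame size, hop, fan-out, and every function name (`dominant_bins`, `fingerprint`, `build_index`, `identify`, `synth`) are assumptions made for illustration. A real system would use a fast FFT and pick many spectrogram peaks per frame instead of one.

```python
import cmath
import math
from collections import Counter, defaultdict

FRAME = 128   # samples per analysis frame (assumed value)
HOP = 64      # step between successive frames (assumed value)
FAN_OUT = 3   # each peak is paired with this many later peaks (assumed value)

# Precomputed DFT twiddle factors for bins 1 .. FRAME/2 - 1.
# A naive DFT keeps the sketch dependency-free; use an FFT in practice.
TWIDDLE = {
    k: [cmath.exp(-2j * math.pi * k * n / FRAME) for n in range(FRAME)]
    for k in range(1, FRAME // 2)
}


def dominant_bins(samples):
    """Return the strongest frequency bin of each frame (0 for a silent frame)."""
    peaks = []
    for start in range(0, len(samples) - FRAME + 1, HOP):
        frame = samples[start:start + FRAME]
        best_bin, best_mag = 0, 0.0
        for k, row in TWIDDLE.items():
            mag = abs(sum(x * w for x, w in zip(frame, row)))
            if mag > best_mag:
                best_bin, best_mag = k, mag
        peaks.append(best_bin)
    return peaks


def fingerprint(samples):
    """Return (hash, frame_index) pairs; a hash is (bin1, bin2, frame gap)."""
    peaks = dominant_bins(samples)
    hashes = []
    for i, f1 in enumerate(peaks):
        if f1 == 0:          # skip silent frames
            continue
        for dt in range(1, FAN_OUT + 1):
            if i + dt < len(peaks) and peaks[i + dt] != 0:
                hashes.append(((f1, peaks[i + dt], dt), i))
    return hashes


def build_index(tracks):
    """Server side: map every hash to the (track name, frame index) where it occurs."""
    index = defaultdict(list)
    for name, samples in tracks.items():
        for h, t in fingerprint(samples):
            index[h].append((name, t))
    return index


def identify(index, clip):
    """Vote on (track, time offset); a true match piles up votes at one offset."""
    votes = Counter()
    for h, t_clip in fingerprint(clip):
        for name, t_track in index.get(h, ()):
            votes[(name, t_track - t_clip)] += 1
    if not votes:
        return None
    (name, _offset), _count = votes.most_common(1)[0]
    return name


def synth(note_bins, note_len=256):
    """Hypothetical test signal: one pure tone per note, centred on a DFT bin."""
    return [math.sin(2 * math.pi * k * n / FRAME)
            for i, k in enumerate(note_bins)
            for n in range(i * note_len, (i + 1) * note_len)]
```

To try it, build two synthetic "songs" with `synth`, index them with `build_index({"A": a, "B": b})`, and pass a slice of one of them to `identify`. Because the votes are keyed on the time offset as well as the track, the clip can start anywhere in the song, which is why sampling 10 seconds "at any point" is enough.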

Stu Mackellar
  • this is what i was looking for - what did you search for and where did you search for it? great stuff – CheeseConQueso May 21 '10 at 19:27
  • I read it last year. It's an area of interest for me as I work with similar technology. – Stu Mackellar May 21 '10 at 19:28
  • @Stu - (after a quick read) there is no specific mention of what environment this runs under... if you have any thoughts, add to your answer what you think the best environment would be to handle these specs – CheeseConQueso May 21 '10 at 19:30
  • Both the client and server components should almost certainly be written in C/C++ for speed. OS doesn't matter. Check out http://www.fftw.org/ for a fast and open-source FFT implementation. Remember that these algorithms are covered by several patents. – Stu Mackellar May 21 '10 at 19:36
  • also, it looks like the fingerprint is a function of time vs. frequency - both of which get distorted when the distance, location, or obstructions between the output and input sources change (doppler/phasing) - seems like it's not a good 'fingerprint' – CheeseConQueso May 21 '10 at 19:37
  • I think > 50 million Shazam users would probably beg to differ :-) http://www.shazam.com/music/web/pages/about.html – Stu Mackellar May 21 '10 at 19:39