
I have a system where video files are ingested and then multiple CPU-intensive tasks are started. Since these tasks are computationally expensive, I would like to skip processing a file if it has already been processed.

Videos come from various sources, so file names and the like are not viable options.

If I were working with pictures I would compare the MD5 hash, but on a 5 GB to 40 GB video this can take a long time to compute.

To compare two videos I am testing this method:

  • check relevant metadata matches
  • check length of file with ffmpeg / ffprobe
  • use ffmpeg to extract frames at 100 predefined timestamps [1-100]
  • create MD5 hashes of each of those frames
  • compare the MD5 hashes to check for a match
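
A rough sketch of the last three bullets in Java, assuming ffmpeg is on the PATH; the timestamps (1–100 seconds) and PNG output are placeholders, and because the hash is taken over the re-encoded frame image, it will only match when both files are decoded with the same ffmpeg build and settings:

```java
// Rough sketch of the frame-fingerprint idea above: extract one frame per
// timestamp with ffmpeg and MD5 the frame bytes. Assumes ffmpeg is on the PATH.
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

public class FrameFingerprint {

    /** MD5 of a single decoded frame at the given timestamp (in seconds). */
    static String frameHash(String videoPath, int second) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "ffmpeg", "-v", "error",
                "-ss", String.valueOf(second),    // seek to the timestamp before decoding
                "-i", videoPath,
                "-frames:v", "1",                 // extract exactly one frame
                "-f", "image2pipe", "-vcodec", "png",
                "pipe:1");                        // write the frame to stdout
        Process p = pb.start();
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = p.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        p.waitFor();
        return new BigInteger(1, md.digest()).toString(16);
    }

    /** Fingerprint = frame hashes at the 100 predefined timestamps (placeholder: 1..100 s). */
    static List<String> fingerprint(String videoPath) throws Exception {
        List<String> hashes = new ArrayList<>();
        for (int ts = 1; ts <= 100; ts++) {
            hashes.add(frameHash(videoPath, ts));
        }
        return hashes;
    }
}
```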

Does anyone know a more efficient way of doing this? Or a better way to approach the problem?

  • Hash your files, and keep track of the hashes. Here's an example: http://stackoverflow.com/questions/304268/getting-a-files-md5-checksum-in-java – Russell Uhl Jun 13 '13 at 15:11

4 Answers


I would start with file length (quick and dirty), continue with MD5 and finish with checking frames. Quick and easy.

Of course, if you get an edited file it will give you false negatives, but then it will probably also give false negatives for MD5 and probably even for the frame check; preventing false negatives caused by editing would be so computationally expensive that it is probably better to just ignore them.
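
A minimal sketch of that ordering using only the standard library: the cheap length check runs first and MD5 is computed only when the lengths match; the frame check from the question would slot in as the final, most expensive step:

```java
// Sketch of the tiered check: file length first (cheap), MD5 only on a
// length match. The frame comparison would be the final, most expensive step.
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class TieredCheck {

    static boolean probablySameFile(Path a, Path b) throws Exception {
        if (Files.size(a) != Files.size(b)) {
            return false;                 // different size: certainly different files
        }
        return md5(a).equals(md5(b));     // same size: fall back to the expensive hash
    }

    static String md5(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[1 << 16];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return new BigInteger(1, md.digest()).toString(16);
    }
}
```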

SJuan76

First, you need to properly define under which conditions two video files are considered the same. Do you mean exactly identical, as in byte-for-byte? Or do you mean identical in content? In that case you need to define a proper comparison method for the content.

I'm assuming the first (exactly identical files). This is independent of what the files actually contain. When you receive a file, always build a hash for it and store the hash along with the file.

Checking for duplicates then is a multi-step process:

1.) Compare hashes; if you find no matching hash, the file is new. In most cases of a new file you can expect this to be the only step needed, since a good hash (SHA-1 or something larger) will have few collisions for any practical number of files.

2.) If you found other files with the same hash, check the file length. If the lengths don't match, the file is new.

3.) If both hash and file length match, you have to compare the entire file contents, stopping when you find the first difference. If the full comparison turns out identical, the file is the same.

In the worst case (files are identical) this should take no longer than reading the two files at raw IO speed. In the best case (hashes differ) the test will only take as much time as the hash lookup (in a DB or HashMap or whatever you use).
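
A sketch of those three steps in Java, with an in-memory Map standing in for whatever hash store you actually use; Files.mismatch (JDK 12+) does the stop-at-first-difference compare, and on older JDKs you would compare the streams by hand:

```java
// Sketch of the three steps above: look the hash up in a catalogue of known
// files, then compare lengths, then fall back to a full byte-for-byte compare.
// The in-memory Map stands in for whatever store (DB, file) you actually use.
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateCatalogue {

    // hash (hex) -> files already ingested with that hash
    private final Map<String, List<Path>> knownByHash = new HashMap<>();

    /** Returns true if an identical file was ingested before; otherwise registers the new file. */
    boolean isDuplicate(Path incoming) throws Exception {
        String hash = sha1(incoming);                                   // step 1: hash lookup
        List<Path> candidates = knownByHash.computeIfAbsent(hash, k -> new ArrayList<>());
        for (Path known : candidates) {
            if (Files.size(known) == Files.size(incoming)               // step 2: length
                    && Files.mismatch(known, incoming) == -1L) {        // step 3: full compare
                return true;
            }
        }
        candidates.add(incoming);   // nothing identical found: remember the new file
        return false;
    }

    private static String sha1(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[1 << 16];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return new BigInteger(1, md.digest()).toString(16);
    }
}
```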

EDIT: You are concerned about the IO needed to build the hash. You may partially avoid that if you compare the file length first and skip everything else if the file length is unique. On the other hand, you then also need to keep track of which files you have already built a hash for. This would allow you to defer building the hash until you really need it. In case of a missing hash you could skip directly to comparing the two files, building the hashes in the same pass. It's a lot more state to keep track of, but it may be worth it depending on your scenario (you need solid data on how often duplicate files occur and their average size distribution to make a decision).

Durandal
  • For my application it is important that I don't miss any files; if there is a small chance of ingesting a file twice then I can live with that. In order to compare the hash I need to generate the hash on the inbound file. This surely also involves reading the whole file from disk. It seems IO may be my limiting factor here? – Hugh M Halford-Thompson Jun 14 '13 at 11:10
  • Yes, if you want to be sure you have to read the entire file at least once to build the hash. In the bad case you need (2n+1) × file-length IO (n = number of hash collisions). So this will be very IO limited; processing power needs are negligible compared to IO here. – Durandal Jun 14 '13 at 12:25

Hash your files, and keep track of the hashes. Here's an example: Getting a File's MD5 Checksum in Java

Keep in mind that, although extremely unlikely, it is mathematically possible for two different files to give the same hash. If you are dealing with an ungodly large number of files (on the order of 2^128 files), then you need a better hash algorithm, like SHA-256. But that's probably not the case here.

Russell Uhl
  • I have tried the MD5 checksum and it takes 4 minutes to hash an 8GB file. I didn't test with my 40GB files. 4 minutes of high-CPU processing per video will add considerable cost to the project. I was hoping for a faster approach. – Hugh M Halford-Thompson Jun 13 '13 at 23:08
  • @HughMHalford-Thompson that is a simple matter of finding a faster hashing algorithm. A quick search on Google reveals xxHash: https://code.google.com/p/xxhash/ However, that has far fewer bits involved, so the possibility of duplicates is far greater. That said, we are still looking at 2^32 files before a guaranteed duplicate is found. If you use this hashing algorithm, include another check (something fast, like file length) to see if they match. – Russell Uhl Jun 14 '13 at 01:46
  • @RusselUhl If you take a look at his numbers, he processed ~67 MB/s, so I presume he already hit the IO limit of the machine. A faster hashing algorithm wouldn't help the IO speed. – Durandal Jun 14 '13 at 12:28
  • @Durandal baaaahhhhhhhhh true. Good catch. I knew I should have actually run that computation. In that case, Hugh, your limiting factor is now how fast you can load information off the hard drive. For a less accurate, but much faster system, you could hash, say, only the first 100MB of the file. If you do something like this, you absolutely should include secondary and even tertiary methods of checking (total file size, etc.). – Russell Uhl Jun 14 '13 at 12:31
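
A sketch of that partial-hash idea from the comments: combine the total file size with an MD5 of only the first 100 MB. The prefix size is arbitrary, and since the guarantee is weaker it should be backed by the secondary checks mentioned above:

```java
// Sketch of the comment above: fingerprint = total file size + MD5 of only
// the first 100 MB. Much less IO than hashing the whole file, but weaker,
// so pair it with further checks before treating two files as identical.
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class PrefixFingerprint {

    private static final long PREFIX_BYTES = 100L * 1024 * 1024; // hash the first 100 MB only

    static String fingerprint(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        long remaining = PREFIX_BYTES;
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[1 << 16];
            int n;
            while (remaining > 0
                    && (n = in.read(buf, 0, (int) Math.min(buf.length, remaining))) != -1) {
                md.update(buf, 0, n);
                remaining -= n;
            }
        }
        // Combine the prefix hash with the total size to cut down on collisions.
        return Files.size(file) + ":" + new BigInteger(1, md.digest()).toString(16);
    }
}
```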

MD5 hashing is quite slow. Consider using a faster hash function, such as MurmurHash.

It has very good collision resistance and it's pretty fast.

Also, you should check the file size first, which takes almost no time and avoids unnecessary hash computations.
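
A sketch of that suggestion, assuming the Guava library is on the classpath for its Murmur3 implementation (the JDK itself ships no MurmurHash); the size check runs first so the hash is only computed when it could matter:

```java
// Sketch of the suggestion above: check the file size first, then a fast
// non-cryptographic hash (Murmur3 via Guava) instead of MD5. Assumes Guava
// is on the classpath; nothing here is part of the JDK.
import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import com.google.common.io.Files;

import java.io.File;
import java.io.IOException;

public class FastHashCheck {

    /** True if the files have the same length and the same Murmur3 hash. */
    static boolean probablySame(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;                          // cheap size check first
        }
        HashCode ha = Files.asByteSource(a).hash(Hashing.murmur3_128());
        HashCode hb = Files.asByteSource(b).hash(Hashing.murmur3_128());
        return ha.equals(hb);
    }
}
```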

loscuropresagio
  • That link notes that the page has been deprecated and that there is a newer version of [MurmurHash](https://code.google.com/p/smhasher/) – Ryaminal Jun 13 '13 at 15:17