91

What is the fastest way to create a hash function which will be used to check if two files are equal?

Security is not very important.

Edit: I am sending a file over a network connection, and will be sure that the file on both sides are equal

Jon Seigel
  • 12,251
  • 8
  • 58
  • 92
eflles
  • 6,606
  • 11
  • 41
  • 55
  • 17
    A hash function can't tell you if two files are equal. It can only tell you if two files are *not* equal. If you're only comparing two files once, faster than any hash algorithm will be simple reading the files and comparing them. – jemfinch May 17 '10 at 04:47
  • 13
    @jemfinch: the hash function is a faster way to disprove that files are the same if they are not on the same filesystem. – dotancohen Feb 17 '15 at 16:15
  • 12
    As long as the probability of the hash failing to disprove that the files are equal is less than the sum of the probabilities of all the other things that can go wrong (e.g. computer failure), then all is well. For a 256-bit hash it may be more likely that your computer turns into a cat (larger animals are very unlikely), or a bowl of petunias. – ctrl-alt-delor May 01 '17 at 09:07
  • 3
    You didn't flesh out your use cases for this question but one of them might be as follows: You want to AVOID getting a copy of a LARGE **UNCHANGED** file. Assume a local HASH of a large file and a local large file. Assume the server has a LARGE file and a current HASH for that file. You can download the **server HASH** and see if it matches the local HASH - if so, you don't have to get a new copy of the file. You can ALSO use the HASH and a local algorithm to sanity check the local LARGE file. – Steven the Easily Amused Jan 26 '21 at 00:16
  • 1
    Wow, I have never seen so many people go out of their way to **NOT** answer a question! lol ... I wanted to know this too (in my case, I downloaded a package which contains 8 .exe files with the same filesize and date, so wanted to check their contents/see if they were the same) -- your question *should* have given me what I needed, except no-one provided a usable command/instruction and so after reading a few thousand words I am sadly none the wiser, oh dear, lol! -- anyway, you have my sympathies @eflles – Martin Mar 15 '23 at 20:35

15 Answers

62

Unless you're using a really complicated and/or slow hash, loading the data from the disk is going to take much longer than computing the hash (unless you use RAM disks or top-end SSDs).

So to compare two files, use this algorithm:

  • Compare sizes
  • Compare dates (be careful here: this can give you the wrong answer; you must test whether this is the case for you or not)
  • Compare the hashes

This allows for a fast fail (if the sizes are different, you know that the files are different).

To make things even faster, you can compute the hash once and save it along with the file. Also save the file's date and size in this extra file, so you can quickly tell when you have to recompute the hash, or delete the hash file, because the main file has changed.
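A minimal sketch of that order of checks in Java, assuming SHA-256 via MessageDigest as a placeholder for whichever hash you pick and Java 17's HexFormat for printing (the class and method names are illustrative, and the hash-caching step is left out for brevity):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HexFormat;

    public class QuickCompare {

        /** True if the files are almost certainly equal: size check first, hash as tiebreaker. */
        static boolean probablyEqual(Path a, Path b) throws IOException, NoSuchAlgorithmException {
            if (Files.size(a) != Files.size(b)) {
                return false;                       // fast fail: different sizes, different files
            }
            return hashOf(a).equals(hashOf(b));     // only hash when the cheap check cannot decide
        }

        /** Streams the file through SHA-256; cache this value next to the file to avoid rework. */
        static String hashOf(Path p) throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = Files.newInputStream(p)) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) > 0) {
                    md.update(buf, 0, n);
                }
            }
            return HexFormat.of().formatHex(md.digest());
        }
    }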

Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
  • 4
    I've implemented a working solution that uses alternate data streams under NTFS to store hashes. One thing I had to do, however, was to timestamp the hash so that I could tell if the file had been modified since it had last been hashed. – Steven Sudit Feb 03 '11 at 07:44
  • 3
    Fast disks today can read at 2.5GB per second. Hashes are nowhere near that fast in my experience. – Abhi Beckert Aug 26 '17 at 23:51
  • @AbhiBeckert My argument is: If you have the hashes computed, you don't need to load the whole data set. Also my first sentence is "Unless you're using a really complicated and/or slow hash", isn't it? – Aaron Digulla Sep 04 '17 at 09:39
  • 1
    @AaronDigulla in my case, I'm wanting to check if the contents of a large list of files still match their previously calculated hash, so it needs to be re-calculated. Using sha1 and a fast SSD and a large list of files, hash calculation is pinning all my CPU cores at 100% for an hour or two, causing fans to spin up to maximum speed and clock speed to be throttled to prevent overheating and so on and so on. I came here to find a more efficient hash. I don't think sha1 is complicated or slow as far as strong hashes go, although "really" is a relative term. I tried MD5 with similar results. – Abhi Beckert Sep 05 '17 at 04:18
  • 1
    @AbhiBeckert I see. SHA and MD were designed with crypto in mind (security is more important than speed). This questions might help: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed – Aaron Digulla Sep 25 '17 at 15:47
  • 1
    @AaronDigulla I'm aware of that. My point is that you should clarify your answer since it's currently wrong. Almost all hashes are designed for crypto and they can all be significantly slower than just reading the contents from disk and doing a direct comparison of every byte. Instead of "Unless you're using a really complicated/slow hash ..." your answer should say "if you use X, Y or Z fast hash ...". Also, many of those faster hash functions have unacceptably high collision risks for any file large enough to care about speed. Perhaps add a "compare contents" step where the hashes match. – Abhi Beckert Sep 28 '17 at 06:28
  • @AbhiBeckert When I wrote that answer, SSDs were rare and not that fast. Typically, your CPU cache was 100-1000 times faster than main memory and memory was 1000 times faster than loading data from hard disk. This has changed. That said, your use case is different from this question. Sending the data over the network is very, very slow. Even a very slow hash will outrun gigabit ethernet hands down. I suggest to ask a new question. – Aaron Digulla Sep 29 '17 at 15:32
  • 1
    `xxh128sum` is so much faster for file comparison than using a cryptographic hash function like `sha1`... but I think Aaron implies that in his answer, it's just not explicit. However, for less experienced users, most pre-installed hash command line tools will be optimised for cryptographic security and not speed. xxHash is available in many repositories including Ubuntu (`sudo apt install xxhash`) and OpenBSD (`doas pkg_add -U xxhash`) – moo Mar 29 '22 at 08:03
  • @Abhi Beckert According to this link current high end desktop AMD processors can run SHA @ 64 GB/s https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/17 – paperclip optimizer Oct 22 '22 at 10:28
  • Additionally, you would start by hashing and comparing only the first (or any other) FS / HW block of each file to quickly sieve and discard the vast majority of non-matches at a minimum cost in disk bandwidth and processing power before moving on to perform more thorough comparisons on the surviving list. – paperclip optimizer Oct 22 '22 at 10:38
44

One approach might be to use a simple CRC-32 algorithm, and only if the CRC values compare equal, rerun the check with SHA-1 or something more robust. A fast CRC-32 will outperform a cryptographically secure hash any day.
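A hedged sketch of that two-stage check, using the JDK's built-in java.util.zip.CRC32 for the cheap pass and SHA-1 via MessageDigest for the confirmation (the helper names are mine, not from any particular library):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;
    import java.util.zip.CRC32;

    public class TwoStageCompare {

        /** Cheap first pass: CRC-32 of the whole file. */
        static long crc32Of(Path p) throws IOException {
            CRC32 crc = new CRC32();
            try (InputStream in = Files.newInputStream(p)) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) > 0) {
                    crc.update(buf, 0, n);
                }
            }
            return crc.getValue();
        }

        /** Stronger second pass, only run when the CRCs agree. */
        static byte[] sha1Of(Path p) throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            try (InputStream in = Files.newInputStream(p)) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) > 0) {
                    md.update(buf, 0, n);
                }
            }
            return md.digest();
        }

        static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
            if (crc32Of(a) != crc32Of(b)) {
                return false;                               // CRC mismatch: definitely different
            }
            return Arrays.equals(sha1Of(a), sha1Of(b));     // confirm with SHA-1
        }
    }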

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
  • 17
    I'd say that hashing a file is likely to be I/O-bound anyhow, so you might as well use a hash with good distribution and a large range (certainly any crypto hash qualifies). – Steven Sudit Aug 12 '10 at 01:40
  • 28
    I'm going to contradict myself here: if there are just two files of equal length, you're not going to get any faster with hashes than by direct comparison. If you have a number of files and want to find candidates for equality, a hash makes sense. – Steven Sudit Feb 03 '11 at 07:39
  • 10
    If you're comparing files over a network (as the OP is), then reading each file amounts to re-transmitting the file over the network a second time. So using some sort of hashing probably makes sense. But I would agree with using a good hashing algorithm the first time, rather than doing a preliminary CRC32 followed by something else. – Jonathan Hall Jan 20 '16 at 10:01
  • 3
    @StevenSudit it's not IO bound on a fast SSD. I've got a test file where md5 takes a minute but my SSD can read the file in just 25 seconds. And my SSD is a few years old, you can get faster ones now. – Abhi Beckert Aug 26 '17 at 22:51
  • 2
    Even if just comparing locally, if the only result needed is "equal" / "not equal", it probably still makes sense to hash, because that allows the drive/OS to read the file as fast as possible, instead of alternating chunks between 2 files. – hmijail Oct 21 '19 at 14:10
  • @Flimzy - OP *seems* to be describing running the hash *locally* on each side of the connection, which would not amount "to re-transmitting the file over the network a second time" :) – warren Nov 23 '21 at 14:21
  • 1
    @warren: I believe that 5 years ago when I wrote that, I was responding to the immediately preceding comment, which suggested that the fastest approach is simply to read each file and do a direct comparison. I believe the point I was trying to make was that due to the cost of network latency, a hash is probably more time efficient. – Jonathan Hall Nov 23 '21 at 14:40
25

xxHash purports to be quite fast and strong, collision-wise:

http://cyan4973.github.io/xxHash/

There is a 64-bit variant that runs "even faster" overall on 64-bit processors than the 32-bit one, though slower on 32-bit processors (go figure).

http://code.google.com/p/crcutil is also said to be quite fast (and leverages hardware CRC instructions where present, which are probably very fast, but if you don't have hardware that supports them, aren't as fast). Don't know if CRC32c is as good of a hash (in terms of collisions) as xxHash or not...
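If you want to experiment with hardware-assisted CRC without pulling in crcutil, the JDK has shipped java.util.zip.CRC32C since Java 9, and its implementation is typically intrinsified to the CPU's CRC32 instructions where they exist. A small illustrative sketch (this says nothing about CRC32C's collision quality versus xxHash, which is the open question above):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.CRC32C;

    public class Crc32cChecksum {

        /** CRC32C of a whole file; the JDK implementation uses CPU CRC instructions where present. */
        static long crc32cOf(Path p) throws IOException {
            CRC32C crc = new CRC32C();
            try (InputStream in = Files.newInputStream(p)) {
                byte[] buf = new byte[1 << 20];   // 1 MiB buffer; keep the loop I/O-bound
                int n;
                while ((n = in.read(buf)) > 0) {
                    crc.update(buf, 0, n);
                }
            }
            return crc.getValue();
        }
    }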

https://code.google.com/p/cityhash/ seems similar and related to crcutil [in that it can compile down to use hardware CRC32c instructions if instructed].

If you "just want the fastest raw speed" and don't care as much about quality of random distribution of the hash output (for instance, with small sets, or where speed is paramount), there are some fast algorithms mentioned here: http://www.sanmayce.com/Fastest_Hash/ (these "not quite random" distribution type algorithms are, in some cases, "good enough" and very fast). Apparently FNV1A_Jesteress is the fastest for "long" strings, some others possibly for small strings. http://locklessinc.com/articles/fast_hash/ also seems related. I did not research to see what the collision properties of these are.

Latest hotness seems to be https://github.com/erthink/t1ha and https://github.com/wangyi-fudan/wyhash and xxhash also has a slightly updated version as well.

rogerdpack
  • 62,887
  • 36
  • 269
  • 388
  • 2
    "There is a 64 bit variant that runs "even faster" on 64 bit processors than the 32, overall, though slower on 32-bit processors (go figure)." - okay, I figure the 64bit code is optimized for 64bit processors and is using 64bit long integers for chunking the hashing mechanism. – Ben Personick May 17 '19 at 21:13
  • @BenPersonick - it *would* make sense that a 64-bit version would run slower, all other things being equal, on a 32-bit processor than on a 64-bit one... the 32-bit processor will have to bust the 64-bit block size into two pieces instead of running it at once :) – warren Nov 23 '21 at 14:23
  • @warren Exactly right, that would be the case if it were possible on a 32-bit CPU, but you can't run 64-bit code on a 32-bit CPU. I believe he means that running 64-bit code on a 64-bit CPU runs faster than running a 32-bit version of the program on a 64-bit CPU. That's to be expected, since this is a data-crunching program: using the larger native 64-bit variables allows quicker work by manipulating 64-bit chunks of data instead of double the number of 32-bit chunks. :) – Ben Personick Nov 24 '21 at 01:57
  • @BenPersonick - you can run 256-bit algorithms on a 64-bit processor (e.g. SHA256). It's certainly possible to run 64-bit algorithms on a 32-bit processor (MD5's been around for a lot longer than consumer-grade 64-bit CPUs, and it's a 128-bit algorithm). It makes sense that running a "natively-sized" algorithm is going to be faster than one that's *not* natively-sized :) – warren Nov 24 '21 at 13:46
  • It's been a long time, and after skimming I think we are misunderstanding each other. There is a difference between coding for 64 bits and algorithms that use an arbitrary number of bits. My original point was that if you write code that uses 64-bit variables, it allows a 64-bit processor to use its memory and cache more efficiently; 64-bit code with 64-bit variables can't run natively on 32-bit, and code written for 32-bit variables will see no benefit running on a 64-bit architecture. – Ben Personick Mar 10 '23 at 15:22
5

You could try MurmurHash, which was specifically designed to be fast, and is pretty simple to code. You might want to add a second, more secure hash if MurmurHash returns a match though, just to be sure.
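A possible sketch of that approach using Guava's Murmur3 support (the Guava dependency and its Hashing / ByteSource API are an assumption on my part, not something this answer prescribes):

    import com.google.common.hash.HashCode;
    import com.google.common.hash.Hashing;
    import com.google.common.io.Files;

    import java.io.File;
    import java.io.IOException;

    public class MurmurCompare {

        /** 128-bit Murmur3 of a file via Guava; fast and non-cryptographic. */
        static HashCode murmurOf(File f) throws IOException {
            return Files.asByteSource(f).hash(Hashing.murmur3_128());
        }

        static boolean probablyEqual(File a, File b) throws IOException {
            // On a match you could confirm with a stronger hash (e.g. Hashing.sha256())
            // or a direct byte-by-byte comparison, as suggested above.
            return murmurOf(a).equals(murmurOf(b));
        }
    }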

int3
  • 12,861
  • 8
  • 51
  • 80
  • 1
    The OP stated that security was not a consideration here, so I'm not sure why a second hash would help. Instead, I'd suggest using one of the 64-bit variants of Murmur. – Steven Sudit Aug 12 '10 at 01:42
  • I'm going to contradict myself by suggesting that the newer 128-bit variant is better, and then contradict myself by adding that, for this use case, I'd stick with a proper crypto hash, such as SHA-256. – Steven Sudit Feb 03 '11 at 07:41
  • http://cbloomrants.blogspot.com/2010/08/08-21-10-adler32.html and http://www.strchr.com/hash_functions seem to imply that murmurhash is faster, if only slightly, than adler/crc32. It may all depend on implementation, for instance this sse version says it is a "fast" crc-like hash: http://cessu.blogspot.com/2008/11/hashing-with-sse2-revisited-or-my-hash.html – rogerdpack Jun 21 '12 at 17:56
4

What we are optimizing here is time spent on a task. Unfortunately we do not know enough about the task at hand to know what the optimal solution should be.

Is it a one-time comparison of two arbitrary files? Then compare sizes, and after that simply compare the files, byte by byte (or MB by MB), whichever is better for your I/O.

If it is for two large sets of files, or many sets of files, and it is not a one-time exercise but something that will happen frequently, then you should store a hash for each file. A hash is never unique, but a hash of about 9 digits (32 bits) would be good for roughly 4 billion combinations, and a 64-bit number would be good enough to distinguish between some 16 quintillion (16 * 10^18) different files.

A decent compromise would be to generate two 32-bit hashes for each file, one for the first 8 KB and another for 1 MB + 8 KB, and slap them together as a single 64-bit number. Cataloging all existing files into a DB should be fairly quick, and looking up a candidate file against this DB should also be very quick. Once there is a match, the only way to determine whether they are the same is to compare the whole files.
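A rough sketch of that sampled 64-bit key, interpreting "1 MB + 8 KB" as an 8 KB sample taken at the 1 MB offset (that interpretation, and the helper names, are mine); a key match still has to be confirmed by comparing the whole files, as noted above:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.zip.CRC32;

    public class SampledFileKey {

        private static final int SAMPLE = 8 * 1024;             // 8 KB per sample
        private static final long SECOND_OFFSET = 1024 * 1024;  // second sample at the 1 MB mark

        /** 64-bit key: CRC32 of the first 8 KB in the high half, CRC32 of an 8 KB sample near 1 MB in the low half. */
        static long keyOf(Path p) throws IOException {
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                long head = crcOfRegion(ch, 0);
                long mid = crcOfRegion(ch, Math.min(SECOND_OFFSET, Math.max(0, ch.size() - SAMPLE)));
                return (head << 32) | mid;                      // "slap them together" into one long
            }
        }

        private static long crcOfRegion(FileChannel ch, long offset) throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(SAMPLE);
            int n = ch.read(buf, offset);                       // positional read, does not move the channel
            CRC32 crc = new CRC32();
            if (n > 0) {
                buf.flip();
                crc.update(buf);
            }
            return crc.getValue();
        }
    }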

I am a believer in giving people what they need, which is not always what they think they need, or what they want.

bjorn
  • 41
  • 1
3

If it's only a one-off then, given that you'll have to read both files to generate a hash of each of them, why not just read through a small amount of each at a time and compare?

Failing that, CRC is a very simple algorithm.
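A sketch of that read-a-little-and-compare loop (on Java 12+ the single call Files.mismatch(a, b) == -1 does the same job):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;

    public class ChunkedCompare {

        /** Reads both files a chunk at a time and stops at the first difference. */
        static boolean sameContent(Path a, Path b) throws IOException {
            if (Files.size(a) != Files.size(b)) {
                return false;                                 // cheap size check first
            }
            try (InputStream ia = Files.newInputStream(a);
                 InputStream ib = Files.newInputStream(b)) {
                byte[] bufA = new byte[64 * 1024];
                byte[] bufB = new byte[64 * 1024];
                int n;
                while ((n = ia.readNBytes(bufA, 0, bufA.length)) > 0) {
                    int m = ib.readNBytes(bufB, 0, bufB.length);
                    if (m != n || !Arrays.equals(bufA, 0, n, bufB, 0, m)) {
                        return false;                         // bail out as soon as a chunk differs
                    }
                }
                return true;
            }
        }
    }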

Jeff Foster
  • 43,770
  • 11
  • 86
  • 103
  • +1 for CRC, since the OP asked for "fastest". Of course, then he asked for "making sure the files are the same" which contradicts itself LOL. – rogerdpack Jun 21 '12 at 17:27
  • @rogerdpack crc isn't close to fastest hash, even with asm. – OneOfOne Aug 14 '14 at 00:00
  • 1
    @OneOfOne true I believe I didn't realize that at the time. These days I recommend xxhash or cityhash, see my other answer here http://stackoverflow.com/a/11422479/32453 [apparently with crc32c it can compile down to a CPU instruction which is very fast...though that's not what I was referring to initially here I don't think so your comment is right] – rogerdpack Aug 15 '14 at 16:05
3

For this type of application, Adler32 is probably the fastest algorithm with a reasonable level of security. For bigger files, you may calculate multiple hash values, for example one per 5 MB block of the file, hence decreasing the chance of errors (i.e. of cases where the hashes are the same yet the file contents differ). Furthermore, this multi-hash setup allows the hash calculation to be implemented in a multi-threaded fashion.

Edit: (Following Steven Sudit's remark)
A word of caution if the files are small!
Adler32's "cryptographic" properties, or rather its weaknesses are well known particularly for short messages. For this reason the solution proposed should be avoided for files smaller than than a few kilobytes.
Never the less, in the question, the OP explicitly seeks a fast algorithm and waives concerns about security. Furthermore the quest for speed may plausibly imply that one is dealing with "big" files rather than small ones. In this context, Adler32, possibly applied in parallel for files chunks of say 5Mb remains a very valid answer. Alder32 is reputed for its simplicity and speed. Also, its reliability, while remaining lower than that of CRCs of the same length, is quite acceptable for messages over 4000 bytes.
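One possible shape of that per-block scheme, using the JDK's java.util.zip.Adler32 (the 5 MB block size follows the suggestion above; since blocks are independent, the loop could be handed to a thread pool):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.Adler32;

    public class BlockwiseAdler32 {

        private static final int BLOCK = 5 * 1024 * 1024;    // 5 MB blocks, as suggested above

        /** One Adler-32 value per block read (up to 5 MB each); two files (probably) match if the lists are equal. */
        static List<Long> blockChecksums(Path p) throws IOException {
            List<Long> sums = new ArrayList<>();
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocate(BLOCK);
                while (ch.read(buf) > 0) {
                    buf.flip();
                    Adler32 adler = new Adler32();           // fresh checksum for each block
                    adler.update(buf);
                    sums.add(adler.getValue());
                    buf.clear();
                }
            }
            return sums;
        }
    }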

mjv
  • 73,152
  • 14
  • 113
  • 156
  • I would not recommend Adler32 for any purpose. It has terrible characteristics, particularly for short files. – Steven Sudit Aug 12 '10 at 01:43
  • There are faster algorithms that are nonetheless much better. MurmurHash3 comes to mind, but for this use case, I'd suggest that I/O speed is the limit so SHA-256 would be good. – Steven Sudit Feb 03 '11 at 07:42
  • (Also, please use the comment option instead of editing your remark, else I'll only know about your response if I get lucky.) – Steven Sudit Feb 03 '11 at 07:43
  • apparently adler32 is "bad for numbers" http://www.strchr.com/hash_functions but CRC32 is ok, at least distribution wise. – rogerdpack Jun 21 '12 at 17:50
2

In any case, you have to read each file fully (except when the sizes mismatch), so just read both files and compare them block by block.

Using a hash only costs CPU and gains nothing. Since you do not write anything, the OS cache will effectively drop the data you read, so under Linux, just use the cmp tool.

socketpair
  • 1,893
  • 17
  • 15
2

Why do you want to hash it?

If you want to make sure that two files are equal then by definition you will have to read the entire file (unless they are literally the same file, in which case you can tell by looking at meta-data on the file system). Anyways, no reason to hash, just read over them and see if they are the same. Hashing will make it less efficient. And even if the hashes match, you still aren't sure if the files really are equal.

Edit: This answer was posted before the question specified anything about a network. It just asked about comparing two files. Now that I know there is a network hop between the files, I would say just use an MD5 hash and be done with it.
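For completeness, a minimal way to compute that MD5 on each side of the connection (DigestInputStream and MessageDigest are standard JDK; OutputStream.nullOutputStream needs Java 11 and HexFormat needs Java 17), then compare the two hex strings:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HexFormat;

    public class Md5OfFile {

        /** MD5 of a file, computed in one streaming pass. */
        static String md5Of(Path p) throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (InputStream in = new DigestInputStream(Files.newInputStream(p), md)) {
                in.transferTo(OutputStream.nullOutputStream());  // read everything, digesting as we go
            }
            return HexFormat.of().formatHex(md.digest());
        }
    }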

tster
  • 17,883
  • 5
  • 53
  • 72
  • 4
    I am sending a file over a network connection, and will be sure that the file on both sides are equal. – eflles Nov 19 '09 at 07:59
  • 3
    Oh, well in that case just use a real hash algorithm. I guarantee your network will be slower than the hash. – Greg Hewgill Nov 19 '09 at 08:01
  • In such a case, use an already existing hash function. Greg, posted some good examples. – tster Nov 19 '09 at 08:01
2

Background:

For file comparison, using a cryptographic-grade hash function such as MD5, SHA-1, SHA-2, SHA-3 etc will be very slow as these tools are optimised for good statistical and security properties over speed.

Unfortunately, many of the tools that some users will be familiar with use the cryptographic hash functions as this is probably the most widespread use of hashing from a users' perspective. So although you can use openssl dgst or sha1, sha256 etc to compare files, it will be very slow. This is especially the case for a large directory of large files, which also happens to be a very typical use case!

For internal applications, you may not be interested in cryptographic properties. Specifically, if you are worried that an adversary may be trying to create a collision on purpose, then you should stick with one of the above algorithms (and avoid MD5 or SHA-1 which are broken).

Benchmarking hash functions:

The SMhasher website has some benchmarks which aid direct performance comparison, plus notes on weaknesses, if you have specific needs.

Good compromise:

xxHash is very fast (at the expense of security) and is perfect for file comparison tasks internally, when security is not a concern. Binaries are widely available, and these include command line utilities.

Optimisation: You only need to run the hash on files of the same size: https://unix.stackexchange.com/questions/339491/find-a-file-by-hash

Example use case:

I want to check a large directory of photos to see if some duplicated files have crept in. I'm not integrating with the outside world and, in my use case, there is no chance that a malicious actor will try to add a non-duplicate photo with an identical hash (known as a collision).

Installation:

xxHash is available in many distributions' repositories. To install on Debian-based distributions (including Ubuntu):

sudo apt update && sudo apt install xxhash

On OpenBSD:

doas pkg_add -U xxhash

Or from github

Get unique hashes for a full directory of files:

The xxh128sum command line tool should now be available to you. You can combine this with the find command to look for duplicated files:

find . -type f -exec xxh128sum {} \; > hashes.txt

Find duplicates:

You now have a file of hashes and filenames that can be used to find duplicates. List just the filename of each duplicate after the first one found:

awk 'visited[$1]++ { print $2 }' hashes.txt

moo
  • 1,597
  • 1
  • 14
  • 29
1

The following is code to find duplicate files, taken from my personal project for sorting pictures, which also removes duplicates. In my experience, first using a fast hashing algorithm like CRC32 and then doing MD5 or SHA-1 was even slower and made no improvement, since most of the files with the same size were indeed duplicates, so running the hashing twice was more expensive in CPU time. This approach may not be right for every type of project, but it is definitely true for image files. Here I am doing MD5 or SHA-1 hashing only on files with the same size.

PS: It depends on Apache commons codec to generate hash efficiently.

Sample usage: new DuplicateFileFinder("MD5").findDuplicateFilesList(filesList);

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    import org.apache.commons.codec.digest.DigestUtils;

    /**
     * Finds the duplicate files using md5/sha1 hashing, which is used only for the sizes which are of same size.
     *  
     * @author HemantSingh
     *
     */
    public class DuplicateFileFinder {

        private HashProvider hashProvider;
        // Used only for logging purpose.
        private String hashingAlgo;

        public DuplicateFileFinder(String hashingAlgo) {
            this.hashingAlgo = hashingAlgo;
            if ("SHA1".equalsIgnoreCase(hashingAlgo)) {
                hashProvider = new Sha1HashProvider();
            } else if ("MD5".equalsIgnoreCase(hashingAlgo)) {
                hashProvider = new Md5HashProvider();
            } else {
                throw new RuntimeException("Unsupported hashing algorithm:" + hashingAlgo + " Please use either SHA1 or MD5.");
            }
        }

        /**
         * This API returns the list of duplicate files reference.
         * 
         * @param files
         *            - List of all the files which we need to check for duplicates.
         * @return It returns the list which contains list of duplicate files for
         *         e.g. if a file a.JPG have 3 copies then first element in the list
         *         will be list with three references of File reference.
         */
        public List<List<File>> findDuplicateFilesList(List<File> files) {
            // First create the map for the file size and file reference in the array list.
            Map<Long, List<File>> fileSizeMap = new HashMap<Long, List<File>>();
            List<Long> potDuplicateFilesSize = new ArrayList<Long>();

            for (Iterator<File> iterator = files.iterator(); iterator.hasNext();) {
                File file = (File) iterator.next();
                Long fileLength = new Long(file.length());
                List<File> filesOfSameLength = fileSizeMap.get(fileLength);
                if (filesOfSameLength == null) {
                    filesOfSameLength = new ArrayList<File>();
                    fileSizeMap.put(fileLength, filesOfSameLength);
                } else {
                    potDuplicateFilesSize.add(fileLength);
                }
                filesOfSameLength.add(file);
            }

            // If we don't have any potential duplicates then skip further processing.
            if (potDuplicateFilesSize.size() == 0) {
                return null;
            }

            System.out.println(potDuplicateFilesSize.size() + " files will go thru " + hashingAlgo + " hash check to verify if they are duplicate.");

            // Now we will scan the potential duplicate files, and eliminate false positives using md5 hash check.
            List<List<File>> finalListOfDuplicates = new ArrayList<List<File>>();
            for (Iterator<Long> potDuplicatesFileSizeIterator = potDuplicateFilesSize
                    .iterator(); potDuplicatesFileSizeIterator.hasNext();) {
                Long fileSize = (Long) potDuplicatesFileSizeIterator.next();
                List<File> potDupFiles = fileSizeMap.get(fileSize);
                Map<String, List<File>> trueDuplicateFiles = new HashMap<String, List<File>>();
                for (Iterator<File> potDuplicateFilesIterator = potDupFiles.iterator(); potDuplicateFilesIterator
                        .hasNext();) {
                    File file = (File) potDuplicateFilesIterator.next();
                    try {
                        String md5Hex = hashProvider.getHashHex(file);
                        List<File> listOfDuplicatesOfAFile = trueDuplicateFiles.get(md5Hex);
                        if (listOfDuplicatesOfAFile == null) {
                            listOfDuplicatesOfAFile = new ArrayList<File>();
                            trueDuplicateFiles.put(md5Hex, listOfDuplicatesOfAFile);
                        }
                        listOfDuplicatesOfAFile.add(file);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                Collection<List<File>> dupsOfSameSizeList = trueDuplicateFiles.values();
                for (Iterator<List<File>> dupsOfSameSizeListIterator = dupsOfSameSizeList.iterator(); dupsOfSameSizeListIterator
                        .hasNext();) {
                    List<File> list = (List<File>) dupsOfSameSizeListIterator.next();
                    // It will be duplicate only if we have more then one copy of it.
                    if (list.size() > 1) {
                        finalListOfDuplicates.add(list);
                        System.out.println("Duplicate sets found: " + finalListOfDuplicates.size());
                    }
                }
            }

            return finalListOfDuplicates;
        }

        abstract class HashProvider {
            abstract String getHashHex(File file) throws IOException ;
        }

        class Md5HashProvider extends HashProvider {
            String getHashHex(File file) throws IOException {
                // Close the stream with try-with-resources so file handles are not leaked.
                try (FileInputStream fis = new FileInputStream(file)) {
                    return DigestUtils.md5Hex(fis);
                }
            }
        }
        class Sha1HashProvider extends HashProvider {
            String getHashHex(File file) throws IOException {
                try (FileInputStream fis = new FileInputStream(file)) {
                    return DigestUtils.sha1Hex(fis);
                }
            }
        }
    }
Hemant
  • 4,537
  • 8
  • 41
  • 43
1

You might check out the algorithm that the samba/rsync developers use. I haven't looked at it in depth, but I see it mentioned all the time. Apparently it's quite good.

clarson
  • 556
  • 2
  • 7
  • 15
  • rsync is actually using a "rolling checksum" version of the Adler32 algorithm, as of Wikipedia: https://en.wikipedia.org/wiki/Adler-32 – Bigue Nique Jun 16 '14 at 12:28
0

You can do it all in a single step - why sort when no sorting is needed until the duplicate list is found:

nice find . -type f -print0 \
\
| xargs -0 -P 8 xxh128sum --tag | pvZ -i 0.5 -l -cN in0 \
\
| mawk2 'BEGIN {

   _=(FS="(^XXH(32|64|128)[ ][(]|[)][ ]["(OFS="=")"][ ])")<"" 

  } __[$(NF=NF)]--<-_ ' |  pvZ -i 0.5 -l -cN out9 \
                                                  \
  | LC_ALL=C gsort -f -t= -k 3,3n -k 2,2 | gcat -n | lgp3 3

The same logic used to locate unique items in awk via its associative hashed array feature can also be leveraged to find duplicates - it's just the flip side of the same coin - only sort when the list is much smaller.

I got output like this in my own folder filled to the brim with duplicates :

 24131  =./songChunk_93634773_0011_.8045v202108091628512738.ts=56075211016703871f208f88839e2acc
 24132  =./songChunk_93634773_0011_.8045v202108091628512772.ts=56075211016703871f208f88839e2acc

 24133  =./songChunk_93634773_0011_.8045v202108091628512806.ts=56075211016703871f208f88839e2acc
 24134  =./songChunk_93634773_0011_.8045v202108091628512839.ts=56075211016703871f208f88839e2acc
 24135  =./songChunk_93666774_0043_.7102v202108091628512485.ts=77643645774287386a02e83808a632ed

 24136  =./songChunk_93666774_0043_.7102v202108091628536916.ts=77643645774287386a02e83808a632ed
 24137  =./songChunk_92647129_0023_.8045v202108091628536907.ts=146289716096910587a15001b5d2e9d6
 24138  =./songChunk_92647129_0023_.8045v202108091628536946.ts=146289716096910587a15001b5d2e9d6

That said, the major limitation of my lazy approach is that the first file with the same hash it sees is the one it keeps, so if you care about timestamps and naming and all that, then yes you'll have to do a side-by-side call to stat to get you all the precise timestamps and inode numbers and all that TMI it offers.

— The 4Chan Teller

RARE Kpop Manifesto
  • 2,453
  • 3
  • 11
0

The most humane way to check whether two files are exactly the same is to compare their hashes. According to my test results, the most suitable algorithm for file hashing without burdening the processor is "fnv1a64": there is no need to add additional applications, since almost all server software stacks support this algorithm out of the box.

TOTAL FILES TO SCAN 877958

TOTAL EXECUTION TIME 22353.953384876

EXECUTION TIME ALGO
287.33039116859 xxh64
287.48988413811 murmur3f
288.99125409126 xxh128
289.31353402138 xxh3
289.95356488228 murmur3a
291.07013106346 crc32
292.12370014191 murmur3c
294.61375999451 xxh32
294.80482602119 crc32c
295.07662701607 adler32
300.95663881302 crc32b
302.49028491974 joaat
303.52309799194 fnv132
305.78797912598 md4
306.00277495384 fnv164
307.34593200684 fnv1a32
308.21020579338 tiger192,3
308.65456604958 md5
311.2399160862 tiger160,4
311.49055314064 sha1
311.90242314339 fnv1a64
312.08589076996 tiger160,3
314.24259185791 tiger128,3
315.16137599945 tiger192,4
315.28597712517 haval224,3
316.56637001038 haval160,3
317.15466785431 haval128,3
319.63687586784 haval192,3
320.57202386856 sha512
320.92411613464 tiger128,4
321.84568786621 haval256,3
322.48667311668 sha384
327.83715701103 haval192,4
328.23257398605 haval160,4
329.07334899902 haval224,4
330.29799699783 haval256,4
331.03335189819 sha3-256
332.09470105171 ripemd256
332.61167287827 haval128,4
332.66794395447 ripemd128
333.23926591873 sha512/256
336.2356030941 sha3-384
337.06548285484 sha3-224
337.61635899544 sha512/224
337.90623784065 sha256
338.38005399704 haval128,5
338.39897894859 haval160,5
345.30661702156 haval224,5
347.6267209053 sha224
348.70697999001 ripemd320
349.56418085098 haval192,5
355.15771198273 ripemd160
364.52631402016 sha3-512
367.94311404228 haval256,5
405.92915797234 whirlpool
473.47747612 gost
482.507365942 gost-crypto
687.59661793709 snefru256
689.45465397835 snefru
2137.421407938 md2
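For reference, FNV-1a 64 is also simple enough to implement inline if your platform does not expose it; a minimal sketch using the standard FNV-1a 64-bit offset basis and prime:

    public class Fnv1a64 {

        private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
        private static final long PRIME = 0x100000001b3L;

        /** FNV-1a 64-bit hash of a byte array. */
        static long hash(byte[] data) {
            long h = OFFSET_BASIS;
            for (byte b : data) {
                h ^= (b & 0xff);   // xor in the next octet
                h *= PRIME;        // then multiply by the FNV prime
            }
            return h;
        }
    }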
-1

I remember the old modem transfer protocols, like Zmodem, would do some sort of CRC compare for each block as it was sent. CRC32, if I remember ancient history well enough. I'm not suggesting you make your own transfer protocol, unless that's exactly what you're doing, but you could maybe have it spot-check a block of the file periodically, or maybe hashing each 8k block would be simple enough for the processors to handle. Haven't tried it myself.

Ricochetv1
  • 19
  • 1