
Answer to my question, using Kubator's command line:

# Function that shows the files having the same content in the current directory
showDuplicates (){
  last_hash=''
  # Each input line is "md5-hash  file-name"; uniq -w32 -D keeps only
  # lines whose hash occurs more than once (i.e. duplicated content).
  while read -r f1_hash f1_name; do
    # A new hash marks the start of a new group of identical files
    if [ "$last_hash" != "$f1_hash" ]; then
      echo "The following files have the exact same content:"
      echo "$f1_name"
      # Print the other members of this group
      while read -r f2_hash f2_name; do
        if [ "$f1_hash" = "$f2_hash" ] && [ "$f1_name" != "$f2_name" ]; then
          echo "$f2_name"
        fi
      done < <(find ./ -maxdepth 1 -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D)
    fi
    last_hash="$f1_hash"
  done < <(find ./ -maxdepth 1 -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D)
}
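
For what it's worth, the same grouping can be produced in a single pass. The variant below (the name showDuplicatesOnePass and the variant itself are mine, not part of Kubator's answer) stores the pipeline's output once instead of re-running it per group:

showDuplicatesOnePass (){
  local dups last_hash=''
  dups=$(find ./ -maxdepth 1 -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D)
  [ -n "$dups" ] || return 0    # no duplicated content found
  while read -r hash name; do
    # print a header whenever a new hash group starts
    if [ "$hash" != "$last_hash" ]; then
      echo "The following files have the exact same content:"
    fi
    echo "$name"
    last_hash=$hash
  done <<< "$dups"
}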

Original question:

I've seen some discussions about what I'm about to ask, but I have trouble understanding the mechanics behind the proposed solutions, and I have not been able to solve the problem that follows.

I want to make a function to compare files; for that, naively, I've tried the following:

# somewhere I use that to get the file paths
files_to_compare=$(find $base_path -maxdepth 1 -type f)
files_to_compare=( $files_to_compare )

#then I pass files_to_compare as an argument to the following function
showDuplicates (){
  files_to_compare=${1}
  n_files=$(( ${#files_to_compare[@]} ))
  for (( i=0; i < $n_files ; i=i+1 )); do
     for (( j=i+1; j < $n_files ; j=j+1 )); do
         sameContent "${files_to_compare[i]}" "${files_to_compare[j]}"
         r=$?
         if [ $r -eq 1 ]; then
            echo "The following files have the same content :"
            echo ${files_to_compare[i]}
            echo ${files_to_compare[j]}
         fi
    done
  done
}

The function 'sameContent' takes the absolute paths of two files and makes use of different commands (du, wc, diff) to return 1 or 0 depending on whether the files have the same content (respectively).
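
For illustration only, a function matching that description might look like this (a hypothetical sketch, not the actual sameContent from the question):

sameContent (){
  # cheap size check with wc first, then a full comparison with diff;
  # returns 1 when the contents match, 0 otherwise (the convention used above)
  [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 0
  diff -q -- "$1" "$2" > /dev/null && return 1 || return 0
}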

The incorrectness of that code showed up with file names containing spaces, but I've since read that this is not the way to manipulate files in bash anyway.

On https://unix.stackexchange.com/questions/392393/bash-moving-files-with-spaces and some other pages, I've read that the correct way to go is to use code that looks like this:

$ while IFS= read -r file; do echo "$file"; done < files 

I can't seem to understand what lies behind that bit of code, or how I could use it to solve my problem, particularly because I want/need to use nested loops.
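
For context, that idiom works as follows: IFS= stops read from trimming leading and trailing whitespace, -r stops it from interpreting backslash escapes, and the redirection feeds the loop one line at a time. A minimal sketch (an illustration, not from the linked post) of the same idiom consuming find's output via process substitution:

# each iteration receives one file name, spaces and all
while IFS= read -r file; do
  printf 'found: %s\n' "$file"
done < <(find "$base_path" -maxdepth 1 -type f)

For file names that may even contain newlines, the robust variant is find ... -print0 combined with read -r -d ''.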

I'm new to bash and this seems to be a common problem, but if someone were kind enough to give me some insight into how this works, that would be wonderful.

p.s.: please excuse the probable grammar mistakes

  • Collecting `find` output into a string and *then* converting that to an array is not going to work. You could directly collect the results into an array and then loop over that (see the sketch after these comments); but passing an array as a scalar to a function still won't work. The traditional argument list `"$@"` is probably all you need here anyway. Or go with the checksum idea like in the answer below. – tripleee Dec 03 '18 at 11:25
  • Also, pay attention to the second nominated duplicate; basically always use double quotes around file names. – tripleee Dec 03 '18 at 11:31
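
For illustration, the approach those comments suggest might look like this (a hypothetical sketch, not tripleee's code): collect find's results directly into an array, then pass the files to the function as separate arguments rather than as one scalar.

files_to_compare=()
while IFS= read -r -d '' f; do    # -d '' reads up to each null byte
  files_to_compare+=("$f")
done < <(find "$base_path" -maxdepth 1 -type f -print0)

showDuplicates (){
  local files=("$@")    # "$@" rebuilds the array intact, one file per argument
  echo "comparing ${#files[@]} files"
}

showDuplicates "${files_to_compare[@]}"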

1 Answer

How about using md5sum to compare the content of the files in your folder instead? That's a safer and more standard way. Then you would only need something like this:

find ./ -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D

What it does:

  • find finds all files (-type f) in the current folder (./) and separates the output with null bytes (-print0), which is needed for special characters such as spaces in file names (like the moving-files-with-spaces case you mention)
  • xargs takes the null-byte-separated output from find (-0) and runs md5sum on the files
  • sort sorts the output by positions 1-32 (the md5 hash): -k1,32
  • uniq compares only the first 32 characters of each line (-w32, again the md5 hash) and keeps only the duplicated lines (-D)
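
To see it in action, one could create a couple of duplicates first (hypothetical file names mirroring the output below; the actual hashes will of course differ):

printf 'same content\n' > file1.txt
mkdir -p folder1
cp file1.txt folder1/copy_of_file1.txt
find ./ -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D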

Output example:

7a2e203cec88aeffc6be497af9f4891f  ./file1.txt
7a2e203cec88aeffc6be497af9f4891f  ./folder1/copy_of_file1.txt
e97130900329ccfb32516c0e176a32d5  ./test.log
e97130900329ccfb32516c0e176a32d5  ./test_copy.log

If performance is crucial, this can be tuned to group files by size first and compute md5sums only where sizes match; see the sketch below. The resulting list can also be fed to mv, rm, etc.
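
For illustration, that size-first tuning might look like the following rough sketch (it assumes GNU find, awk, and xargs, and unlike the -print0 pipeline above it breaks on file names containing newlines):

# list size<TAB>name, keep only files whose size occurs more than once,
# and hash just those; a file with a unique size cannot have a duplicate
find ./ -type f -printf '%s\t%p\n' |
  sort -n |
  awk -F'\t' '
    $1 == prev { if (!printed) print name; print $2; printed = 1; next }
               { prev = $1; name = $2; printed = 0 }
  ' |
  xargs -r -d '\n' md5sum | sort -k1,32 | uniq -w32 -D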

Kubator