How do I find duplicate files by comparing them by size (i.e. not hashing) in bash?
Testbed files:
-rw-r--r-- 1 usern users 68239 May 3 12:29 The W.pdf
-rw-r--r-- 1 usern users 68239 May 3 12:29 W.pdf
-rw-r--r-- 1 usern users 8 May 3 13:43 X.pdf
Yes, files can have spaces (Boo!).
I want to check files in the same directory and move the ones which match something else into a 'these are probably duplicates' folder.
My probable use-case is going to have humans randomly mis-naming a smaller set of files (i.e. not generating files of arbitrary length). It is fairly unlikely that two files will be the same size and yet be different files. Sure, as a backup I could hash and check two files of identical size. But mostly, it will be people taking a file and misnaming it / re-adding it to a pile in which it already exists.
So, preferably a solution with widely installed tools (POSIX?). And I'm not supposed to parse the output of ls, so I need another way to get the actual size (and not a du approximation).
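For the record, the ways I've found so far to get an exact byte count without parsing ls (the two stat spellings are mutually exclusive, GNU vs BSD, so this is a sketch rather than gospel):

    f='The W.pdf'        # example file

    wc -c < "$f"         # POSIX: byte count on stdin, no filename parsing
    stat -c %s "$f"      # GNU coreutils stat
    stat -f %z "$f"      # BSD/macOS stat

Some wc implementations left-pad the number with spaces, but numeric tests like -eq and $(( )) don't care.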
"Vote to close!"
Hold up, cowboy.
I bet you're going to suggest this (cool, you can google search):
https://unix.stackexchange.com/questions/71176/find-duplicate-files
No fdupes (nor jdupes, nor...), nor finddup, nor rmlint, nor fslint - I can't guarantee those on other systems (much less mine), and I don't want to be stuck as customer support dealing with installing them on random systems from now to eternity, nor even in getting emails about that sh...stuff and having to tell them to RTFM and figure it out. Plus, in reality, I should write my script to test the functionality of what is installed, but that's beyond the scope.
https://unix.stackexchange.com/questions/192701/how-to-remove-duplicate-files-using-bash
All these solutions want to start by hashing. Some cool ideas in some of these: hash just a chunk of both files, starting somewhere past the header, then only do a full compare if those turn up matching. Good idea for double-checking work, but I'd prefer to do that only on the very, very few that actually are duplicates. Looking over the first several thousand of these by hand, not one duplicate has been even close to a different file.
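For completeness, I imagine the chunk idea looks something like this - the offset and chunk size are arbitrary guesses on my part, and md5sum is GNU (BSD systems spell it md5):

    # Hash a 4 KiB chunk starting 4 KiB into the file (past a typical header).
    chunk_hash() {
        dd if="$1" bs=4096 skip=1 count=1 2>/dev/null | md5sum | cut -d ' ' -f 1
    }

Still hashing, though, so it stays in the 'backup double-check' pile for me.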
https://unix.stackexchange.com/questions/277697/whats-the-quickest-way-to-find-duplicated-files
Proposed:
    $ find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Breaks for me - -not, -printf, uniq -w, and md5sum are all GNU-isms my userland doesn't have:

    find: unknown option -- n
    usage: find [-dHhLXx] [-f path] path ... [expression]
    uniq: unknown option -- w
    usage: uniq [-ci] [-d | -u] [-f fields] [-s chars] [input_file [output_file]]
    find: unknown option -- t
    usage: find [-dHhLXx] [-f path] path ... [expression]
    xargs: md5sum: No such file or directory
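Here's a rough size-only rewrite of the same idea that I think sticks to POSIX (it assumes no tabs or newlines in filenames, since it uses a tab as the separator):

    # Print "size<TAB>path" for each non-empty regular file, then keep only
    # rows whose size occurs more than once.  $(( )) strips wc's padding.
    find . -type f ! -size 0c -exec sh -c '
        for f do
            printf "%s\t%s\n" "$(( $(wc -c < "$f") ))" "$f"
        done
    ' sh {} + | sort -n | awk -F '\t' '
        { seen[$1]++; line[NR] = $0; size[NR] = $1 }
        END { for (i = 1; i <= NR; i++) if (seen[size[i]] > 1) print line[i] }
    '

Each group of same-size paths that falls out of that could then be handed to the mv step.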
Haven't been able to figure out how rsync -nrvc --delete might work in the same directory (as far as I can tell, rsync pairs files by relative path, so -c would only catch same-named files with different content, not renamed duplicates), but there might be a solution in there.
Well how about cmp? Yeah, that looks pretty good, actually!

    cmp -z file1 file2

Bummer, my version of cmp does not include the -z size option.
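So, emulate it: check byte counts first, and only run the real compare when the sizes match. A sketch of that, using wc -c so no filename ever gets parsed:

    # Stand-in for cmp -z: true only if sizes match AND contents match.
    same_file() {
        [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] && cmp -s "$1" "$2"
    }

    same_file 'The W.pdf' 'W.pdf' && echo "probably a duplicate"

For my use-case the size test alone should almost always settle it, with cmp -s as the paranoia pass.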
However, I tried implementing the idea just for grins - and when it failed, I realized that I also need help constructing my loop logic. Removing things from my loops in the midst of processing them is probably a recipe for breakage, duh.
    if [ ! -d ../Dupes/ ]; then
        mkdir ../Dupes/ || exit 1 # Cuz no set -e, and trap not working
    fi

    for i in ./*; do
        for j in ./*; do
            if [[ "$i" != "$j" ]]; then # Yes, it will be identical to itself
                if [[ $(cmp -s "$i" "$j") ]]; then
                    echo "null" # Cuz I can't use negative of the comparison?
                else
                    mv -i "$i" ../Dupes/
                fi
            fi
        done
    done
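Annotating my own breakage: cmp -s is silent by design, so [[ $(cmp -s ...) ]] tests an empty string and is always false, which sends every pair down the else branch; the negation I was groping for is just if ! cmp -s. And yes, moving $i while both loops are mid-glob means later comparisons hit files that no longer exist. A sketch of what I think the loop logic should be:

    mkdir -p ../Dupes || exit 1

    for i in ./*; do
        [ -f "$i" ] || continue            # skip files already moved away
        for j in ./*; do
            [ -f "$j" ] || continue
            [ "$i" = "$j" ] && continue    # don't compare a file to itself
            if cmp -s "$i" "$j"; then      # test the exit status, not output
                mv -i "$i" ../Dupes/       # quarantine this copy...
                break                      # ...and move on to the next $i
            fi
        done
    done

Swapping cmp -s for the size test above makes it size-only, per the plan, leaving cmp as the backup check for the rare honest collision.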
This next one might have something I could use, but I'm not following what's going on in there:
https://superuser.com/questions/259148/bash-find-duplicate-files-mac-linux-compatible
If it were something that returns size instead of md5, maybe one of the answers in here? Didn't really get answered, though.
TIL: Sending errors from . scriptname will close my terminal instantly. Thanks, Google!
TIL: Sending errors from scripts executed via $PATH will close the terminal if shopt -s extdebug + trap checkcommand DEBUG are set in profile to try and catch rm -r * - but at least it will respect my alias for exit
TIL: Backticks are deprecated, use $(things) - Ugh, so much re-writing to do :P
TIL: How to catch non-ASCII characters in filenames, without using basename
TIL: "${file##*/}"
TIL: file - yes, X.pdf is not a PDF.
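To tie a few of those TILs together - the [^ -~] range under LC_ALL=C matches any byte outside printable ASCII, which is the trick I'm leaning on (worth double-checking):

    file='./some dir/Thé W.pdf'

    name=${file##*/}     # strips through the last slash: basename, sans basename
    echo "$name"         # -> Thé W.pdf

    # Flag names containing bytes outside printable ASCII (space through tilde)
    if printf '%s' "$name" | LC_ALL=C grep -q '[^ -~]'; then
        echo "non-ASCII character in: $name"
    fi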