
I have a directory with more than 20K files, all with a random number prefix (e.g. 12345--name.jpg). I want to find files with similar names and remove all but one. I don't care which one, because they are duplicates.

To find duplicated names I've used

find . -type f \( -name "*.jpg" \) | | sed -e 's/^[0-9]*--//g' | sort | uniq -d

as the input list for a for loop.

To find all but one to delete, I'm currently using

rm $(ls -1 *name.jpg | tail -n +2)

This operation is pretty slow. I want to speed this up. Any suggestions?

forestplay
  • Why do you use parens around the -name parameter? Can your files and paths contain blanks or newlines? You have files like 17--Peter.jpg and 239--Peter.jpg and 34--Lizzy.jpg and 239--Lizzy.jpg and want to keep one Peter.jpg, one Lizzy.jpg? Are subdirectories involved? – user unknown Mar 13 '18 at 07:12
  • Simple is best: copy one file over, remove the rest, move the one file back (see the sketch below). – itChi Mar 13 '18 at 07:45
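
A minimal sketch of that copy-aside idea for a single duplicate group (the group pattern *--name.jpg and the staging path /tmp/keep.jpg are illustrative, not from the question):

    set -- *--name.jpg           # expand one duplicate group into $1, $2, ...
    cp -- "$1" /tmp/keep.jpg     # copy one file out of the way
    rm -- *--name.jpg            # remove the whole group
    mv /tmp/keep.jpg "$1"        # move the kept copy back under its original name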

3 Answers


I would do it like this.

*Note that you are dealing with the rm command, so make sure you have a backup of the existing directory in case something goes south.

  1. Create a backup directory and back up the existing files. Once done, check that all the files are there.

    mkdir bkp_dir; cp *.jpg bkp_dir/
    
  2. Create another temp directory where we will keep only one file for each similar name, so all unique file names will end up there.

    $ mkdir tmp
    $ for i in $(ls -1 *.jpg|sed 's/^[[:digit:]].*--\(.*\.jpg\)/\1/'|sort|uniq);do cp $(ls -1|grep "$i"|head -1) tmp/ ;done
    

*Explanation of the command is at the end. Once it has run, check the tmp/ directory to confirm you got exactly one instance of each file.

  3. Remove all *.jpg files from the main directory. Saying it again: please verify that all files have been backed up before executing the rm command.

    rm *.jpg
    
  4. Copy the unique instances back from the temp directory. (A quick sanity check follows these steps.)

    cp tmp/*.jpg .
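
As an optional sanity check (illustrative commands, relying on the bkp_dir from step 1), the number of files kept should equal the number of unique names in the backup:

    ls -1 bkp_dir/*.jpg | sed 's/^.*--//' | sort -u | wc -l    # unique names in the backup
    ls -1 *.jpg | wc -l                                        # files kept after cleanup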
    

Explanation of the command in step 2:

  • The command to get unique file names in step 2 is

    for i in $(ls -1 *.jpg|sed 's/^[[:digit:]].*--\(.*\.jpg\)/\1/'|sort|uniq);do cp $(ls -1|grep "$i"|head -1) tmp/ ;done

  • $(ls -1 *.jpg|sed 's/^[[:digit:]].*--\(.*\.jpg\)/\1/'|sort|uniq) will get the unique file names like file1.jpg, file2.jpg (see the short demonstration after this list)

  • for i in $(...);do cp $(ls -1|grep "$i"|head -1) tmp/ ;done will copy one file for each filename to tmp/ directory.
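
For illustration, running the sed part on hypothetical file names shows how the prefixed duplicates collapse to unique base names:

    $ printf '%s\n' 12345--name.jpg 67890--name.jpg 11111--other.jpg | sed 's/^[[:digit:]].*--\(.*\.jpg\)/\1/' | sort | uniq
    name.jpg
    other.jpg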

Utsav

Assuming no subdirectories and no whitespace-in-filenames involved:

find . -type f -name "*.jpg" | sed -e 's/^\.\/[0-9]*--//' | sort | uniq -d > namelist
removebutone () { shift; echo rm "$@"; }; cat namelist | while read n; do removebutone *--"$n"; done

or, more readably:

removebutone () { 
  shift
  echo rm "$@"
}
cat namelist | while read n; do removebutone *--"$n"; done

shift takes the first parameter off $*; since the glob has already expanded to every matching file by the time the function runs, everything but the first match is handed to rm.
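
A tiny demonstration (hypothetical arguments):

    demo () { shift; echo "$@"; }
    demo a b c    # prints: b c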

Note that the parens around the name parameter are superfluous, and that there shouldn't be two pipes before sed. Maybe you had something else there which needed to be covered.

If the output looks promising, you of course have to remove the 'echo' in front of 'rm'.

user unknown
  • What's with using unquoted `$*` instead of the proper `"$@"` and what's with the [useless `cat`?](https://stackoverflow.com/questions/11710552/useless-use-of-cat) – tripleee Mar 13 '18 at 07:40
  • Unquoted `$*` was addressed by my precondition, but it doesn't hurt to change that. What kind of problem do you expect from a single `cat` call? It's useful for testing: first you just call `cat namelist | less` to see that it looks as expected, then you remove the `less` and add the `while`. If you experience performance problems and identify them as being related to `cat`, feel free to share your findings. – user unknown Mar 13 '18 at 08:05
  • The linked question is pretty exhaustive. – tripleee Mar 13 '18 at 08:06
  • You now have a dash after `echo rm` in the cleaned-up version of the function. – tripleee Mar 13 '18 at 08:07

You should not be using ls in scripts, and there is no reason to use a separate file list like in user unknown's reply.

keepone () {
    shift        # drop the first matching file from the argument list
    rm "$@"      # remove the remaining duplicates
}
keepone *name.jpg
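
The glob expands to every file in the group before the function runs, so shift keeps exactly the first match. A hypothetical driver for several known groups (base names borrowed from the comments above):

    for n in Peter.jpg Lizzy.jpg; do    # stand-ins for your duplicate base names
        keepone *--"$n"
    done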

If you are running find to identify the files you want to isolate anyway, traversing the directory twice is inefficient. Filter the output from find directly.

find . -type f -name "*.jpg" |
awk '{ f=$0; sub(/^\.\/[0-9]*--/, "", f); if (a[f]++) print }' |
xargs echo rm

Take out the echo if the results look like what you expect.
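
To see what the awk filter keeps, here is a run on hypothetical find output:

    $ printf '%s\n' ./111--name.jpg ./222--name.jpg ./333--other.jpg |
      awk '{ f=$0; sub(/^\.\/[0-9]*--/, "", f); if (a[f]++) print }'
    ./222--name.jpg

Only the second and later occurrences of each base name are printed, so one copy of each survives.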

As an aside, the /g flag to sed is useless for a regex which can only match once. The flag says to replace all occurrences on a line instead of the first occurrence on a line, but if there can be only one, the first is equivalent to all.
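
For instance, both of these produce the same output, because the ^ anchor permits at most one match per line (hypothetical file name):

    $ echo 123--name.jpg | sed 's/^[0-9]*--//'
    name.jpg
    $ echo 123--name.jpg | sed 's/^[0-9]*--//g'
    name.jpg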

tripleee