Delete files in one directory that do not exist in another directory or its child directories

Question

I am still new to shell scripting and am trying to come up with a simple script. Could anyone give me some direction? Here is what I need.

Files in path 1: /tmp
100abcd
200efgh 
300ijkl

Files in path2: /home/storage
backupfile_100abcd_str1
backupfile_100abcd_str2
backupfile_200efgh_str1
backupfile_200efgh_str2
backupfile_200efgh_str3

Now I need to delete file 300ijkl in /tmp because no corresponding backup file is present in /home/storage. The /tmp directory contains more than 300 files. I need to delete every file in /tmp that has no corresponding backup file. A backup file's name contains the name of the /tmp file, and the backups may be in /home/storage itself or in directories under /home/storage.

Appreciate your time and response.

David C. Rankin · Answer 1 · 2022-02-08T01:02:10.387

2

You can also approach the deletion using grep. You can loop through the files in /tmp, checking each name with ls piped to grep, and delete the file if there is no match:

#!/bin/bash

if [ -z "$1" ] || [ -z "$2" ]; then  ## validate input
    printf "error: insufficient input. Usage: %s tmpfiles storage\n" "${0##*/}" >&2
    exit 1
fi

for i in "$1"/*; do
    fn=${i##*/}  ## strip path, leaving filename only
    
    ## if a file name under the backup directory contains filename, skip rest of loop
    ## (-R also lists child directories, -F matches the name as a fixed string)
    ls -R "$2" | grep -qF -- "$fn" && continue
    
    printf "removing %s\n" "$i"
    # rm "$i" ## remove file
done

Note: the actual removal is commented out above. Test and ensure there are no unintended consequences before performing the actual delete. Call it passing the path to tmp (without trailing /) as the first argument and /home/storage as the second argument:

$ bash scriptname /path/to/tmp /home/storage
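
One way to test before deleting anything is to run the same check against throwaway directories. The following is only a minimal sketch, not part of the script above: the function name list_orphans and the mktemp sandbox are assumptions made for illustration. It reuses the file names from the question and prints what would be removed without removing it.

```shell
#!/bin/bash
# Hypothetical helper: print each file in directory $1 whose name does not
# occur in the name of any file found under directory $2 (searched recursively).
list_orphans() {
    local tmpdir=$1 storage=$2 f fn
    for f in "$tmpdir"/*; do
        [ -f "$f" ] || continue      # skip directories and an unmatched glob
        fn=${f##*/}                  # strip path, leaving filename only
        # compare against basenames only; -F matches fn as a fixed string
        find "$storage" -type f | sed 's|.*/||' | grep -qF -- "$fn" && continue
        printf '%s\n' "$f"
    done
}

# Build a sandbox that mirrors the example in the question.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/tmp" "$sandbox/storage/sub"
touch "$sandbox/tmp/100abcd" "$sandbox/tmp/200efgh" "$sandbox/tmp/300ijkl"
touch "$sandbox/storage/backupfile_100abcd_str1" \
      "$sandbox/storage/sub/backupfile_200efgh_str1"

# Dry run: only report the candidates for removal.
orphans=$(list_orphans "$sandbox/tmp" "$sandbox/storage")
printf 'would remove: %s\n' "$orphans"

rm -rf "$sandbox"                    # clean up the sandbox itself
```

Only 300ijkl should be reported here, since the other two names each appear in a backup file name.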

edited Feb 08 '22 at 01:02

answered Sep 10 '15 at 00:35


I considered that initially, but since it would be far slower (running both `ls` and `grep` for *each* filename), took the hint from the linux+shell tags. – Thomas Dickey Sep 10 '15 at 10:00
The other alternative would be to hold all filenames in `/home/storage` in an array and then simply `grep -q "$fn" &>/dev/null <<<${array[@]}` using a **herestring** to avoid `ls` on each file. – David C. Rankin Sep 10 '15 at 15:24
Thank you Thomas and David. Really appreciate your response on this. Saved me a lot of time :-). Long way to go for me. I tried David's code and it worked. Will try the other one too. Thanks again. – user3356554 Sep 10 '15 at 18:00
Hey just wanted to point out the above code will have issues with file and folder names containing spaces. To fix this issue, simply change the line `ls ${2}* | grep -q $fn` to `ls "${2}" | grep -q "$fn"` – Eagnir Feb 07 '22 at 09:43
@Eagnir - good catch. Always worth making an old post better. I know better. When you get enough rep to edit, at least with my shell posts, feel free to fix minor oversights like that. – David C. Rankin Feb 08 '22 at 01:01
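
The array-plus-herestring alternative mentioned in the comments could look roughly like the sketch below. It is an illustration under assumptions, not code from either commenter: the function name list_unmatched is hypothetical, mapfile needs bash 4 or later, and file names containing newlines are not handled. It prints the candidates instead of deleting them.

```shell
#!/bin/bash
# Hypothetical helper: print each file in $1 whose name does not occur in the
# name of any file under $2. The storage tree is listed once, not once per file.
list_unmatched() {
    local tmpdir=$1 storage=$2 f fn
    local -a stored
    # mapfile (bash 4+) stores one line of find output per array element
    mapfile -t stored < <(find "$storage" -type f)
    stored=("${stored[@]##*/}")      # keep basenames only
    for f in "$tmpdir"/*; do
        [ -f "$f" ] || continue      # skip directories and an unmatched glob
        fn=${f##*/}
        # herestring: hand the stored names, one per line, to grep
        grep -qF -- "$fn" <<<"$(printf '%s\n' "${stored[@]}")" && continue
        printf '%s\n' "$f"
    done
}

# Run only when called with both directories, e.g.:
#   bash scriptname /path/to/tmp /home/storage
if [ "$#" -eq 2 ]; then
    list_unmatched "$1" "$2"
fi
```

This still starts one grep per file in /tmp, but the listing of the storage directory happens only once.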

score 0 · Answer 2 · edited May 23 '17 at 10:33

You can solve this by:

- making a list of the files in /home/storage
- testing each filename in /tmp to see if it is in the list from /home/storage

Given the linux+shell tags, one might use bash:

- make the list of files from /home/storage an associative array
- make the subscript of the array the filename

Here is a sample script to illustrate ($1 and $2 are the parameters to pass to the script, i.e., /home/storage and /tmp):

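#!/bin/bash
declare -A InTarget

# Index every file under $1 by its basename (lookups are exact-name matches).
while IFS= read -r path
do
    name=${path##*/}
    InTarget[$name]=$path
done < <(find "$1" -type f)

# Remove each file under $2 whose basename is not in the index.
while IFS= read -r path
do
    name=${path##*/}
    [[ -z ${InTarget[$name]} ]] && rm -f -- "$path"
done < <(find "$2" -type f)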

It uses two interesting shell features:

- name=${path##*/} is a POSIX shell feature (parameter expansion) which lets the script perform the basename function without an extra process per filename. That makes the script faster.
- done < <(find ... -type f) is process substitution, a bash feature which lets the script read the list of filenames from find without running the loop in a subprocess. That matters for the first loop: if the array were updated in a subprocess (as it would be with find ... | while read), the assignments would have no effect on the array that the second loop reads.
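
Both features can be seen in isolation in the short sketch below. The sample path and the counter names piped and direct are made up for illustration, and the result for the pipe case assumes bash's default settings (the lastpipe option is off).

```shell
#!/bin/bash
path=/home/storage/sub/backupfile_100abcd_str1

# Parameter expansion: remove the longest prefix matching "*/", with no
# extra process started.
name=${path##*/}
echo "$name"    # backupfile_100abcd_str1

# Pipeline: the loop runs in a subshell, so the counter it updates is lost.
piped=0
printf 'a\nb\nc\n' | while IFS= read -r line; do piped=$((piped + 1)); done
echo "after pipe: $piped"    # after pipe: 0

# Process substitution: the loop runs in the current shell, so the update survives.
direct=0
while IFS= read -r line; do direct=$((direct + 1)); done < <(printf 'a\nb\nc\n')
echo "after process substitution: $direct"    # after process substitution: 3
```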

score 0 · Answer 3 · answered May 27 '21 at 15:12

I spent some really nice time on this today because I needed to delete files which have the same name but different extensions, so if anyone is looking for a quick implementation, here you go:

#!/bin/bash

# We need a reference list of the files we want to keep and not delete.
# Let's assume the files in the first folder are .jpeg and the reference
# files in the second folder are .pdf, so map the extension first.
FILES_TO_KEEP=$(ls -1 "$2" | sed 's/\.pdf$/.jpeg/')

# Iterate through the files in the first argument's path.
for file in "$1"/*; do
    # In my case, I did not want to do anything with directories, so skip them.
    if [[ -d $file ]]; then
        continue
    fi
    # Strip the path with basename so the name can be compared to the keep list.
    NAME_WITHOUT_PATH=$(basename "$file")
    # I use a Mac, whose command-line tools are weak at string handling, so
    # this plain bash substring test checks whether FILES_TO_KEEP contains
    # NAME_WITHOUT_PATH.
    if [[ $FILES_TO_KEEP == *"$NAME_WITHOUT_PATH"* ]]; then
        echo "Not deleting: $NAME_WITHOUT_PATH"
    else
        # No counterpart in the other directory, so remove it.
        echo "deleting: $NAME_WITHOUT_PATH"
        rm -f -- "$file"
    fi
done

Usage: bash deleteDifferentFiles.sh path/from/where path/source/of/truth
