
Ultimately, I want to eliminate the possibility of duplicate entries showing up in my array. I'm working on a script that compares two directories, then searches for and deletes duplicate files. The potential duplicates are stored in an array, and a file is deleted only if it has the same name and checksum as the original. So if there are duplicate entries in the array, I wind up with minor errors: md5sum tries to find the checksum of a file that doesn't exist (because it was already deleted), or rm tries to delete a file that was already deleted.

Here's part of the script.

compare()
{

read -p "Please enter two directories: " dir1 dir2

if [[ -d "$dir1" && -d "$dir2" ]]; then
    echo "Searching through $dir2 for duplicates of files in $dir1..."
else
    echo "Invalid entry. Please enter valid directories." >&2
    exit 1
fi

#create list of files in specified directory
while read -d $'\0' file; do
    test_arr+=("$file")
done < <(find "$dir1" -print0)

#search for all duplicate files in the home directory
#by name
#find checksum of files in specified directory
tmpfile=$(mktemp -p "$dir1" del_logXXXXX.txt)


declare -A origray    #associative array keyed by file path

for i in "${test_arr[@]}"; do
    Name=$(sed 's/[][?*]/\\&/g' <<< "$i")

    if [[ $(find "$dir2" -name "${Name##*/}" ! -wholename "$Name") ]]; then
        [[ -f "$i" ]] || continue
        find "$dir2" -name "${Name##*/}" ! -wholename "$Name" >> "$tmpfile"
        origray[$i]=$(md5sum "$i" | cut -c 1-32)
    fi
done

#create list of duplicate file locations.
dupe_loc

#compare similarly named files by checksum and delete duplicates
for i in "${!indexray[@]}"; do
    poten=$(md5sum "${indexray[$i]}" | cut -c 1-32)
    for j in "${!origray[@]}"; do
        if [[ "$poten" = "${origray[$j]}" ]]; then
            echo "${indexray[$i]} is a duplicate of a file in $dir1."
            rm -v "${indexray[$i]}"
            break
        fi
    done
done
exit 0 
}

dupe_loc is the following function.

dupe_loc()
{
if [[ -s $tmpfile ]]; then
    mapfile -t indexray < "$tmpfile"
else
    echo "No duplicates were found."
    exit 0
fi
}

I figure the best way to solve this issue would be to use the sort and uniq commands to dispose of the duplicate entries in the array, but even with process substitution, I encounter errors when I try to do that.
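For reference, the kind of deduplication I've been attempting looks roughly like this (a sketch, assuming GNU sort for the -z flag and bash 4.4+ for mapfile -d ''; the array contents here are made up):

```shell
#!/usr/bin/env bash
# Sketch: dedupe a bash array NUL-safely, so paths with spaces survive.
arr=("a b.txt" "c.txt" "a b.txt")

# Emit entries NUL-terminated, sort+dedupe with GNU sort -zu,
# then read them back into a new array with mapfile -d ''.
mapfile -d '' -t deduped < <(printf '%s\0' "${arr[@]}" | sort -zu)

printf '%s\n' "${deduped[@]}"
```

This keeps one copy of each entry but loses the original ordering, since sort reorders the elements.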

Alphatron
    Can you simplify the problem further? Say you have 2 directories with files and you want to have a 3rd directory with only the unique content from both of the directories? – NinjaGaiden Nov 29 '16 at 03:53
    `sort -u -kN,M` should be enough. Way too much code for this problem, please read http://stackoverflow.com/help/mcve before posting more Qs here. Good luck. – shellter Nov 29 '16 at 03:58
  • An easier approach would be to fill `test_arr` with the filename (without path) and once you have `test_arr` filled is simply to loop through the names and `test` if there is a file in dir2 with that name, e.g. `test_arr+=("${file##*/}")`, then `declare -a dups; for i in "${test_arr[@]}"; do [ -f "$dir2/$i" ] && dups+=("$i"); done` You now have the list of duplicates in `dups`. – David C. Rankin Nov 29 '16 at 08:14

1 Answer


First things first. Bash array sorting has been answered here: How to sort an array in BASH

That said, I don't know that sorting the array will be much help. A simpler solution would be to wrap your md5sum check and rm statements in an if statement:

if [ -f "${origray[$i]}" ]; then #true if file exists and is a regular file
    #file exists
    ...
    rm "${origray[$i]}"
fi
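As a self-contained illustration of that guard (file names here are invented for the demo), the check skips any entry whose file has already been removed, which is exactly the failure mode you described:

```shell
#!/usr/bin/env bash
# Demo: only paths that still exist as regular files pass the -f guard.
tmpd=$(mktemp -d)
touch "$tmpd/real.txt"

# One real file, one path that was "already deleted".
candidates=("$tmpd/real.txt" "$tmpd/already-deleted.txt")

kept=()
for f in "${candidates[@]}"; do
    [ -f "$f" ] || continue   # skip files that no longer exist
    kept+=("$f")              # safe to md5sum / rm this one
done

printf '%s\n' "${kept[@]}"
rm -r "$tmpd"
```

With the guard in place, md5sum and rm only ever see paths that currently exist, so a duplicate array entry becomes harmless rather than an error.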
Guest