
I'm new to Bash scripting and wrote a script to check my photo files, but I find it very slow and it gets a few empty returns when checking 17,000+ photos. Is there any way to use all 4 CPUs to run this script and speed it up?

Please help

#!/bin/bash
readarray -t array < ~/Scripts/ourphotos.txt
totalfiles="${#array[@]}"
echo $totalfiles
i=0
ii=0
check1=""
while : 
do

check=${array[$i]}
if [[ ! -r $( echo $check ) ]] ; then
    if [ $check = $check1 ]; then
     echo "empty "$check
    else
    unset array[$i]
    ii=$((ii + 1 ))
    fi
fi
if [ $totalfiles = $i ]; then
break
fi
i=$(( i + 1 ))
done 

if [ $ii -gt "1" ]; then
 notify-send -u critical $ii" files have been deleted or are unreadable"
 fi
Jeshu
  • First you should analyze where the bottleneck is. I assume the check time can be improved by tweaking/changing the filesystem. It's good to measure where the file-existence check spends most of its time. There is not much you can improve in bash; maybe you can add a cache and check only the difference. Maybe rewriting in C/C++ would boost performance. – Piotr Król Jan 07 '16 at 12:48
  • Thanks for replying, Piotr Krol. I'm reasonably new to scripting and don't know C/C++, only a little AutoIt and some BASIC from years ago. I was hoping to use my multi-core processor to power through the code, but I have no idea how to do that, other than that I need some kind of statement with args in it. As for the filesystem, the array has been shuffled for random display of my photos by the screensaver I have designed, so it goes back and forth over the Picture directory, and I don't know how to unshuffle the array. – Jeshu Jan 07 '16 at 21:14
  • You can find a lot of useful bash sorting code [here](http://stackoverflow.com/questions/7442417/how-to-sort-an-array-in-bash). – Piotr Król Jan 07 '16 at 21:25
  • `$(echo $check)` is doubly incorrect -- the [useless use of `echo`](http://www.iki.fi/era/unix/award.html#echo) is exacerbated by the lack of proper quoting. You want simply `if [[ ! -r "$check" ]]`. See also http://shellcheck.net/ for this type of diagnostics. – tripleee Apr 28 '16 at 05:59
  • @PiotrKról "not much you can improve" isn't really true at all. Reading the file into an array when you only use it once is a huge overcomplication and does not scale to large files. Unsetting elements in the array but then ignoring its size and using a simple counter variable instead is just wacky. And that doesn't address the syntax problems. – tripleee Apr 28 '16 at 06:03

2 Answers


This is an old question, but a common problem lacking an evidence-based solution.

awk '{print "[ -e "$1" ] && echo "$2}' | parallel    # 400 files/s
awk '{print "[ -e "$1" ] && echo "$2}' | bash        # 6000 files/s
while read file; do [ -e $file ] && echo $file; done # 12000 files/s
xargs find                                           # 200000 files/s
parallel --xargs find                                # 250000 files/s
xargs -P2 find                                       # 400000 files/s
xargs -P96 find                                      # 800000 files/s
xargs -P96 stat --format "%n"                        # ~5x as fast! (Added 2023-06-06)

I tried this on a few different systems and the results were not consistent, but xargs -P (parallel execution) was consistently the fastest. I was surprised that xargs -P was faster than GNU parallel (not reported above, but sometimes much faster), and I was surprised that parallel execution helped so much — I thought that file I/O would be the limiting factor and parallel execution wouldn't matter much.
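
For context, the lines above are abbreviated. A minimal sketch of how such a comparison might be reproduced (files.list is an assumed input name: one path per line, with no newlines inside the names) is:

# hypothetical benchmark harness, not the exact commands used above
time (tr '\n' '\0' < files.list | xargs -0 -P4 find > /dev/null 2>&1)
time (tr '\n' '\0' < files.list | xargs -0 -P4 stat --format "%n" > /dev/null 2>&1)

Throughput is then the line count of files.list divided by the elapsed time.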

Also noteworthy is that xargs find is about 20x faster than the accepted solution, and much more concise. For example, here is a rewrite of OP's script:

#!/bin/bash

total=$(wc -l ~/Scripts/ourphotos.txt | awk '{print $1}')

# tr '\n' '\0' | xargs -0 handles spaces and other funny characters in filenames
found=$(cat ~/Scripts/ourphotos.txt | tr '\n' '\0' | xargs -0 -P4 find | wc -l)

if [ $total -ne $found ]; then
  ii=$(expr $total - $found)
  notify-send -u critical $ii" files have been deleted or are unreadable"
fi
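
The rewrite above only counts the missing files. If you also want to see which paths are gone (as the echo "empty" branch in the question did), one option, sketched here under the same no-newlines-in-filenames assumption, is to diff the original list against what find reports:

# hypothetical follow-up: list the missing names instead of just counting them
sort ~/Scripts/ourphotos.txt > /tmp/all.sorted
tr '\n' '\0' < ~/Scripts/ourphotos.txt | xargs -0 -P4 find 2>/dev/null | sort > /tmp/found.sorted
comm -23 /tmp/all.sorted /tmp/found.sorted   # paths in the list that were not found on disk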

UPDATE 2023-06-06: as @RARE-Kpop-Manifesto suggests, stat is faster than find. For example, with 8 cores, I found

cat files.list | tr '\n' '\0' | xargs -0 -P8 stat --format "%n"

to be about 5 times faster than

cat files.list | tr '\n' '\0' | xargs -0 -P8 find
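
To plug the stat variant back into the counting script above, something along these lines should work (my assumption, not part of the original answer); stat reports missing files on stderr, so only the surviving names reach wc:

found=$(tr '\n' '\0' < ~/Scripts/ourphotos.txt | xargs -0 -P8 stat --format "%n" 2>/dev/null | wc -l)
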
webb
  • @webb it's not surprising for `xargs` to be faster than GNU `parallel`. I've noticed, quite often in fact, that `parallel` spawns new `perl` processes nonstop to handle each new block/chunk of requests, and `perl`s aren't exactly lightweight these days, so that's a lot of overhead. – RARE Kpop Manifesto Jun 02 '23 at 01:19
  • @webb: it also got me thinking: would something like `xargs realpath` or `xargs stat` locate files faster than `find`? – RARE Kpop Manifesto Jun 02 '23 at 01:23
  • 1
    @RARE-Kpop-Manifesto, you are correct! added improvement to answer. – webb Jun 06 '23 at 11:41

It's a filesystem operation, so multiple cores will hardly help. Simplification might:

while IFS= read -r file; do
   i=$((i+1)); [ -e "$file" ] || ii=$((ii+1));
done < "$HOME/Scripts/ourphotos.txt"
#...

Two points:

  • you don't need to keep the whole file in memory (no arrays needed)
  • $( echo $check ) forks a process. You generally want to avoid forking and exec'ing in loops (a fuller version of the loop is sketched after this list).
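
Putting it together with the notify-send line from the question, a complete version of this approach might look like the following sketch (my reconstruction of the elided part, not the answer's original code):

#!/bin/bash
# minimal sketch: count listed photos that no longer exist, then notify
i=0; ii=0
while IFS= read -r file; do
   i=$((i+1))
   [ -e "$file" ] || ii=$((ii+1))
done < "$HOME/Scripts/ourphotos.txt"

if [ "$ii" -gt 0 ]; then
   notify-send -u critical "$ii files have been deleted or are unreadable"
fi
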
Petr Skocik