
I have a scenario where I need to execute a series of commands on each file that find returns. This would normally work well, except that I have over 100 files and folders to exclude from find's results. Listing them all inline becomes unwieldy and impractical to run directly from the shell. It seems it would be better to use an "exclusion file", similar to how tar and grep accept such files.

Since find does not accept an exclusion file but grep does, I want to know: how can the following command be converted so that, instead of find's exclusion (-prune) and -exec, it uses grep with an exclusion file (grep -v -f excludefile) to drop the unwanted folders and files, and then runs the same series of commands on each remaining file?

find $IN_PATH -regextype posix-extended \
  -regex "/(excluded1|excluded2|excluded3|...|excludedN)" -prune \
  -o -type f \
  -exec sh -c "( cmd -with_args 1 '{}'; cmd -args2 '{}'; cmd3 '{}') \
    | cmd4 | cmd5 | cmd6; cmd7 '{}'" \; \
  > output

As a side note (not critical): I've read that the process becomes much less efficient if you don't use -exec. It already takes over 100 minutes per run, so I don't want to slow it down any more than necessary.

ylluminate

2 Answers


The best way I can think of to handle your scenario is to split the one-liner into two commands and introduce xargs with parallel execution.

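# Step 1: list candidate files; the explicit -print keeps pruned paths out of the list
find "$IN_PATH" -regextype posix-extended \
  -regex "/(excluded1|excluded2|excluded3|...|excludedN)" -prune \
  -o -type f -print > /tmp/full_file_list
# Step 2: drop excluded paths (-v), then run the command on up to <nr_procs> files at once;
# -d '\n' (GNU xargs) splits on newlines only, so spaces in paths are preserved
grep -v -f excludefile /tmp/full_file_list | xargs -d '\n' -n 1 -P <nr_procs> sh -c 'command here' > output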

See the questions "Bash script processing limited number of commands in parallel" and "Doing parallel processing in bash?" to learn more about running commands in parallel in bash.

Running find and the per-file commands in a single one-liner makes them compete for disk I/O, so splitting the one-liner in two could speed the process up a little.

Hint: remember to add full_file_list, excludefile, and output to your exclude rules, and always debug your command on a smaller directory to reduce waiting time.
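The split pipeline above can also be kept NUL-delimited from end to end, which avoids the word-splitting problems that paths with spaces cause. The following is a minimal sketch, assuming GNU find, grep, and xargs; the directory layout, the `/cache/` pattern, and the `printf` stand-in for the real per-file command are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical demo tree; names are illustrative only.
demo=$(mktemp -d)
mkdir -p "$demo/keep dir" "$demo/cache"
printf 'a\n' > "$demo/keep dir/file one.txt"
printf 'b\n' > "$demo/cache/tmp.bin"
printf 'c\n' > "$demo/notes.txt"

# Exclusion file: one basic regular expression per line.
# It lives outside the tree so find does not list it.
printf '%s\n' '/cache/' > "$demo.exclude"

# -print0, -z and -0 keep every path NUL-terminated through the whole
# pipeline; -P 4 lets xargs run up to four jobs at a time.
# The printf inside sh -c stands in for the real per-file command.
processed=$(
  cd "$demo" &&
    find . -type f -print0 |
    grep -z -v -f "$demo.exclude" |
    xargs -0 -n 1 -P 4 sh -c 'printf "%s\n" "${1##*/}"' sh |
    sort
)
printf '%s\n' "$processed"

rm -rf "$demo" "$demo.exclude"
```

The path is handed to `sh -c` as `$1` rather than being spliced into the command string, so quotes or spaces in a file name cannot change how the command is parsed.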

James Li
  • Right, thanks for that reminder. I was so stuck on keeping it all together that I didn't think of parallelizing the operation, which makes sense given that computation and disk reading parallelization should speed things up at least slightly. I'm curious: I'm hitting a hard-to-debug issue on one of the commands that, in this case, will be going through xargs now. (A rare `unexpected end of file`) Wonder if there's a way to capture an error and print out which file the error comes from? Perhaps I need to expand this out into some additional lines... – ylluminate Sep 01 '19 at 16:59
  • I'm having a strange issue with `xargs`. `printf "%s" "$list" > output` gives a clean list of output, but when I pass it through `xargs` I see spaces in file paths broken into newlines (ie, `printf "%s" "$list" | xargs -n 1 -P 1 sh -c 'echo "$0"'`). Am I missing some side effect of `xargs`? The "$list" variable seems to output fine and `-0` on xargs doesn't work at all since apparently newlines are used from the grep output. – ylluminate Sep 01 '19 at 23:26
  • Better to update your question with a simplified but working piece of code, along with the unexpected output. – James Li Sep 02 '19 at 00:03
  • Yes, thanks. I've [simplified the code into a chunk that hopefully illustrates it sufficiently here](https://stackoverflow.com/questions/57750498/why-would-xargs-split-input-on-spaces-and-how-to-resolve-it). – ylluminate Sep 02 '19 at 01:05
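On the question raised in the comments about finding which file triggers an error: one option is to wrap the per-file command so that a failure prints the file name. This is a minimal sketch, assuming GNU xargs; `grep -q ok` is a hypothetical stand-in for the real command, and the file names are invented. One possible cause of a rare `unexpected end of file` (an assumption, not confirmed in this thread) is a file name containing a quote character being substituted into the `sh -c` string, which passing the path as `$1` avoids.

```shell
#!/bin/sh
# Hypothetical demo files; "grep -q ok" fails on any file lacking "ok".
demo=$(mktemp -d)
printf 'ok\n'  > "$demo/good file.txt"
printf 'bad\n' > "$demo/broken.txt"

# The wrapper runs the command on "$1" and, if it exits non-zero,
# reports the offending path on stderr, which is collected in a log.
find "$demo" -type f -print0 |
  xargs -0 -n 1 -P 2 sh -c '
    if ! grep -q ok "$1"; then
      printf "FAILED: %s\n" "$1" >&2
    fi
  ' sh 2> "$demo.errors"

failed=$(cat "$demo.errors")
printf '%s\n' "$failed"

rm -rf "$demo" "$demo.errors"
```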

Why not simply:

find . -type f |
grep -v -f excludefile |
xargs whatever

With respect to "this process is already consuming over 100 minutes to execute": that's almost certainly a problem with whatever command line you wrote to replace "whatever" above, and we could probably help you improve it if you post a separate question.
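When writing the excludefile for the pipeline above, keep in mind that grep treats each line as a regular expression matched anywhere in the path. The sketch below compares an unanchored pattern with an anchored one; the directory tree and both pattern files are hypothetical.

```shell
#!/bin/sh
# Hypothetical tree: one directory to exclude and one similarly named file.
demo=$(mktemp -d)
mkdir -p "$demo/cache" "$demo/src"
touch "$demo/cache/a.tmp" "$demo/src/cache_utils.c" "$demo/src/main.c"

# Unanchored: matches "cache" anywhere in the path.
printf '%s\n' 'cache' > "$demo.loose"
# Anchored: matches only paths under the top-level ./cache/ directory.
printf '%s\n' '^\./cache/' > "$demo.strict"

loose=$(cd "$demo" && find . -type f | grep -v -f "$demo.loose" | sort)
strict=$(cd "$demo" && find . -type f | grep -v -f "$demo.strict" | sort)

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"

rm -rf "$demo" "$demo.loose" "$demo.strict"
```

The unanchored pattern also drops `./src/cache_utils.c`, which is probably not intended; anchoring each entry, or using `grep -F` for literal strings, makes the exclusion list more predictable.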

Ed Morton
  • AFAIK, the asker is trying to build an md5sum snapshot of the file system, which is definitely a time-consuming process – James Li Sep 02 '19 at 00:01
  • I don't see any indication of what they're trying to do so you may be right, idk, I'm just assuming there's a better way than `sh -c "( cmd -with_args 1 '{}'; cmd -args2 '{}'; cmd3 '{}') \ | cmd4 | cmd5 | cmd6; cmd7 '{}'"` no matter what all those `cmd`s are. – Ed Morton Sep 02 '19 at 13:13