
TL;DR: How can I filter ls/find output using grep with an array as the pattern?

Background: I have a pipeline that I have to rerun for datasets that ran into an error. Which datasets failed is recorded in a tab-separated file. I want to delete the files of the datasets where the pipeline ran into an error.

To do so, I extracted the dataset names from another file containing the finished datasets and saved them in a bash array {ds1 ds2 ...}, but now I am stuck because I cannot figure out how to exclude the datasets in the array from my deletion step.

This is the folder structure (X=1-30): datasets/dsX/results/dsX.tsv

Without excluding the finished datasets, i.e. deleting the folders of both the failed and the finished datasets, this works like a charm:

#1. move content to a trash folder
ls /datasets/*/results/*|xargs -I '{}' mv '{}' ./trash/

#2. delete the empty folders
find /datasets/*/. -type d -empty -delete

But since I want to exclude the finished datasets I thought it would be clever to save them in an array:

#find finished datasets by extracting the dataset names from a tab separated log file
mapfile -t -s 1 finished < <(awk '{print $2}' $path/$log_pf)
echo ${finished[@]}

This works as expected, but now I am stuck on filtering the ls output using that array (pseudocode):

#trying to ignore the dataset in the array - not working
ls -I${finished[@]} -d /datasets/*/
#trying to reverse grep for the finished datasets - not working
ls /datasets/*/ | grep -v {finished}

What do you think about my current ideas? Is this possible using bash only? I guess I could do it easily in Python, but for training purposes I want to do it in bash.

Ivo Leist
  • See https://mywiki.wooledge.org/ParsingLs and https://mywiki.wooledge.org/Quotes for issues with your code beyond the problem you're asking about. Can your file names contain blank chars? – Ed Morton Jun 14 '19 at 11:43

2 Answers


grep can get the patterns from a file using the -f option. Note that file names containing newlines will cause problems.

If you need to process the input somehow, you can use process substitution:

grep -f <(process the input...)
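
As a minimal sketch of how this could look with a bash array, the example below feeds the array to grep through process substitution and inverts the match with -v. The array contents and the path listing are hypothetical stand-ins for the real data, and two details are assumptions added here rather than part of the thread: -F treats the names as fixed strings instead of regular expressions, and the surrounding slashes keep ds1 from also matching ds10.

```shell
#!/usr/bin/env bash
# Hypothetical dataset names standing in for the array filled by mapfile.
finished=(ds1 ds3)

# Hypothetical listing standing in for the output of find.
paths=$'/datasets/ds1/results/ds1.tsv\n/datasets/ds2/results/ds2.tsv\n/datasets/ds3/results/ds3.tsv\n/datasets/ds10/results/ds10.tsv'

# -f reads one pattern per line from the process substitution,
# -F treats each pattern as a fixed string, -v keeps only non-matching lines.
# Printing each name as /name/ anchors it to a whole path component.
unfinished=$(printf '%s\n' "$paths" | grep -v -F -f <(printf '/%s/\n' "${finished[@]}"))

printf '%s\n' "$unfinished"
```

With this sample data only the ds2 and ds10 paths remain. As noted above, names containing newlines would still break the one-pattern-per-line format.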
choroba
  • I know, but the file is a tab-separated file with several columns, so I am extracting the dataset names column and saving it in an array – Ivo Leist Jun 13 '19 at 18:17
  • Extending the answer, use `-f` with a process substitution: `grep -f <(printf "%s\n" "${finished[@]}")` – glenn jackman Jun 13 '19 at 18:22
  • @glenn jackman thank you for the quick extension of the other comment. Seems to work :) If you want the points you can add it as an extra answer otherwise I would accept the answer of choroba. – Ivo Leist Jun 13 '19 at 18:59

I must admit I'm confused about what you're doing but if you're just trying to produce a list of files excluding those stored in column 2 of some other file and your file/directory names can't contain spaces then that'd be:

find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" -

If that's not all you need then please edit your question to clarify your requirements and add concise testable sample input and expected output.

Ed Morton
  • Hello Ed, sorry for coming back so late... First of all, thank you for sharing the wiki; it convinced me to switch from parsing "ls" to using "find" :). However, since my original question asked how to grep with an array, I accepted @choroba's answer – Ivo Leist Jun 21 '19 at 08:58
  • Would you mind elaborating in pseudocode on what this awk command does? I'm still a bash novice – Ivo Leist Jun 21 '19 at 09:11
  • Sorry, it's been too long so I don't remember what the question was about and don't want to re-learn it. Basically though it's saving some field of a file in an array and then if the output of find is not in the array (i.e. was not the 2nd field of that file) then it prints the find output for that line. – Ed Morton Jun 21 '19 at 13:16
  • Fair enough, I figured it out in the meantime. If someone else needs to understand it as well, look at the answer by Walter A. in the link below; he has written a brilliant breakdown of this one-liner. https://stackoverflow.com/questions/32481877/what-is-nr-fnr-in-awk – Ivo Leist Jun 21 '19 at 16:00
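
For readers following the comments above, here is a commented sketch of the NR==FNR idiom used in the awk one-liner. The log contents and the input names are hypothetical sample data, since the real files are not shown in the thread. The sketch pipes bare dataset names instead of full find paths, so that a whole input line can equal a stored name.

```shell
#!/usr/bin/env bash
# Hypothetical tab-separated log: header line, then status and dataset name.
log=$(mktemp)
printf 'status\tdataset\nok\tds1\nok\tds3\n' > "$log"

remaining=$(
  printf '%s\n' ds1 ds2 ds3 |
  awk '
    # NR counts lines across all inputs, FNR restarts for each input,
    # so NR == FNR is true only while the first file (the log) is read.
    NR == FNR { seen[$2]; next }   # store column 2 as a key, skip to next line
    # Second input ("-" means stdin): print lines that are not a stored key.
    !($0 in seen)
  ' "$log" -
)
rm -f "$log"

printf '%s\n' "$remaining"
```

With this sample data the only name left is ds2. One caveat with the idiom: if the first file is empty, NR == FNR also holds for the second input, so nothing would be printed.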