
So, I wrote a bad shell script (according to several questions, one of which I asked) and now I am wondering which way to go to perform the same, or similar, tasks.

I honestly have no clue which tool may be best for what I need to achieve, and I hope that, by understanding how to rewrite this piece of code, it will be easier to see which way to go.

Here it is:

# read reference file line by line
while read -r linE;
    do
        # field 2 will be grepped
        pSeq=`echo $linE | cut -f2 -d" "`
        # field 1 will be used as filename to store the grepped things
        fName=`echo $linE | cut -f1 -d" "`
        # grep the thing in a very big file
        grep -i -B1 -A2 "^"$pSeq a_very_big_file.txt | sed 's/^--$//g' | awk 'NF' > $dir$fName".txt"
        # grep the same thing in another very big file and store it in the same file as above
        grep -i -B1 -A2 "^"$pSeq another_very_big_file.txt | sed 's/^--$//g' | awk 'NF'  >> $dir$fName".txt"
    done < reference_file.csv

At this point I am wondering... how can I achieve the same result without using a while loop to read through reference_file.csv? What is the best way to go about solving similar problems?

EDIT: when I mention the two very_big_files, I am talking about files > 5GB.

EDIT II: this should be the format of the files:

reference_file.csv:

object   pattern
oj1      ptt1
oj2      ptt2
...      ...
ojN      pttN

a_very_big_file and another_very_big_file:

>head1
ptt1asequenceofcharacters
+
asequenceofcharacters
>head2
ptt1anothersequenceofcharacters
+
anothersequenceofcharacters
>headN
pttNathirdsequenceofcharacters
+
athirdsequenceofcharacters

Basically, I search for each pattern in the two files, and then I need to get the line above and the two lines below each match. Of course, not all the lines in the two files match the patterns in reference_file.csv.
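
For example, with the sample data above, oj1.txt (for pattern ptt1) should end up containing the following block from each of the two big files:

>head1
ptt1asequenceofcharacters
+
asequenceofcharacters
>head2
ptt1anothersequenceofcharacters
+
anothersequenceofcharacters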

gabt
  • There are many things that could be better. For one, don't use `echo | cut` to parse the line; let read do it for you: `while read fName pSeq _ ; do ....` – William Pursell Aug 12 '21 at 11:52
  • It's usually very helpful if you describe the input format of the involved files (with a few lines as examples) and the expected output format (with a few lines as examples). – Ted Lyngmo Aug 12 '21 at 11:59
  • Read the big files only once by forming a custom grep command. grep -e first -e second ... – stark Aug 12 '21 at 12:03
  • @TedLyngmo yes, indeed. But rather than solve the actual problem, I would like to understand how to manage similar problems since, as in the question I linked, this is not the way to go. I wanted to keep it more general, somehow... but I am not even sure this is the right place to ask such questions. – gabt Aug 12 '21 at 12:03
  • The answers you get will depend on these formats. Someone who wants to properly help you right now would have to reverse-engineer the formats from your code, which is very tedious. Personally, I won't even try. – Ted Lyngmo Aug 12 '21 at 12:06
  • I'm with Ted Lyngmo. Efficient bash scripts are typically very creative and not something you can achieve by iteratively improving a naive implementation. ¶ I inspected your script and had a few ideas, but every one of those would change the result in some corner cases (e.g. the same `fName` appears multiple times, or one `pSeq` is in the context (`-B1 -A2`) of a different `pSeq`). If those corner cases never appeared in your data, then that wouldn't be a problem. The more restrictions your input has, the more we can exploit them to create efficient solutions. – Socowi Aug 12 '21 at 12:14
  • added some explanation but, as I mentioned, what I am looking for is a way of thinking about these problems rather than an actual solution (even though a solution may point me towards a way of thinking). – gabt Aug 12 '21 at 12:20
  • Your case is a bit tricky since it generates multiple dynamically named files. This most likely requires a loop. The typical way to speed up that loop is to use `awk` instead of bash's `while`. And once you have one `awk` script, the next logical thing is to do everything in `awk`. – Socowi Aug 12 '21 at 12:24
  • and that is not even the trickiest part...I guess I better start learning `awk`, then. – gabt Aug 12 '21 at 12:28
  • Oof. While it's my fault that you're here, this question reads as a bit overbroad (to the point of maybe being a better fit for [codereview.se]). "How can I avoid using grep and cut in a while read loop?" might be a less-open-ended way to pose it. – Charles Duffy Aug 12 '21 at 12:28
  • How many separate output files do you expect this to create? (If they'll all fit in the file descriptor table at once, we can do something a little bit like awk's fd-caching optimizations) – Charles Duffy Aug 12 '21 at 12:30
  • @stark, grep can also read patterns from a file or file descriptor, so they don't all need to fit in a command line – Charles Duffy Aug 12 '21 at 12:31
  • @CharlesDuffy, well, I guess it was all tied to the question I linked... I am going crazy trying to solve a big problem. To answer your question, I expect 24 different outputs, since the while above is supposed to run twice, on two different reference files. – gabt Aug 12 '21 at 12:32
  • Oh, 24 is small enough we can open them all at the same time in bash, no problem. It might have been a problem if there were thousands. – Charles Duffy Aug 12 '21 at 12:39
  • It _is_ worth learning some awk, though. `grep | sed | awk` can pretty much always be replaced with just awk, with the awk code extended to do whatever you were using grep and sed for before. – Charles Duffy Aug 12 '21 at 12:45
  • Anyhow -- is it fair game to change the format of your big files to allow easy constant-memory lookup, or provide code that creates _new_ files that are formatted in a way that allows proper joining? – Charles Duffy Aug 12 '21 at 12:47
  • The point is that I will also need `for` and `while` loops, and they can be nested. And... I am doing what I posted to reduce the size of the big files and somehow subset them, to perform other tasks on the many smaller files. I guess the format is not a problem; they are text files. – gabt Aug 12 '21 at 12:48
  • The format _is_ a problem, because files that are ordered by the lookup key are amenable to high-speed, memory-efficient single-pass algorithms; without that, random access (or starting from the beginning on every lookup, like your current code does) is needed. So to make your files amenable to both memory- and CPU-efficient search, we want to rewrite them to have one line per record, with the first field of that line being the `ptt` value, and with the lines being sorted _by_ the `ptt` values. – Charles Duffy Aug 12 '21 at 12:49
  • how do you do it? – gabt Aug 12 '21 at 12:52
  • Do what? The join? See `man join`. If you read the output of the `join` command into your `while read` loop, that gives you all the data from both inputs at once. – Charles Duffy Aug 12 '21 at 12:53
  • See an example at [merge/join two tables fast linux command line](https://stackoverflow.com/questions/13300271/merge-join-two-tables-fast-linux-command-line). There, they're running the `sort` in-line, but of course it's faster if the sorting is done already before the program even starts. – Charles Duffy Aug 12 '21 at 12:54
  • @gabt: I hope your CSV file will always be small. Count how many child processes you start for each single input line, and multiply it by the number of lines!!! – user1934428 Aug 12 '21 at 12:55
  • @user1934428, ...to be fair, being told how much of a problem that is is why gabt is here, so I think that's already been adequately communicated. :) – Charles Duffy Aug 12 '21 at 12:55
  • @CharlesDuffy : Well, complexity is only one of the problems (and in this case, it would make sense to split the problem into individual pieces, which can be discussed separately). But this is not the only issue: unless we know more about the actual content of the individual files (how general is this CSV? Can there be multi-line fields?), the way of parsing it is questionable too. By and large, the problem should IMO be broken down. The question, as it stands now, is much too broad. – user1934428 Aug 12 '21 at 12:59
  • as I mentioned, the number of lines in the csv is 12, with two csvs. But that is not the point; there could be 1000 lines. What I meant to ask with the question was... how to approach these kinds of problems, in which you need to do things on huge files and using bash as I showed is not recommended. I believe the question is both quite complex and broad, and we can close it if it is somehow off-topic or whatever; that is not a problem. – gabt Aug 12 '21 at 13:03
  • @gabt, ...we do like questions with narrow, specific answers here -- as the Help Center says, "If you can imagine an entire book that answers your question, you’re asking too much". This is definitely a topic that books could be written on. – Charles Duffy Aug 12 '21 at 13:08
  • @CharlesDuffy I guess you're right so, well...I might have aimed too high. What to do, then? Should I close it? – gabt Aug 12 '21 at 13:09
  • Either close it or narrow it. Some of the problems the code has are already answered elsewhere in the knowledgebase (like how to get rid of the `echo | cut`s by having `read` split into variables itself, or how to use `join`), but if there's a specific question that isn't already answered, by all means edit to focus in on it. – Charles Duffy Aug 12 '21 at 13:10
  • ok, let's close it then. I need time to think this through. Thank you for your suggestions, though. – gabt Aug 12 '21 at 13:11
  • ...honestly, if you know join is an option but can't quite figure out how to get your data into a format it can read, that might be something that could be transformed into a specific, narrow question as well. – Charles Duffy Aug 12 '21 at 13:11

1 Answer


Global Maxima

Efficient bash scripts are typically very creative and not something you can achieve by incrementally improving a naive solution.

The most important part of finding efficient solutions is to know your data. Every restriction you can make allows optimizations. Some examples that can make a huge difference:
- The input is sorted, or data in different files comes in the same order (see the join sketch right after this list).
- The elements in a list are unique.
- One of the files to be processed is way bigger than the others.
- The symbol X never appears in the input or only appears at special places.
- The order of the output does not matter.
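
The first point, for example, is what makes join (brought up in the comments) usable: two files sorted on the same key can be merged in a single pass with constant memory. A minimal sketch, with hypothetical files a.txt and b.txt that both carry their lookup key in column 1:

# hypothetical, whitespace-separated files with the join key in field 1
sort -k1,1 a.txt > a.sorted
sort -k1,1 b.txt > b.sorted
join a.sorted b.sorted   # single pass over each sorted file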

When I try to find an efficient solution, my first goal is to make it work without an explicit loop. For this, I need to know the available tools. Then comes the creative part of combining these tools. To me, this is like assembling a jigsaw puzzle without knowing the final picture. A typical mistake here is similar to the XY problem: After you assembled some pieces, you might be fooled into thinking you'd know the final picture and search for a piece Y that does not exist in your toolbox. Frustrated, you implement Y yourself (typically by using a loop) and ruin the solution.
If there is no right piece for your current approach, either use a different approach or give up on bash and use a better scripting/programming language.

Local Maxima

Even though you might not be able to get the best solution by improving a bad solution, you still can improve it. For this you don't need to be very creative if you know some basic anti-patterns and their better alternatives. Here are some typical examples from your script:

Some of these might seem very small, but starting a new process is way more expensive than one might suppose. Inside a loop, the cost of starting a process is multiplied by the number of iterations.

Extract multiple fields from a line

Instead of calling cut for each individual field, use read to read them all at once:

# anti-pattern: every field costs a command substitution subshell plus a cut process
while read -r line; do
  field1=$(echo "$line" | cut -f1 -d" ")
  field2=$(echo "$line" | cut -f2 -d" ")
  ...
done < file

# better: let read split the line into fields itself, with no extra processes
while read -r field1 field2 otherFields; do
  ...
done < file
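
Applied to your loop (this is also what the comments suggest), the whole echo/cut preamble collapses into the read itself; the trailing _ is just a spare variable that swallows any extra columns:

while read -r fName pSeq _; do
  # fName = column 1 (name of the output file), pSeq = column 2 (pattern to grep for)
  ...
done < reference_file.csv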

Combinations of grep, sed, awk

Everything grep (in its basic form) can do, sed can do better. And everything sed can do, awk can do better. If you have a pipeline of these tools, you can combine them into a single call.

Some examples of (in your case) equivalent commands, one per line:

sed 's/^--$//g' | awk 'NF'
sed '/^--$/d'
grep -vFxe--

grep -i -B1 -A2 "^$pSeq" | sed 's/^--$//g' | awk 'NF'
awk "/^$pSeq/"' {print last; c=3} c>0; {last=$0; c--}'

Multiple grep on the same file

You want to read files at most once, especially if they are big. With grep -f you can search multiple patterns in a single run over one file. If you just wanted to get all matches, you would replace your entire loop with

grep -i -B1 -A2 -f <(cut -f2 -d' ' reference_file | sed 's/^/^/') \
     a_very_big_file another_very_big_file
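
For the sample reference_file above, that process substitution hands grep a pattern list along these lines (if your real file has a header row like object pattern, strip it first, e.g. with tail -n +2):

^ptt1
^ptt2
...
^pttN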

But since you have to store different matches in different files ... (see next point)

Know when to give up and switch to another language

Dynamic output files

Your loop generates multiple files. The typical command-line utils like cut, grep and so on only generate one output stream. I know only one standard tool that generates a variable number of output files: split. But split does not split based on values, only on position. Therefore, a non-loop solution for your problem seems unlikely. However, you can optimize the loop by rewriting it in a different language, e.g. awk.
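
awk, on the other hand, can open output files whose names are computed at run time, which is exactly what the loop needs. A minimal sketch with a hypothetical input file keys.txt whose first column is the key to split on:

# one output file per distinct value of field 1
awk '{ print > ($1 ".txt") }' keys.txt

(With very many distinct keys you may have to close() files as you go to stay under the open-file limit.)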

Loops in awk are faster ...

time awk 'BEGIN{for(i=0;i<1000000;++i) print i}' >/dev/null # takes 0.2s
time for ((i=0;i<1000000;++i)); do echo $i; done >/dev/null # takes 3.3s
seq 1000000 > 1M
time awk '{print}' 1M >/dev/null                        # takes 0.1s
time while read -r l; do echo "$l"; done <1M >/dev/null # takes 5.4s

... but the main speedup will come from something different. awk has everything you need built into it, so you don't have to start new processes. Also ... (see next point)

Iterate the biggest file

Reduce the number of times you have to read the biggest files. So instead of iterating over reference_file and reading both big files over and over, iterate over the big files once while holding reference_file in memory.

Final script

To replace your script, you can try the following awk script. This assumes that ...

  1. the filenames (first column) in reference_file are unique
  2. the two big files do not contain > except for the header
  3. the patterns (second column) in reference_file are not prefixes of each other.
    If this is not the case, simply remove the break.

awk -v dir="$dir" '
# first file (reference_file): remember the output name and the pattern from each line
FNR==NR {max++; file[max]=$1; pat[max]=$2; next}
# big files: RS=">" turns every sequence block into one record, FS="\n" makes
# $1 the header line and $2 the line that starts with the pattern
{
  for (i=1;i<=max;i++)
    if ($2 ~ "^" pat[i]) {
      printf ">%s", $0 > (dir "/" file[i] ".txt")  # restore the ">" stripped by RS
      break                                        # first matching pattern wins
    }
}' reference_file RS=\> FS=\\n a_very_big_file another_very_big_file
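
To see what the RS=\> FS=\\n part does to the big files, you can run a quick inspection first (purely illustrative; NR>1 skips the empty record before the first >):

awk 'NR>1 { print "header: " $1; print "pattern line: " $2 }' RS=\> FS=\\n a_very_big_file

Also make sure the output directory exists before running the real script (e.g. mkdir -p "$dir"), because awk will not create it for you.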
Socowi