
I am trying to write a shell script that reads a text file line by line, greps for each line in a TSV file, then does various things with the output. I'm stuck on the grepping.

strains_list_test:

p1.A2
p1.A3
p1.C5
p1.D11
p1.D2
p8.H2

The sh script so far:

CURRENT_DIR=$(pwd)
STRAINS=$(pwd)/strains_list_test
OUTPUT_DIR=$(pwd)/output

BLAST_FILE=$(pwd)/filtered2.tsv

#this generates fasta files for all strains in input file.
cat $STRAINS | while read strain_name
    do
        echo "strain name is" $strain_name
        grep '$strain_name' $BLAST_FILE | head 
        
    done

The output looks like this:

strain name is p1.A2
strain name is p1.A3
strain name is p1.C5
strain name is p1.D11
strain name is p1.D2
strain name is p8.H2

So in each loop, the grep returns nothing. After searching online for answers I have tried '$strain_name', "$strain_name", "/$strain_name", "${strain_name}" and god knows what else, but to no avail.

The thing is, if I leave out the variable and just do grep "p8.H2" $BLAST_FILE | head (for example), I get the correct output. So at least that part works, but something about the way I use the variable as grep input or maybe the line reading is broken... Even though the echo line still prints the variable correctly.

EDIT: Multiple people have recommended I use double quotes instead. As I said above, I have already tried "$strain_name".
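
To make that recommendation concrete, here is a minimal sketch of what the two quoting styles hand to grep. The data line is one row of the sample TSV in this question; the names `sample`, `single_count` and `double_count` are made up for the illustration.

```shell
#!/bin/bash
# One row from the question's sample TSV; "sample" is a hypothetical name.
sample='ProtName p8.H2|protID 34.300 207 100 5 101 278 22 221 1.20e-12 81.3 100'
strain_name='p8.H2'

# Single quotes: no expansion happens, so grep searches for the literal
# text $strain_name, which does not occur in the data.
single_count=$(printf '%s\n' "$sample" | grep -c '$strain_name' || true)

# Double quotes: the shell expands the variable before grep runs,
# so grep searches for p8.H2.
double_count=$(printf '%s\n' "$sample" | grep -c "$strain_name" || true)

echo "single quotes matched: $single_count line(s)"
echo "double quotes matched: $double_count line(s)"
```

This only covers the quoting part. It uses a clean value for `strain_name`, so it does not explain why the double-quoted version still fails when the names are read from the file.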

Here's an example TSV file.

ProtName p26.C10|protID 58.744 223 83 2 1 216 1 221 1.95e-59 234 100
ProtName p26.C10|protID 38.000 150 68 1 216 340 72 221 6.37e-14 85.5 100
ProtName p8.H2|protID 34.300 207 100 5 101 278 22 221 1.20e-12 81.3 100
ProtName p23.A4|protID 72.002 1718 453 4 340 2029 72 1789 0.0 2511 100
ProtName p23.A4|protID 58.744 223 83 2 1 216 1 221 1.95e-59 234 100
ProtName p23.A4|protID 38.000 150 68 1 216 340 72 221 6.37e-14 85.5 100
ProtName p23.A4|protID 34.300 207 100 5 101 278 22 221 1.20e-12 81.3 100

Here is the current code in its entirety.

#!/bin/bash

CURRENT_DIR=$(pwd)
STRAINS=$(pwd)/strains_list_test
OUTPUT_DIR=$(pwd)/output

BLAST_FILE=$(pwd)/filtered2_sample.tsv

cat $STRAINS | while read strain_name
    do
        echo "strain name is" $strain_name
        grep "$strain_name" $BLAST_FILE | head 
        
    done
echo "testing non-variable grep"
grep "p8.H2" $BLAST_FILE | head 

SECOND EDIT: I ran my script with bash -x to get a detailed trace. Here is what the test echo line shows: echo 'strain name is' $'p8.H2\r' Maybe the \r is the reason the grep isn't working? Any ideas on how to fix this?
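
One way to check that suspicion is to reproduce it on a throwaway file. This is a minimal sketch, assuming bash (for the `$'\r'` quoting). The temporary file and the names `list_file`, `cr_lines`, `tsv_line` and `matches` are made up for the illustration, and the Windows-style line endings are added on purpose.

```shell
#!/bin/bash
# Throwaway list file with CRLF line endings, mimicking the suspected problem.
list_file=$(mktemp)
printf 'p1.A2\r\np8.H2\r\n' > "$list_file"

# Count lines that contain a carriage return; anything above 0 means
# the file has carriage returns in it.
cr_lines=$(grep -c $'\r' "$list_file" || true)
echo "lines containing a carriage return: $cr_lines"

# od -c prints the raw bytes, so the \r before each \n becomes visible.
od -c "$list_file"

# Each name read from the file keeps its trailing \r, so the pattern is
# "p8.H2" followed by \r, and the TSV row has no \r after the name.
tsv_line='ProtName p8.H2|protID 34.300 207 100 5 101 278 22 221 1.20e-12 81.3 100'
matches=0
while IFS= read -r strain_name; do
    if printf '%s\n' "$tsv_line" | grep -q "$strain_name"; then
        matches=$((matches + 1))
    fi
done < "$list_file"
echo "matches with carriage returns left in: $matches"

rm -f "$list_file"
```

Running `od -c` (or the `grep -c` line) against the real strains_list_test should show whether it has the same line endings.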

THIRD EDIT: The grep definitely isn't the issue: I tried this instead:

for strain_name in p1.A2 p1.A3 p1.C5 p1.D11 p8.H2 p1.D2

So it works when I don't read the file and have the strains list directly in the script. I don't really want this for the final version, but this suggests there's something wrong with the way the strains_list_test file is being read. (And no, before you ask, changing the "while read" to "for ... in" alone didn't do it.)

FOURTH EDIT: The above code works when I change the strains_list_test file from a column to a single line (p1.A2 p1.A3 p1.C5 p1.D11 p8.H2 p1.D2). So I found a way to do what I wanted to do. However, it's still not clear why the previous version wasn't working.

Hek
  • Careful: sh != bash. What shebang line do you use at the top of your script? Bash or sh? – Nic3500 Feb 21 '23 at 16:07
  • Can we assume that tsv is "tab separate values"? Your sample data does not have tabs (that we can see anyway). Provide a complete example with a sample blast file. – Nic3500 Feb 21 '23 at 16:09
  • Ah, since you put single quotes around your variable in the `grep` it is not evaluated. Put double quotes ( " ) – Nic3500 Feb 21 '23 at 16:12
  • Just add an `echo` before `grep` and double-check what you get. Compare the output from this `echo grep '$strain_name' $BLAST_FILE` to the output from this `echo grep "$strain_name" $BLAST_FILE`. There. That’s the issue. – Andrej Podzimek Feb 21 '23 at 16:12
  • https://stackoverflow.com/questions/6697753/difference-between-single-and-double-quotes-in-bash – Nic3500 Feb 21 '23 at 16:20
  • BTW, `$(pwd)` is a lot slower than `$PWD` – Charles Duffy Feb 21 '23 at 16:21
  • @AndrejPodzimek, `echo` is a poor choice of tools here; `set -x` logs are much more reliable. For example, there's no difference in the output of `echo "hello" "world"` and `echo "hello world"`, even though they're very different commands (and for things that aren't echo, the difference is generally critical). – Charles Duffy Feb 21 '23 at 16:22
  • The single quotes aside, you should avoid starting external processes in a loop (unless it is absolutely necessary): `readarray -t strains < "${PWD}/strains_list_test" ; blast_file="${PWD}/filtered2.tsv" ; grep -E "$(IFS='|'; echo "${strains[*]//./\\.}")" < "$blast_file"`. In case the strains list is really huge, `readarray` has useful options `-c` and `-C` to split the processing into manageable chunks. – Andrej Podzimek Feb 21 '23 at 16:23
  • Also, as a rule, avoid `cat file | while read ...` in favor of `while read ...; done <file` – Charles Duffy Feb 21 '23 at 16:24
  • @CharlesDuffy Agreed, in general, but I still like `echo` as a quick and dirty hack. `set -x` is great, but mostly requires a subshell `( set -x; some_command; )` so that I don’t display lots of output irrelevant to the debugging task at hand. – Andrej Podzimek Feb 21 '23 at 16:25
  • @AndrejPodzimek, you can also use `{ set +x; } 2>/dev/null` to turn off `set -x` without creating extra log content or requiring a subshell. (Substitute a different file descriptor if you have `BASH_XTRACEFD` pointed somewhere other than stderr, of course) – Charles Duffy Feb 21 '23 at 16:26
  • @user21259325, ...another aside: consider using lowercase names for your own variables. https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html specifies all-caps names to be used for variables meaningful to the shell and operating-system-provided tools, whereas lowercase names are "reserved for application use" and guaranteed not to have unintended side effects on POSIX-compliant tools. (Read the link keeping in mind that setting a regular shell variable overwrites any like-named environment variable, so the same conventions apply to both types). – Charles Duffy Feb 21 '23 at 16:28
  • @Nic3500 As I have already said in my first post, I have tried double quotes instead of single quotes. Same result. – Hek Feb 22 '23 at 08:52
  • @AndrejPodzimek I have tried double quotes. – Hek Feb 22 '23 at 08:59
  • I've uploaded some sample data for the TSV so you can try to run it yourself. Sorry for not including that from the beginning. I appreciate the unrelated feedback (like on faster and more robust ways to do this) but for now I need to get this basic part working first. – Hek Feb 22 '23 at 09:29
  • I've solved the issue. The reason it wasn't working even with double quotes was the strains_list_test file. I changed the file from a vertical list to a horizontal (so only spaces separating the names, no line break) and now it works (with double quotes ofc) (though I switched to a for loop because the cat read was treating one line as a single entry while iterating). – Hek Feb 22 '23 at 10:23
  • To rephrase, I found a way to do what I wanted to do by slightly changing the input file and my way of iterating through it, but it's still not clear why the original version - WITH the double quotes - wasn't working. – Hek Feb 22 '23 at 10:57
  • That `\r` in your test output is probably a Windows-style end-of-line, which consists of two characters: `\r\n`, whereas Unix-alikes use only `\n`. See here: [https://stackoverflow.com/questions/12747722/what-is-the-difference-between-a-line-feed-and-a-carriage-return](https://stackoverflow.com/questions/12747722/what-is-the-difference-between-a-line-feed-and-a-carriage-return) Did you maybe create the file on Windows? If that is the problem, this post is not a duplicate of the one about double quotes. – Cloudberry Feb 22 '23 at 19:22
  • @Cloudberry Looks like that IS the issue! I did not create the file on Windows, but the file is a copy (that I deleted most of the contents of, to make a shorter sample dataset) of a file I received from someone else, who presumably created it on Windows. It looks like copying the file, even on a Linux system, was enough to transfer the hidden end-of-line. I created a new file, manually typed out by hand the same names as in the other file, and with THIS version, the code works. So yes, looks like hidden end-of-line characters as carry-over from Windows were the problem. – Hek Feb 23 '23 at 08:51
  • Addendum: Typing it out by hand was overkill, it was enough to make a new file, copy the contents of the old (Windows-created) file in the (gedit, Ubuntu-default) text editor, and that worked. Thanks for helping me get to the bottom of this, Cloudberry! – Hek Feb 23 '23 at 08:57
  • Here's how to get rid of them without having to edit manually: [https://stackoverflow.com/questions/800030/remove-carriage-return-in-unix](https://stackoverflow.com/questions/800030/remove-carriage-return-in-unix) – Cloudberry Feb 23 '23 at 17:32
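
Pulling the comments above together, here is a minimal sketch of the loop with the carriage return stripped inside the script, so the list file does not need to be edited by hand. It assumes bash. The temporary files stand in for strains_list_test and filtered2_sample.tsv (with shortened rows), and the lowercase names `strains`, `blast_file`, `found` and `hits`, as well as the `-F` flag, are additions for the illustration rather than something from the thread.

```shell
#!/bin/bash
# Stand-ins for the question's files, created here so the sketch is self-contained.
strains=$(mktemp)
blast_file=$(mktemp)
printf 'p1.A2\r\np8.H2\r\n' > "$strains"    # CRLF line endings on purpose
printf 'ProtName p26.C10|protID 58.744\nProtName p8.H2|protID 34.300\n' > "$blast_file"

found=0
# Redirect the file into the loop instead of piping from cat.
while IFS= read -r strain_name; do
    # Remove one trailing carriage return, if present.
    strain_name=${strain_name%$'\r'}
    echo "strain name is $strain_name"

    # -F treats the name as a fixed string, so "." is not a regex wildcard.
    grep -F -- "$strain_name" "$blast_file" | head

    # Keep a running total of matching rows (hypothetical bookkeeping).
    hits=$(grep -cF -- "$strain_name" "$blast_file" || true)
    found=$((found + hits))
done < "$strains"

echo "total matching rows: $found"
rm -f "$strains" "$blast_file"
```

If you would rather clean the list once up front, `tr -d '\r' < strains_list_test > strains_list_unix` removes every carriage return (the output file name is just an example), and `dos2unix` does the same if it is installed.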
