I have a file containing information that I extract in the following way:

Command

cat 2018_02_15_09_01_08_result.tsv | grep -o [A-Z]\\*[0-9]*:[0-9]* | sort | uniq | sed -e 's/^/HLA-/'  |tr '\n' ',' | sed '$ s/.$//'

Output

HLA-A*30:02,HLA-B*18:01,HLA-C*05:01

But when I try to save this in a variable, the asterisk and the letter before it disappear. I've tried several approaches (adding or removing commas, etc.) and I still can't get it to print properly.

hla=`cat 2018_02_15_09_01_08_result.tsv | grep -o [A-Z]\\*[0-9]*:[0-9]* | sort | uniq | sed -e 's/^/HLA-/'  |tr '\n' ',' | sed '$ s/.$//'`

echo $hla
HLA-05:01,HLA-18:01,HLA-30:02
echo "$hla"
HLA-05:01,HLA-18:01,HLA-30:02
Biffen
HeyHoLetsGo
  • Show us the contents of the input file and the expected output, rather than a chain of commands. I'm sure it can be done in a far simpler way – Inian Feb 28 '18 at 09:16
  • Does `echo "$hla"` solve your problem? --> https://stackoverflow.com/questions/102049/how-do-i-escape-the-wildcard-asterisk-character-in-bash – aPugLife Feb 28 '18 at 09:21
  • No, `echo "$hla"` does not solve my problem, I've tried it. – HeyHoLetsGo Feb 28 '18 at 09:26

2 Answers

2

There are multiple errors here, most of which will be aptly diagnosed by http://shellcheck.net/ without any human intervention.

  • You really should single-quote your regular expressions unless you specifically require the shell to perform wildcard expansion and whitespace tokenization on the regex before executing the command.

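  • The obsolescent `command` syntax in backticks applies an extra round of backslash processing to the string inside the backticks: \\ is reduced to \ before the inner command is parsed, so the shell then turns the unquoted \* into a bare *, and grep never sees an escaped asterisk. That is why the letter and the asterisk vanish from your matches. The solution since the 1990s is to prefer the $(command) syntax for command substitution, which does not exhibit this problem.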

  • The cat is useless; grep knows full well how to read a file.
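To see the quoting difference in isolation, here is a minimal sketch. The sample string is an assumption standing in for one line of the real file, since the input was not shown.

```shell
# Assumed sample; stands in for one line of the real .tsv file.
sample='A*30:02'

# Backticks: \\ collapses to \ before the inner command is parsed, and the
# shell then strips that backslash, so grep receives [A-Z]*[0-9]*:[0-9]*
# and the asterisk is no longer a literal character.
old=`printf '%s\n' "$sample" | grep -o [A-Z]\\*[0-9]*:[0-9]*`

# $(...) with a single-quoted regex: grep receives the pattern unchanged,
# including the escaped (literal) asterisk.
new=$(printf '%s\n' "$sample" | grep -o '[A-Z]\*[0-9]*:[0-9]*')

printf 'backticks:    %s\n' "$old"
printf 'dollar-paren: %s\n' "$new"
```

The first variable should end up without the letter and asterisk, mirroring the symptom in the question, while the second keeps the full match.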

Try this refactored code:

hla=$(grep -o '[A-Z]\*[0-9]*:[0-9]*' 2018_02_15_09_01_08_result.tsv |
  sort -u | sed -e 's/^/HLA-/'  |tr '\n' ',' | sed '$ s/.$//')
echo "$hla"

The double quotes around the variable interpolation in the echo are necessary and useful. Notice also the line wraps for legibility and the use of sort -u in preference to sort | uniq. More generally, try to reduce the number of processes; once I understand what the sed | tr | sed does, I can probably propose a simplification for that, too. Perhaps the simplest fix would be to refactor all of this into a single Awk script, but without access to the input, it's hard to say in more detail what that might look like.
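As one possible simplification of the tr | sed tail, and not something confirmed against the real data, paste can join the lines with commas without leaving a trailing comma to strip. The temporary file and its contents below are hypothetical stand-ins for the real .tsv file.

```shell
# Hypothetical stand-in for the real .tsv file (contents are assumed).
input=$(mktemp)
printf '%s\n' 'x A*30:02 y' 'B*18:01' 'C*05:01 A*30:02' > "$input"

# grep extracts the alleles, sort -u deduplicates them, sed adds the prefix,
# and paste -s joins all lines with "," (no trailing separator to remove).
hla=$(grep -o '[A-Z]\*[0-9]*:[0-9]*' "$input" |
  sort -u | sed 's/^/HLA-/' | paste -sd, -)

rm -f "$input"
echo "$hla"
```

With the assumed sample this should yield the same comma-separated form as the output shown in the question, using one process fewer than the original pipeline.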

(Also, are you really sure you need to capture the value in a variable? Often variable=value; echo "$variable" is just an obscure and inefficient way to say echo "value". Likewise, variable=$(command); echo "$variable" is better written simply as command. Capturing the command's standard output just so you can print it to standard output is a pure waste of cycles, unless you are planning to do something more with the variable's value.)

tripleee
  • Thanks, extremely helpful. Input after the `grep | sort` is something like `HLA-A*30:02\n HLA-B*18:01 HLA-C*05:01` – HeyHoLetsGo Feb 28 '18 at 15:06
  • You can replace the whole pipeline from `sort` on with something like `awk -F- '!($2 in a){a[$2]; printf("%s%s", s, $2);s=","}'` (untested, but you get the idea) but it would probably not be hard to refactor out the first `grep` too. – tripleee Feb 28 '18 at 15:17
-1

I've solved it by saving the output of the command with a redirection:

cat 2018_02_15_09_01_08_result.tsv |
grep -o [A-Z]\\*[0-9]*:[0-9]* |
sort | uniq |
sed -e 's/^/HLA-/'  |tr '\n' ',' | sed '$ s/.$//' > out_file
hla=`cat out_file`
echo $hla

which gets me the expected HLA-A*30:02,HLA-B*18:01,HLA-C*05:01. Not the ideal solution, but it works.

tripleee
HeyHoLetsGo