I have a names.dmp file which contains taxonomy ids and scientific names among other details.

I want to fetch the scientific name of a particular tax-id, for which I am running this command:

cat names.dmp | grep "scientific name" | awk '$1~/^10090$/{print $0}' | cut -d "|" -f1,2

which gives me the output:

10090 | Mus musculus

But I need this to be dynamic, i.e., set a variable id=10090 and use this variable inside the regular expression. The match on "id" must be exact, because otherwise entries such as 210090 and 100904 also appear in the output, and I do not need them.

I am quite inexperienced when it comes to awk, so any help is appreciated.

EDIT:

Here is the example input:

10089   |       Mus formosanus Kuroda, 1925     |               |       authority       |
10089   |       Mus formosanus  |               |       synonym |
10089   |       ricefield mouse |               |       common name     |
10089   |       Ryukyu mouse    |               |       genbank common name     |
10090   |       house mouse     |               |       genbank common name     |
10090   |       LK3 transgenic mice     |               |       includes        |
10090   |       mouse   |       mouse <Mus musculus>    |       common name     |
10090   |       Mus musculus Linnaeus, 1758     |               |       authority       |
10090   |       Mus musculus    |               |       scientific name |
10090   |       Mus sp. 129SV   |               |       includes        |
10090   |       nude mice       |               |       includes        |
10090   |       transgenic mice |               |       includes        |
10091   |       Mus castaneus   |               |       synonym |
10091   |       Mus musculus castaneus  |               |       scientific name |
10091   |       Mus musculus castaneus Waterhouse, 1843 |               |       authority       |
10091   |       southeastern Asian house mouse  |               |       genbank common name     |
10092   |       Mus domesticus  |               |       synonym |
10092   |       Mus musculus domesticus Schwarz & Scharz 1943   |               |       authority       |
10092   |       Mus musculus domesticus |               |       scientific name |
10092   |       Mus musculus praetextus |               |       synonym |
100902  |       Fusarium oxysporum f. sp. conglutinans  |               |       scientific name |
100903  |       Fusarium oxysporum f. sp. fragariae     |               |       scientific name |
100905  |       Cloning vector pACN     |               |       scientific name |
100906  |       Nitrosomonas sp. ENI-11 |               |       scientific name |
100907  |       Chilean sea bass        |               |       common name     |

And the output I need is:

10090 | Mus musculus

chan-98

4 Answers

One option would be:

id=10090
awk -v id="$id" '/scientific name/ && $1 == id' names.dmp | cut -d "|" -f1,2

You can also preserve whitespace in awk (see e.g. "How to preserve the original whitespace between fields in awk?") and fold the cut command into your awk command, but since you describe yourself as 'inexperienced', this is probably the best solution.

jared_mamrot
  • Yes I managed to try this as well: `echo $(cat names.dmp | grep "scientific name" | awk -v pattern=$id '$1==pattern' | cut -d "|" -f1,2 )` It works. thank you. – chan-98 Jul 26 '23 at 12:41
  • @chan-98 not `-v pattern=$id` but `-v pattern="$id"`. Always quote your shell variables unless you have a specific reason to remove the quotes. Also, you don't need `cut` when you're using awk. – Ed Morton Jul 26 '23 at 17:02
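
Taken together, the comments above suggest a quoted variable and a single awk call with no `cat`, `grep`, or `cut`. A minimal sketch of that idea follows. It is not from the original answers: the function name `sci_name` and the sample rows are hypothetical, and it assumes fields are delimited by `|` with optional surrounding whitespace, as in the question's example input.

```shell
# Hypothetical helper: print "<id> | <scientific name>" for one tax-id.
# Usage: sci_name ID FILE
sci_name() {
  awk -F'[[:space:]]*[|][[:space:]]*' -v id="$1" '
    # Compares the whole first field to id (no regexp, no substring match),
    # so 210090 or 100902 are never selected for id=10090.
    $4 == "scientific name" && $1 == id { print $1 " | " $2 }
  ' "$2"
}

# Sample data in the layout of the question (the 210090 row is made up).
sample=$(mktemp)
cat > "$sample" <<'EOF'
10090   |       house mouse     |               |       genbank common name     |
10090   |       Mus musculus    |               |       scientific name |
100902  |       Fusarium oxysporum f. sp. conglutinans  |               |       scientific name |
210090  |       placeholder entry       |               |       scientific name |
EOF

id=10090
result=$(sci_name "$id" "$sample")   # exact id
none=$(sci_name 1009 "$sample")      # prefix of an id, should select nothing
printf '%s\n' "$result"
rm -f "$sample"
```

Because the id is passed with `-v id="$1"`, it reaches awk as data rather than as part of the program text, which is the point of the quoting advice above.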

When you use awk, you frequently don't need anything else:

$ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
  /scientific name/ && $1 == id {print $1 " | " $2}' file
10090 | Mus musculus
  1. -F'[[:space:]]*\\|[[:space:]]*': set the input field separator as space-surrounded |.
  2. -v id="10090": declare awk variable id and assign it 10090 (change this if needed).
  3. If the input record matches string scientific name and the first field equals id, print the two first fields separated by |.

As noted in comments this does not preserve the input field separators. In case you want to preserve them you can use the split function of GNU awk, instead of the input field separator, to save the fields in an array and the separators in another:

$ awk -v id="10090" '/scientific name/ {
    split($0,f,/[[:space:]]*\|[[:space:]]*/,s)
    if(f[1] == id) print f[1] s[1] f[2]}' file
10090   |       Mus musculus

Finally, if your awk is not GNU awk but you want to preserve the field separators, you can use match and substr instead of split:

$ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
  /scientific name/ && $1==id {
    a=match($0,/\|/); b=match(substr($0,a+1),/[[:space:]]*\|/)
    print substr($0,1,a+b-1)}' file
10090   |       Mus musculus

We simply use match to find the index of the first | (a), then the index of the first space before the second | (b), and print everything before that (substr).

Renaud Pacalet
  • It works. I had tried this `echo $(cat names.dmp | grep "scientific name" | awk -v pattern=$id '$1==pattern' | cut -d "|" -f1,2 )` But yours is obviously a much better way. Thanks a lot – chan-98 Jul 26 '23 at 12:40
  • @chan-98 this doesn't produce the output you desire, does it? according to your question, it should be `10090 | Mus musculus` – Paolo Jul 26 '23 at 12:44
  • @Paolo Good point. I added another solution that preserves the input separators, in case it matters. – Renaud Pacalet Jul 26 '23 at 13:05
  • The new solution produced a different, incorrect output `10090 | Mus musculus` – Paolo Jul 26 '23 at 13:07
  • Even though it seems like now OP has changed the desired output in the question, so whatever – Paolo Jul 26 '23 at 13:07
  • When you find you need to double escapes, e.g. `\\|`, consider using a bracket expression instead, `[|]` for slightly improved clarity. It's obvious at a glance that `[|]` means a literal `|` but you have to consider context and think about it to understand whether `\\|` is a literal `|` or a literal `\` followed by a regexp "or" metachar `|`. – Ed Morton Jul 26 '23 at 17:06
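
To illustrate the bracket-expression suggestion in the last comment, the sketch below runs the same field split twice on one line copied from the example input: once with the doubled backslash and once with `[|]`. It is only a side-by-side check under the assumption that the input looks like the sample. The variable names are made up for this example.

```shell
line='10090   |       Mus musculus    |               |       scientific name |'

# Separator written with a doubled backslash, as in the answer.
escaped=$(printf '%s\n' "$line" |
  awk -F'[[:space:]]*\\|[[:space:]]*' '{ print $1 " | " $2 }')

# Same separator written with a bracket expression for the literal pipe.
bracketed=$(printf '%s\n' "$line" |
  awk -F'[[:space:]]*[|][[:space:]]*' '{ print $1 " | " $2 }')

printf '%s\n%s\n' "$escaped" "$bracketed"
```

Both commands split the line into the same fields. The second form simply avoids having to reason about how many backslashes survive shell and awk string processing.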

A possible solution:

$ id=10090
$ awk -v id="$id" 'BEGIN{FS="| +";OFS="    |   "} /scientific name/ && $1==id {print $1,$3" "$4}' file
10090    |   Mus musculus
Paolo

While you can set awk variables from the outside, and this is usually the best solution, your specific case is so simple that interpolation by the shell works as well:

awk '$1~/^'$id'$/{print $0}'

Since you know that your id is always a string of digits, you don't even have to double-quote here.

user1934428
  • Oh I was trying this without surrounding $id with quotes. Thank you so much – chan-98 Jul 26 '23 at 12:44
  • @chan-98: You don't surround `id` with quotes. You need the part to the left of $id to be quoted to avoid expansion of $1, and the part to the right to avoid word splitting and expansion of $0 and $/. You could also have written it completely unquoted as `awk \$1~/^$id\$/{print\ \$0}` – user1934428 Jul 26 '23 at 12:50
  • That can lead to cryptic failures if `id` ever contains some unexpected value: its contents become part of the script before awk sees it, which creates a code-injection weakness. It can cause yet other failures because you're exposing `$id` unquoted to the shell, so asking the shell to perform word splitting and filename expansion (globbing) on it before awk sees it. And it's treating the contents of `$id` as a regexp, so it can have yet more failure cases with unexpected chars in it such as a `.`. – Ed Morton Jul 26 '23 at 17:24
  • So don't do that, instead just set an awk variable and do a string comparison - `awk -v id="$id" '$1 == id'`. See [how-do-i-use-shell-variables-in-an-awk-script](https://stackoverflow.com/questions/19075671/how-do-i-use-shell-variables-in-an-awk-script) for more info. And don't write `awk \$1~/^$id\$/{print\ \$0}` - always quote strings (including scripts) in shell unless you have a very specific need to remove the quotes. – Ed Morton Jul 26 '23 at 17:25
  • Yes you could say that because the OP shows a presumably hard-coded `id=10090` as their starting point that we could write the code as `awk '$1~/^'$id'$/'` but there's just no point getting into the habit of sometimes doing that vs always (unless you have a specific reason not to) using the far more generally robust `awk -v id="$id" '$1==id'` which is just a handful of characters longer and would continue to work if the OP's real code now or in future doesn't actually hard-code `id` but gets it from user input, for example. – Ed Morton Jul 26 '23 at 17:41