use sed to extract two pieces of text at once from a line

Question

OK, I've found similar answers on SO but my sed / grep / awk fu is so poor that I couldn't quite adapt them to my task. Which is, given this file "test.gff":

accn|CP014704   RefSeq  CDS 403 915 .   +   0   ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator
accn|CP014704   RefSeq  CDS 928 2334    .   +   0   ID=AZ909_00025;locus_tag=AZ909_00025;product=FAD/NAD(P)-binding oxidoreductase
accn|CP014704   RefSeq  CDS 31437   32681   .   +   0   ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA
accn|CP014704   RefSeq  CDS 2355    2585    .   +   0   ID=AZ909_00030;locus_tag=AZ909_00030;product=hypothetical protein

I want to extract two values: 1) the text to the right of "ID=" up to the semicolon, and 2) the text to the right of "product=" up to the end of the line OR a semicolon (since, as you can see, one of the lines also has a "gene=" value).

So I want something like this:

ID    product
AZ909_00020    transcriptional regulator
AZ909_00025    FAD/NAD(P)-binding oxidoreductase
AZ909_00145    gamma-glutamyl-phosphate reductase

This is as far as I got:

printf "ID\tproduct\n"

sed -nr 's/^.*ID=(.*);.*product=(.*);/\1\t\2\p/' test.gff

Thanks!

The data that you've provided doesn't follow a consistent pattern. For instance, there is a `gene=proA` on the 3rd line. Would there be any more optional fields like this? — sjsam, Sep 05 '16 at 00:35

redneb · Accepted Answer · 2016-09-05T00:39:58.923

Try the following:

sed 's/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1\t\2/' test.gff

Compared to your attempt, I changed the way the product is matched. Since we don't know whether the field ends with a ; or at the end of the line, we just match the largest possible run of non-semicolon characters. I also added a .* at the end to match any leftover characters after the product. This way the pattern matches the entire line, so the substitution rewrites it completely.
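
To see what the trailing .* buys you, here is a small sketch that runs the same substitution with and without it on the line that carries the extra gene= attribute. The variable names (line, tab, without_tail, with_tail) are only for illustration, and the tab is inserted through a shell variable because \t in the replacement is a GNU sed extension.

```shell
# The sample line from the question that ends with ";gene=proA".
line=$(printf 'accn|CP014704\tRefSeq\tCDS\t31437\t32681\t.\t+\t0\tID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA')
tab=$(printf '\t')

# Without the trailing .*, the match ends after the product value,
# so the unmatched ";gene=proA" is left in the output.
without_tail=$(printf '%s\n' "$line" | sed "s/.*ID=\([^;]*\);.*product=\([^;]*\)/\1${tab}\2/")

# With the trailing .*, the whole line is matched and replaced.
with_tail=$(printf '%s\n' "$line" | sed "s/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1${tab}\2/")

printf '%s\n' "$without_tail" "$with_tail"
```

The first printed line still ends in ;gene=proA, the second does not.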

If you want something slightly more robust, here's a perl one-liner:

perl -nle '($id)=/ID=([^;]*)/; ($prod)=/product=([^;]*)/; print "$id\t$prod"' test.gff

This extracts the two fields separately using regular expressions. It will work correctly, even if the fields appear in reverse order.
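
If perl is not an option, the same idea (find each field on its own rather than in one pattern) can be sketched in plain awk with match(). The function name extract_id_product is made up for this example, and the sketch assumes that "ID=" and "product=" do not also occur inside some other key such as a hypothetical "ParentID=".

```shell
# Hypothetical helper: print "ID<TAB>product" for each input line,
# locating each attribute independently of the other.
extract_id_product() {
  awk '
    {
      id = ""; prod = ""
      # RSTART/RLENGTH describe the match; skip the "ID=" (3 chars)
      # and "product=" (8 chars) prefixes when taking the value.
      if (match($0, /ID=[^;]*/))      id   = substr($0, RSTART + 3, RLENGTH - 3)
      if (match($0, /product=[^;]*/)) prod = substr($0, RSTART + 8, RLENGTH - 8)
      print id "\t" prod
    }'
}

# Same attributes in the original order and in reverse order.
normal=$(printf '0\tID=AZ909_00030;locus_tag=AZ909_00030;product=hypothetical protein\n' | extract_id_product)
swapped=$(printf '0\tproduct=hypothetical protein;locus_tag=AZ909_00030;ID=AZ909_00030\n' | extract_id_product)

printf '%s\n' "$normal" "$swapped"
```

Both calls print the same ID and product, which is the property the single-pattern sed command does not have.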

The perl solution is ideal here, I guess. +1 — sjsam, Sep 05 '16 at 00:57

sjsam · Answer 2 · 2016-09-05T01:40:57.630

If you have GNU awk (gawk) at your disposal, you may try something like the following:

With awk

gawk 'BEGIN{printf "ID\tProduct%s",RS}
     {printf "%s\t%s%s",gensub(/^.*[[:blank:]]+ID=([^;]*);.*$/,"\\1","1",$0),
      gensub(/^.*;product=([^;]*)[;]*.*$/,"\\1","1",$0),RS}
    ' test.gff | expand -t20

Output

ID                  Product
AZ909_00020         transcriptional regulator
AZ909_00025         FAD/NAD(P)-binding oxidoreductase
AZ909_00145         gamma-glutamyl-phosphate reductase
AZ909_00030         hypothetical protein

As you may have noticed, the two gensub calls are doing the heavy lifting here.

In gensub(/^.*[[:blank:]]+ID=([^;]*);.*$/,"\\1","1",$0), everything apart from the text between ID= and the first semicolon that follows is stripped off from the record (see $0). Note that gensub doesn't modify the record itself; it just returns the modified string, which is then printed.
In gensub(/^.*;product=([^;]*)[;]*.*$/,"\\1","1",$0), similarly, everything apart from the text between product= and the first semicolon (or the end of the line) is stripped off.
Finally, we've used expand -t to increase the tab width and get nicely formatted output.
Since I prefer not to hardcode \n, I've used the built-in record separator variable RS (a newline by default) to print the newline after each record.
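
As a quick illustration of the expand step on its own, here is a minimal sketch that feeds it a tab-separated header and one row, the way the awk program would emit them. The variable name aligned is just for the example.

```shell
# expand -t20 replaces each tab with spaces up to the next multiple of 20 columns.
aligned=$(printf 'ID\tProduct\nAZ909_00020\ttranscriptional regulator\n' | expand -t20)

printf '%s\n' "$aligned"
```

The second column starts at column 21 on every line as long as the first column is shorter than 20 characters; a longer ID would push its product out to the next tab stop at column 41.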

A sed solution using similar logic is below:

Using sed

printf "%-20s%s\n" "ID" "Product"
sed -E "s/^.*[[:blank:]]+ID=([^;]*);.*;product=([^;]*)[;]*.*$/\\1\t\\2/" test.gff | expand -t20

Output

ID                  Product
AZ909_00020         transcriptional regulator
AZ909_00025         FAD/NAD(P)-binding oxidoreductase
AZ909_00145         gamma-glutamyl-phosphate reductase
AZ909_00030         hypothetical protein

Considering that you have already been given a short and elegant perl solution, you might consider using that too, if you have perl at your disposal.

A side note: using \n with printf makes the script less portable.

Ed Morton · Answer 3 · 2016-09-05T06:55:37.857

The main problem with your regexp was using .* instead of [^;]* since .* will match all characters but you just want to match non-semi-colons. Try this:

$ sed -E 's/.*ID=([^;]+).*product=([^;]+).*/\1\t\2/' file
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein

or:

$ awk -F'[=;]' -v OFS='\t' '{print $2, $6}' file
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein

and you can extract the header values easily with awk too:

$ awk -F'[=;]' -v OFS='\t' 'NR==1{sub(/.* /,"",$1); print $1, $5} {print $2, $6}' file
ID      product
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein
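
To make the .* versus [^;]* point from the start of this answer concrete, here is a small sketch that captures only the ID from the attribute part of the first sample line. The variable names (attrs, greedy, bounded) are just for illustration.

```shell
attrs='ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator'

# Greedy (.*) runs on to the LAST semicolon that still lets the rest match,
# so the capture swallows the locus_tag attribute as well.
greedy=$(printf '%s\n' "$attrs" | sed -E 's/.*ID=(.*);.*/\1/')

# ([^;]+) cannot cross a semicolon, so it stops at the end of the ID value.
bounded=$(printf '%s\n' "$attrs" | sed -E 's/.*ID=([^;]+).*/\1/')

printf '%s\n' "$greedy" "$bounded"
```

The first line printed is AZ909_00020;locus_tag=AZ909_00020, the second is just AZ909_00020.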

score 0 · Answer 4 · answered Sep 05 '16 at 03:38

Another in awk. We add ";" to the list of field separators (FS), strip off strings "ID=" and "product=" and print fields 9 and 10:

$ awk -F'([ \t\n]+|;)' 'BEGIN{print "ID" OFS "Product"}{gsub(/product=|ID=/,""); print $9,$10}' test.gff
ID Product
AZ909_00020 locus_tag=AZ909_00020
AZ909_00025 locus_tag=AZ909_00025
AZ909_00145 locus_tag=AZ909_00145
AZ909_00030 locus_tag=AZ909_00030
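
In the output above the second column holds the locus_tag value, because with this separator the product ends up in field 11 and is itself split at the space. A possible variation, assuming the file is really tab-separated (as GFF files normally are) with the attributes in the 9th column, is to look the values up by key instead of by position. The function name gff_id_product is made up for this sketch.

```shell
# Hypothetical helper: assumes the attributes are the 9th tab-separated
# column and that each attribute has the form key=value.
gff_id_product() {
  awk -F'\t' -v OFS='\t' '
    BEGIN { print "ID", "Product" }
    {
      split("", attr)                     # clear the map for this record
      n = split($9, pairs, ";")           # one key=value pair per element
      for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")
        if (eq) attr[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
      }
      print attr["ID"], attr["product"]
    }'
}

out=$(printf 'accn|CP014704\tRefSeq\tCDS\t31437\t32681\t.\t+\t0\tID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA\n' | gff_id_product)

printf '%s\n' "$out"
```

Because the lookup is by key, the spaces inside the product value and the trailing gene= attribute do not affect the result.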
