OK, I've found similar answers on SO, but my sed/grep/awk-fu is too weak to adapt them to my task. Given this file, "test.gff":
accn|CP014704 RefSeq CDS 403 915 . + 0 ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator
accn|CP014704 RefSeq CDS 928 2334 . + 0 ID=AZ909_00025;locus_tag=AZ909_00025;product=FAD/NAD(P)-binding oxidoreductase
accn|CP014704 RefSeq CDS 31437 32681 . + 0 ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA
accn|CP014704 RefSeq CDS 2355 2585 . + 0 ID=AZ909_00030;locus_tag=AZ909_00030;product=hypothetical protein
I want to extract two values: 1) the text to the right of "ID=" up to the semicolon, and 2) the text to the right of "product=" up to either the end of the line or a semicolon (as you can see, one of the lines also has a "gene=" value).
So I want something like this:
ID product
AZ909_00020 transcriptional regulator
AZ909_00025 FAD/NAD(P)-binding oxidoreductase
AZ909_00145 gamma-glutamyl-phosphate reductase
This is as far as I got:
printf "ID\tproduct\n"
sed -nr 's/^.*ID=(.*);.*product=(.*);/\1\t\2\p/' test.gff
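For reference, a minimal sketch of how the attempt above might be repaired, assuming GNU sed (for `-E` and for `\t` in the replacement) and assuming "ID=" always appears before "product=" on a line. It makes three changes: `[^;]*` replaces `.*` so each capture stops at the first semicolon, the semicolon after the product is no longer required because most lines end there, and the `p` flag goes after the closing `/` instead of sitting inside the replacement as `\p`. The function name `extract_id_product` is made up for this example.

```shell
# Hypothetical helper: print a header, then "ID<TAB>product" for each line of a GFF-like file.
# Assumes GNU sed and that ID= comes before product= on every line of interest.
extract_id_product() {
    printf 'ID\tproduct\n'
    # [^;]* captures up to (not including) the next semicolon or the end of the line;
    # the trailing .*$ discards anything after the product, e.g. ";gene=proA".
    # -n plus the p flag prints only lines where the substitution matched.
    sed -nE 's/^.*ID=([^;]*);.*product=([^;]*).*$/\1\t\2/p' "$1"
}
```

Called as `extract_id_product test.gff`, this should print the header followed by one tab-separated ID/product pair for each line of the sample file.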
Thanks!