I have a dataset that looks something like this:

chr1    StringTie   exon    197757319   197757401   1000    +   .   gene_id "MSTRG.10429"; transcript_id "ENST00000440885.1"; exon_number "1"; gene_name "RP11-448G4.4"; ref_gene_id "ENSG00000224901.1";
chr1    StringTie   exon    197761802   197761965   1000    +   .   gene_id "MSTRG.10429"; transcript_id "ENST00000440885.1"; exon_number "2"; gene_name "RP11-448G4.4"; ref_gene_id "ENSG00000224901.1";
chr9    StringTie   exon    63396911    63397070    1000    -   .   gene_id "MSTRG.145111"; transcript_id "MSTRG.145111.1"; exon_number "1";
chr9    StringTie   exon    63397111    63397185    1000    -   .   gene_id "MSTRG.145111"; transcript_id "MSTRG.145111.1"; exon_number "2";
chr21   StringTie   exon    44884690    44884759    1000    +   .   gene_id "MSTRG.87407"; transcript_id "MSTRG.87407.1"; exon_number "1";
chr22   HAVANA  exon    19667023    19667199    .   +   .   gene_id "ENSG00000225007.1"; transcript_id "ENST00000452326.1"; exon_number "1"; gene_name "AC000067.1";
chr22   HAVANA  exon    19667446    19667555    .   +   .   gene_id "ENSG00000225007.1"; transcript_id "ENST00000452326.1"; exon_number "2"; gene_name "AC000067.1";

I want to isolate the gene_ids. Therefore, the desired output is:

MSTRG.10429
MSTRG.10429
MSTRG.145111
MSTRG.145111
MSTRG.87407
ENSG00000225007.1
ENSG00000225007.1

I've tried the following:

grep -E -o "gene_id.{0,20}" gtf_om_ENSGids_te_vinden.gtf > alle_gene_ids.txt

With this I can grep the 20 characters after "gene_id", and I planned to strip the characters that do not belong to the answer afterwards, such as parts of the word "transcript". However, the ref_gene_ids are matched as well, and they do not belong in the desired output. I tried to solve this by adding the -w flag, but that does not give the right result either. Can anyone help?

Thanks!

2 Answers

GNU grep, using the perl regex flag:

grep -Po '(?<=\Wgene_id ")[^"]+'

POSIX sed:

sed -En 's/.*[^[:alnum:]_]gene_id "([^"]+).*/\1/p'

If there are multiple occurrences per line, the grep will print all of them, but the sed will print the last occurrence only.
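To see that difference in practice, here is a minimal sketch. It assumes GNU grep built with PCRE support (for -P) and a sed that accepts -E; the input line with two standalone gene_id attributes is made up for illustration and does not come from the dataset in the question.

```shell
# Hypothetical record with two standalone gene_id attributes (not from the real file)
line=$(printf 'chr1\tsrc\texon\t1\t2\t.\t+\t.\tgene_id "A.1"; gene_id "B.2";')

# grep -o semantics: every match is printed on its own line
grep_ids=$(printf '%s\n' "$line" | grep -Po '(?<=\Wgene_id ")[^"]+')

# sed prints one result per input line; the greedy leading .* pushes the match to the last occurrence
sed_ids=$(printf '%s\n' "$line" | sed -En 's/.*[^[:alnum:]_]gene_id "([^"]+).*/\1/p')

printf 'grep:\n%s\nsed:\n%s\n' "$grep_ids" "$sed_ids"
```

For the data shown in the question, each line carries exactly one standalone gene_id attribute, so both commands should give the same list.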

dan

Use:

grep -o -E ' gene_id \"([^"]*)\"' gtf_om_ENSGids_te_vinden.gtf  | sed -E 's/gene_id|"| //g'
  • the space in ' gene_id is needed to make sure the ref_gene_id is not matched.
  • The sed part will remove gene_id, the space, and the double quotes.

see: https://regex101.com/r/TDA7Cg/1

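EDIT: The columns are separated by tabs, so gene_id is preceded by a tab, not a space. Inside a bracket expression grep does not treat \t as a tab, so use the [[:space:]] class instead.

Change it to

grep -o -E '[[:space:]]gene_id "[^"]*"' gtf_om_ENSGids_te_vinden.gtf | sed -E 's/gene_id|"|[[:space:]]//g'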

or, to just find the start of the word, you could do (\W is a GNU extension; the sed part now also strips the leading tab that \W matches)

grep -o -E '\Wgene_id "[^"]*"' gtf_om_ENSGids_te_vinden.gtf | sed -E 's/gene_id|"|[[:space:]]//g'
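As a quick check, the sketch below runs a whitespace-anchored variant of this pipeline (the [[:space:]] class in place of the literal space) against the first record from the question, rebuilt here with real tabs between the columns. It assumes a grep that supports -o (GNU and BSD grep do); the variable names are only for illustration.

```shell
# First record from the question: columns separated by tabs, attributes by spaces
sample=$(printf 'chr1\tStringTie\texon\t197757319\t197757401\t1000\t+\t.\tgene_id "MSTRG.10429"; transcript_id "ENST00000440885.1"; exon_number "1"; gene_name "RP11-448G4.4"; ref_gene_id "ENSG00000224901.1";\n')

# Require whitespace directly before gene_id, so ref_gene_id (preceded by "_") is skipped;
# sed then strips the keyword, the quotes and any whitespace from each match
ids=$(printf '%s\n' "$sample" \
  | grep -o -E '[[:space:]]gene_id "[^"]*"' \
  | sed -E 's/gene_id|"|[[:space:]]//g')

printf '%s\n' "$ids"
```

Only MSTRG.10429 is printed; the ref_gene_id value ENSG00000224901.1 is left out.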

But still the accepted answer is a nicer way to do it ...

Luuk
  • Thanks. However, this does not work yet, I think because a tab rather than a space is used in front of the "gene_id" part. When I use the code like this I don't receive any output – Jasmin Jonson Dec 18 '21 at 21:08