
I have something like

chr1 162724289 162724421 CAAAATGTTTATAAGGACAGCCTGCTCTCTCCCCTCAGTACAGGGCAGCTGCTTGCCTGTGAACCAGTAAACAGCTCTGTGGTTTCATGGTTGCTCCCTCTCTCCCCAACCCTCACCTCTCAAGGCTGGACT chr1 162724414 162724421 ID=exon:ENST00000367921.3:5;Parent=ENST00000367921.3;gene_id=ENSG00000162733.12;transcript_id=ENST00000367921.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=DDR2;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=DDR2-002;exon_number=5;exon_id=ENSE00001165686.1;level=2;protein_id=ENSP00000356898.3;ccdsid=CCDS1241.1;havana_gene=OTTHUMG00000034423.4;havana_transcript=OTTHUMT00000097650.1;tag=basic,appris_principal,CCDS

I would like to keep only exon_number=5 from the 8th column. Each line is very long and, since I want to keep the other columns as they are, I don't think I can use awk -F ';'. I tried something like:

sed -E 's/ ID=*\(exon_number=[0-9]\)* \1/'

Desired output:

chr1 162724289 162724421 CAAAATGTTTATAAGGACAGCCTGCTCTCTCCCCTCAGTACAGGGCAGCTGCTTGCCTGTGAACCAGTAAACAGCTCTGTGGTTTCATGGTTGCTCCCTCTCTCCCCAACCCTCACCTCTCAAGGCTGGACT chr1 162724414 162724421 exon_number=5

Any advice would be great! Thanks

Tato14
  • `grep -Eo 'exon_number=[[:digit:]]+'`? – Benjamin W. Jul 04 '18 at 14:06
  • With `sed -E 's/.*\<(exon_number=[0-9]+).*/\1/'`, I can only extract [`exon_number=26`](https://ideone.com/GlEfR1). How come you need `7`? – Wiktor Stribiżew Jul 04 '18 at 14:09
  • Sorry @WiktorStribiżew, it was a mistake. It is actually `exon_number=26`. I will edit the question. – Tato14 Jul 05 '18 at 06:16
  • Hi @BenjaminW. That should be a good answer if I only want to print this. But, since I have more data in other columns that I would like to keep this is not my best option. That's why I asked for `sed` – Tato14 Jul 05 '18 at 06:19
  • Why do you have `ID=` in your pattern? – Wiktor Stribiżew Jul 05 '18 at 06:33
  • Try [`sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/'`](https://ideone.com/jvjJAp) – Wiktor Stribiżew Jul 05 '18 at 06:35
  • I will cast a reopen vote since the question is not about `grep`ping a value from text, and [Egrep/Sed: return only the regex match, not the whole line](https://stackoverflow.com/questions/18539494) does not solve the issue. – Wiktor Stribiżew Jul 05 '18 at 06:44
  • Thanks @WiktorStribiżew! That solved my problem. Capturing the first part was something I hadn't thought of (I'm kind of a newbie :P). – Tato14 Jul 05 '18 at 06:49
  • @WiktorStribiżew The duplicate isn't a correct duplicate any longer, that's right. I'd also argue that it was before the question was edited to change input, expected output and wording, though. – Benjamin W. Jul 05 '18 at 13:14

2 Answers


With sed, you may match and remove exactly what you want:

sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/'

See the online sed demo

Explanation

  • -E - POSIX ERE syntax enabling option
  • (.* )ID=[^[:space:]]*(exon_number=[0-9]+).* - a regex pattern matching:
    • (.* ) - Group 1: any 0+ chars, as many as possible, and then a space
    • ID=[^[:space:]]* - ID= and 0+ non-whitespace chars
    • (exon_number=[0-9]+) - exon_number= and 1 or more digits (Group 2)
    • .* - the rest of the line
  • \1\2 - the replacement pattern inserts the contents of Group 1 and 2 into the resulting string.
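
As a quick check of the explanation above, here is a minimal sketch that runs the same command on a shortened line. The sample line is hypothetical: it keeps the 8-column layout from the question but truncates the sequence and the attribute list, and it assumes the columns are separated by single spaces and that `ID=` starts the last column.

```shell
# Hypothetical, shortened sample line (same 8-column layout as in the question).
line='chr1 162724289 162724421 CAAAATG chr1 162724414 162724421 ID=exon:ENST00000367921.3:5;Parent=ENST00000367921.3;exon_number=5;exon_id=ENSE00001165686.1;level=2'

# Group 1 keeps everything up to and including the space before ID=,
# Group 2 keeps the exon_number=<digits> pair; the rest of the line is dropped.
result=$(printf '%s\n' "$line" | sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/')

printf '%s\n' "$result"
```

Lines that do not contain both `ID=` and `exon_number=` do not match the pattern, so sed prints them unchanged.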
Wiktor Stribiżew

EDIT: Since the OP changed the requirement, here is a solution that matches the updated question.

awk -F";" 'match($0,/exon_number=[0-9]+/){val=$1;sub(/ ID.*/,"",val);print val,substr($0,RSTART,RLENGTH)}'  Input_file

If you only need the matched value itself, the following simple awk prints it.

awk 'match($0,/exon_number=[0-9]+/){print substr($0,RSTART,RLENGTH)}' Input_file

Solution 2: If your Input_file always has the same layout, so that exon_number is always the 11th semicolon-separated field, you can simply print that field.

awk -F";" '{print $11}'  Input_file
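
For the updated requirement (keep the other columns), a field-based variant is also possible. The following is only a sketch, not part of the original answer: it assumes whitespace-separated input with the attribute string in field 8, and the sample line is a hypothetical, shortened version of the one in the question. Note that awk rebuilds the line with single spaces between fields once a field is assigned, so any other spacing in the input would not be preserved.

```shell
# Hypothetical, shortened sample line (same 8-column layout as in the question).
line='chr1 162724289 162724421 CAAAATG chr1 162724414 162724421 ID=exon:ENST00000367921.3:5;Parent=ENST00000367921.3;exon_number=5;exon_id=ENSE00001165686.1;level=2'

# If field 8 contains exon_number=<digits>, replace the whole field with that match.
# The second block prints every line, matched or not.
result=$(printf '%s\n' "$line" | awk 'match($8, /exon_number=[0-9]+/) { $8 = substr($8, RSTART, RLENGTH) } { print }')

printf '%s\n' "$result"
```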
RavinderSingh13
  • Thanks @RavinderSingh13 for the answer. It's great but this kind of command prints me only this column and I would like to keep the rest. I guess I explained myself very badly. I updated my question with more information to make myself clearer. – Tato14 Jul 05 '18 at 06:29
  • @Tato14, building on this answer, what about `awk 'match($8,/exon_number=[0-9]+/){$8=substr($8,RSTART,RLENGTH)}' Input_file`? If you don't want to print, don't print. It's your script, after all, you can do what you like. – ghoti Jul 05 '18 at 06:33
  • Hi @ghoti, thanks for the answer. I guess I didn't make myself clear again. In this case I found a good option in the comments. – Tato14 Jul 05 '18 at 06:51
  • @Tato14, could you try my EDIT solution and let me know how it goes? – RavinderSingh13 Jul 05 '18 at 07:07
  • I'm not sure how `print $11` would achieve the "Desired output" you included in your question. But if this is the solution that you *want*, then perhaps you could update your question to match, so that future generations reading this are not confused by the mismatched Q and A. – ghoti Jul 05 '18 at 14:46