Select only a part of string ("name=number" ) with sed

Question

I have something like

chr1 162724289 162724421 CAAAATGTTTATAAGGACAGCCTGCTCTCTCCCCTCAGTACAGGGCAGCTGCTTGCCTGTGAACCAGTAAACAGCTCTGTGGTTTCATGGTTGCTCCCTCTCTCCCCAACCCTCACCTCTCAAGGCTGGACT chr1 162724414 162724421 ID=exon:ENST00000367921.3:5;Parent=ENST00000367921.3;gene_id=ENSG00000162733.12;transcript_id=ENST00000367921.3;gene_type=protein_coding;gene_status=KNOWN;gene_name=DDR2;transcript_type=protein_coding;transcript_status=KNOWN;transcript_name=DDR2-002;exon_number=5;exon_id=ENSE00001165686.1;level=2;protein_id=ENSP00000356898.3;ccdsid=CCDS1241.1;havana_gene=OTTHUMG00000034423.4;havana_transcript=OTTHUMT00000097650.1;tag=basic,appris_principal,CCDS

I would like to extract only the exon_number=5 from the 8th column. Each record is one long line and, since I want to keep the other columns, I guess that I cannot use awk -F ';'. I tried something like:

sed -E 's/ ID=*$exon_number=[0-9]$* \1/'

Desired output:

chr1 162724289 162724421 CAAAATGTTTATAAGGACAGCCTGCTCTCTCCCCTCAGTACAGGGCAGCTGCTTGCCTGTGAACCAGTAAACAGCTCTGTGGTTTCATGGTTGCTCCCTCTCTCCCCAACCCTCACCTCTCAAGGCTGGACT chr1 162724414 162724421 exon_number=5

Any advice would be great! Thanks

With `sed -E 's/.*\<(exon_number=[0-9]+).*/\1/'`, I can only extract [`exon_number=26`](https://ideone.com/GlEfR1). How come you need `7`? — Wiktor Stribiżew, Jul 04 '18 at 14:09
Sorry @WiktorStribiżew it was a mistake. Actually is `exon_number=26`. I will edit the question. — Tato14, Jul 05 '18 at 06:16
Hi @BenjaminW. That should be a good answer if I only want to print this. But, since I have more data in other columns that I would like to keep this is not my best option. That's why I asked for `sed` — Tato14, Jul 05 '18 at 06:19
Try [`sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/'`](https://ideone.com/jvjJAp) — Wiktor Stribiżew, Jul 05 '18 at 06:35
I will cast a reopen vote since the question is not about `grep`ping a value from text, and [Egrep/Sed: return only the regex match, not the whole line](https://stackoverflow.com/questions/18539494) does not solve the issue. — Wiktor Stribiżew, Jul 05 '18 at 06:44
Thanks @WiktorStribiżew! That solved my problem. Capturing the first part was something I hadn't thought of (I'm kind of a newbie :P). — Tato14, Jul 05 '18 at 06:49
@WiktorStribiżew The duplicate isn't a correct duplicate any longer, that's right. I'd also argue that it was before the question was edited to change input, expected output and wording, though. — Benjamin W., Jul 05 '18 at 13:14

score 2 · Answer 1 · answered Jul 05 '18 at 13:38

With sed, you may match and remove exactly what you want:

sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/'

See the online sed demo

Explanation

-E - POSIX ERE syntax enabling option
(.* )ID=[^[:space:]]*(exon_number=[0-9]+).* - a regex pattern matching:
- (.* ) - Group 1: any 0+ chars, as many as possible, and then a space
- ID=[^[:space:]]* - ID= and 0+ non-whitespace chars
- (exon_number=[0-9]+) - exon_number= and 1 or more digits (Group 2)
- .* - the rest of the line
\1\2 - the replacement pattern inserts the contents of Group 1 and 2 into the resulting string.
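
To see the two groups at work, here is a minimal sketch that runs the command on a shortened version of the sample line. The abbreviated sequence `CAAAATG` and the trimmed attribute list are assumptions made for readability, and the `line` and `result` variable names are only illustrative. It assumes a sed that accepts `-E`, such as GNU or BSD sed.

```shell
# Shortened sample line: the sequence and the attribute list are abbreviated for illustration.
line='chr1 162724289 162724421 CAAAATG chr1 162724414 162724421 ID=exon:ENST00000367921.3:5;Parent=ENST00000367921.3;exon_number=5;exon_id=ENSE00001165686.1;level=2'

# Group 1 keeps everything up to and including the space before ID=.
# Group 2 keeps exon_number=N; the rest of the attribute column is dropped.
result=$(printf '%s\n' "$line" | sed -E 's/(.* )ID=[^[:space:]]*(exon_number=[0-9]+).*/\1\2/')

printf '%s\n' "$result"
```

Lines that contain no ` ID=...exon_number=N` part do not match the pattern, so sed prints them unchanged.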

RavinderSingh13 · Accepted Answer · 2018-07-05T07:07:25.940

1

EDIT: Since the OP changed the requirements, here is a solution for the updated question.

awk -F";" 'match($0,/exon_number=[0-9]+/){val=$1;sub(/ ID.*/,"",val);print val,substr($0,RSTART,RLENGTH)}'  Input_file

The following simple awk may help you here.

awk 'match($0,/exon_number=[0-9]+/){print substr($0,RSTART,RLENGTH)}' Input_file

Solution 2: If your Input_file always has the same layout, simply print the value by field number.

awk -F";" '{print $11}'  Input_file
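
If the other columns have to be kept, one possible variant is to rewrite only the 8th whitespace-separated field and print the whole line. This is a sketch under assumptions, not part of the original answer: it assumes the attributes always sit in field 8, that the columns are separated by spaces or tabs, and it uses a shortened sample line with illustrative `line` and `result` variables.

```shell
# Shortened sample line: the sequence and the attribute list are abbreviated for illustration.
line='chr1 162724289 162724421 CAAAATG chr1 162724414 162724421 ID=exon:ENST00000367921.3:5;Parent=ENST00000367921.3;exon_number=5;level=2'

# If field 8 contains exon_number=N, replace the field with just that part.
# The second block prints every line; awk rejoins the fields with single spaces
# whenever a field has been reassigned.
result=$(printf '%s\n' "$line" | awk 'match($8, /exon_number=[0-9]+/) { $8 = substr($8, RSTART, RLENGTH) } { print }')

printf '%s\n' "$result"
```

Note that reassigning `$8` makes awk rebuild the record with its output field separator, so tab-separated input would come out space-separated unless `OFS` is set accordingly.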

edited Jul 05 '18 at 07:07

answered Jul 04 '18 at 14:08

RavinderSingh13


Thanks @RavinderSingh13 for the answer. It's great but this kind of command prints me only this column and I would like to keep the rest. I guess I explained myself very badly. I updated my question with more information to make myself clearer. – Tato14 Jul 05 '18 at 06:29
@Tato14, building on this answer, what about `awk 'match($8,/exon_number=[0-9]+/){$8=substr($8,RSTART,RLENGTH)}' Input_file`? If you don't want to print, don't print. It's your script, after all, you can do what you like. – ghoti Jul 05 '18 at 06:33
Hi @ghoti, thanks for the answer. I guess I did not make myself clear again. In this case I found a good option in the comments. – Tato14 Jul 05 '18 at 06:51
@Tato14, try my EDIT solution and let me know on same then? – RavinderSingh13 Jul 05 '18 at 07:07
I'm not sure how `print $11` would achieve the "Desired output" you included in your question. But if this is the solution that you *want*, then perhaps you could update your question to match, so that future generations reading this are not confused by the mismatched Q and A. – ghoti Jul 05 '18 at 14:46
