
I'm new to sed and completely stuck on the following: I am trying to replace a pattern in the second column with sed. The pattern occurs multiple times per line.

I have:

Gene1 GO:0000045^biological_process^autophagosome assembly`GO:0005737^cellular_component^cytoplasm
Gene2 GO:0000030^molecular_function^mannosyltransferase activity`GO:0006493^biological_process^protein O-linked glycosylation`GO:0016020^cellular_component^membrane

I want to get:

Gene1 GO:0000045,GO:0005737
Gene2 GO:0000030,GO:0006493,GO:0016020

That is, I want to get rid of all the descriptive parts and use "," as the delimiter. I chose sed because I thought it would be easy to match the text between ^ and `. Instead, it removes all the leading GO terms.

Code:

sed -E 's/(^)'.+'(`)/,/g'

Can someone help me?

T_R

3 Answers


Try this, shown in two steps for illustration:

$ # showing how to remove from ^ to ` and replace with ,
$ sed 's/\^[^`]*`/,/g' ip.txt
Gene1 GO:0000045,GO:0005737^cellular_component^cytoplasm
Gene2 GO:0000030,GO:0006493,GO:0016020^cellular_component^membrane

$ # removing remaining data from ^ to end of line as well
$ sed 's/\^[^`]*`/,/g; s/\^.*//' ip.txt
Gene1 GO:0000045,GO:0005737
Gene2 GO:0000030,GO:0006493,GO:0016020
  • since ^ is a metacharacter, use \^ to match it literally
  • [^`]* will match zero or more characters other than `
  • don't use \^.*` because quantifiers are greedy: it would delete everything from the first ^ to the last backtick in the line
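To see the difference the last point makes, here is a small sketch that runs both patterns on the second sample line. The variable names (`line`, `greedy`, `bounded`) are only for illustration.

```shell
# Second sample line from the question
line='Gene2 GO:0000030^molecular_function^mannosyltransferase activity`GO:0006493^biological_process^protein O-linked glycosylation`GO:0016020^cellular_component^membrane'

# Greedy: .* runs from the first ^ to the LAST backtick, swallowing GO:0006493
greedy=$(printf '%s\n' "$line" | sed 's/\^.*`/,/g')

# Negated class: [^`]* stops at the NEXT backtick, so each description is removed separately
bounded=$(printf '%s\n' "$line" | sed 's/\^[^`]*`/,/g')

printf '%s\n' "$greedy" "$bounded"
# Gene2 GO:0000030,GO:0016020^cellular_component^membrane
# Gene2 GO:0000030,GO:0006493,GO:0016020^cellular_component^membrane
```

The middle GO term survives only with the negated character class.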
Sundeep

sed -e 's/\^[^`]*//g' -e 's/`/,/g' your_file

The first command removes (substitutes with nothing) each ^ together with every character after it, up to but not including the next `.

The second command replaces each ` with ,
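As a rough illustration, the two commands can be run one at a time on the first sample line to see the intermediate result. The variable names (`input`, `step1`, `step2`) are just for this sketch.

```shell
# First sample line from the question
input='Gene1 GO:0000045^biological_process^autophagosome assembly`GO:0005737^cellular_component^cytoplasm'

# Step 1: drop every run that starts at ^ and stops before the next backtick (or end of line)
step1=$(printf '%s\n' "$input" | sed -e 's/\^[^`]*//g')

# Step 2: turn the remaining backticks into commas
step2=$(printf '%s\n' "$step1" | sed -e 's/`/,/g')

printf '%s\n' "$step1" "$step2"
# Gene1 GO:0000045`GO:0005737
# Gene1 GO:0000045,GO:0005737
```

Because [^`]* also matches the inner ^, each description and its category are removed in a single match, and the trailing part after the last GO ID is removed the same way.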

lojza

Identifying the individual fields and then operating on each of those would probably be more useful long-term than just identifying parts of each line with regexps:

$ awk -F'^' -v OFS=',' '{print NR") "$0; for (i=1;i<=NF;i++) print "\t"i") "$i}' file
1) Gene1 GO:0000045^biological_process^autophagosome assembly`GO:0005737^cellular_component^cytoplasm
        1) Gene1 GO:0000045
        2) biological_process
        3) autophagosome assembly`GO:0005737
        4) cellular_component
        5) cytoplasm
2) Gene2 GO:0000030^molecular_function^mannosyltransferase activity`GO:0006493^biological_process^protein O-linked glycosylation`GO:0016020^cellular_component^membrane
        1) Gene2 GO:0000030
        2) molecular_function
        3) mannosyltransferase activity`GO:0006493
        4) biological_process
        5) protein O-linked glycosylation`GO:0016020
        6) cellular_component
        7) membrane

Then, operating on those fields to produce the desired output:

$ awk -F'^' -v OFS=',' '{out=$1; for (i=2;i<=NF;i++) if (sub(/.*`/,"",$i)) out=out OFS $i; print out}' file
Gene1 GO:0000045,GO:0005737
Gene2 GO:0000030,GO:0006493,GO:0016020
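For readers newer to awk, the one-liner can be spread over several lines with comments. This sketch uses the same logic but reads the two sample lines from a here-document instead of `file`; the variable name `result` is only for illustration.

```shell
result=$(awk -F'^' -v OFS=',' '
{
    out = $1                          # everything before the first ^: gene name and first GO ID
    for (i = 2; i <= NF; i++) {
        # sub() returns 1 only if the field contained a backtick, which means
        # it ends with the next GO ID; the description before it is stripped
        if (sub(/.*`/, "", $i)) {
            out = out OFS $i
        }
    }
    print out
}' <<'EOF'
Gene1 GO:0000045^biological_process^autophagosome assembly`GO:0005737^cellular_component^cytoplasm
Gene2 GO:0000030^molecular_function^mannosyltransferase activity`GO:0006493^biological_process^protein O-linked glycosylation`GO:0016020^cellular_component^membrane
EOF
)
printf '%s\n' "$result"
```

Fields without a backtick (the category names and the final description) are skipped, so only the GO IDs are appended to the output.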
Ed Morton