Our team is going to re-run a gene ontology analysis, and the data format has changed between versions. Manual reformatting is too inefficient.

The old format has a separate line for each GO ID (the "GO:" values):

ENSIPUG00000001371 ;GO:0008236
ENSIPUG00000001371 ;GO:0008233
ENSIPUG00000001371 ;GO:0070011
ENSIPUG00000001371 ;GO:0016787
ENSIPUG00000001371 ;GO:0017171
ENSIPUG00000001371 ;GO:0140096
ENSIPUG00000001374 ;GO:0005515
ENSIPUG00000001374 ;GO:0003674
ENSIPUG00000001374 ;GO:0005488
ENSIPUG00000001375 ;GO:0008152
ENSIPUG00000001375 ;GO:0008150
ENSIPUG00000001375 ;GO:0016758

The new format places related GO IDs (those with the same ENSIPUG) on the same line:

ENSIPUG00000001371  GO:0008236; GO:0008233; GO:0070011; GO:0016787; GO:0017171; GO:0140096
ENSIPUG00000001374  GO:0005515; GO:0003674; GO:0005488
ENSIPUG00000001375  GO:0008152; GO:0008150; GO:0016758

How can the old format be converted to the new one? PS: The spacing, the semicolons, and the accurate grouping of all terms are very important.

Things we've tried so far: We have tried using regular expressions, but cannot seem to get the correct grouping of the ENSIP*** IDs with the GO terms.

We also used the code below, sorted the result by the ENSIP*** values, and then used Excel's find-and-replace tool to remove the duplicates. That is the current inefficient solution.

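# Read the annotation table; the columns gene_id and go_id are used below.
go = read.delim("go_hnh2.txt")

# Build one "gene_id ; go_id" string per row.
go$combo = paste(go$gene_id, ";", go$go_id)

gi = data.frame(go$gene_id)
gi2 = data.frame(go$go_id)
combo = data.frame(go$combo)

# merge by row names (by=0 or by="row.names")
#combo3=merge(gi, gi2, by="row.names", all=TRUE)

# Write the combined strings out; duplicates are then removed by hand in Excel.
write.csv(combo, file="go_edit.csv")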
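
For reference, the grouping asked about above could be sketched in base R roughly as follows. This is only a minimal sketch, assuming every input line has exactly the shape shown in the old-format example (gene ID, a space, a semicolon, one GO ID); the names `old_lines`, `parts`, `grouped`, and `new_lines` are made up for illustration.

```r
# Hypothetical sample in the old format (one GO ID per line).
old_lines <- c(
  "ENSIPUG00000001371 ;GO:0008236",
  "ENSIPUG00000001371 ;GO:0008233",
  "ENSIPUG00000001374 ;GO:0005515",
  "ENSIPUG00000001374 ;GO:0003674",
  "ENSIPUG00000001374 ;GO:0005488",
  "ENSIPUG00000001375 ;GO:0008152",
  "ENSIPUG00000001375 ;GO:0008150",
  "ENSIPUG00000001375 ;GO:0016758"
)

# Split each line at the semicolon, dropping the whitespace around it.
parts   <- strsplit(old_lines, "[[:space:]]*;[[:space:]]*")
gene_id <- vapply(parts, `[`, character(1), 1)
go_id   <- vapply(parts, `[`, character(1), 2)

# Keep genes in order of first appearance instead of alphabetical order.
gene_id <- factor(gene_id, levels = unique(gene_id))

# Collapse the GO IDs of each gene into one "; "-separated string.
grouped <- tapply(go_id, gene_id, paste, collapse = "; ")

# Gene ID, two spaces, then the GO list, as in the new-format example.
new_lines <- paste0(names(grouped), "  ", grouped)
writeLines(new_lines)
```

If the data are already in a data frame with `gene_id` and `go_id` columns, as in the `read.delim()` attempt, the parsing step can presumably be skipped and `go$go_id` and `go$gene_id` passed to `tapply()` directly. `writeLines()` also accepts a file path as its second argument, so the result can be written to a text file without going through Excel.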
Hana Hess
  • Welcome to Stack Overflow. Please [make this question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including example data in a plain text format. We cannot copy/paste data from images. Also your second example looks incomplete. – neilfws Feb 01 '22 at 00:43
  • How is the output related to the input? Is the output correct? – Onyambu Feb 01 '22 at 00:48
  • So ... `substring("ENSIPUG00000001371 ;GO:0008236", 1, 9)` returns `"ENSIPUG00"`, which means that `substring(vec, 1, 9)` will do the same for all strings in a `character` vector named `vec`. Perhaps I'm missing something? – r2evans Feb 01 '22 at 01:04
  • I suggest that you clarify the expected outputs. It is easy to subset a string as r2evans pointed out, so the current expected output does not make sense to me. I guess you failed to share the correct expected output. Perhaps add the expected output in a reproducible format, not an image. – www Feb 01 '22 at 02:42
  • The ENSI... values in your example data aren't in your desired output, so this *still* isn't reproducible, because we can't see how you're getting from one to the other. Some explanation in words of what exactly you're trying to do would help make this clearer – camille Feb 01 '22 at 20:19
  • Cross-posted at https://www.biostars.org/p/9508848/ – zx8754 Feb 02 '22 at 21:54
