Our team is going to re-run a gene ontology analysis, and the data format has changed between versions. Manual reformatting is too inefficient.
The old format has a separate line for each GO ID (the "GO:" values):
ENSIPUG00000001371 ;GO:0008236
ENSIPUG00000001371 ;GO:0008233
ENSIPUG00000001371 ;GO:0070011
ENSIPUG00000001371 ;GO:0016787
ENSIPUG00000001371 ;GO:0017171
ENSIPUG00000001371 ;GO:0140096
ENSIPUG00000001374 ;GO:0005515
ENSIPUG00000001374 ;GO:0003674
ENSIPUG00000001374 ;GO:0005488
ENSIPUG00000001375 ;GO:0008152
ENSIPUG00000001375 ;GO:0008150
ENSIPUG00000001375 ;GO:0016758
The new format places related GO IDs (those with the same ENSIPUG) on the same line:
ENSIPUG00000001371 GO:0008236; GO:0008233; GO:0070011; GO:0016787; GO:0017171; GO:0140096
ENSIPUG00000001374 GO:0005515; GO:0003674; GO:0005488
ENSIPUG00000001375 GO:0008152; GO:0008150; GO:0016758
How can the old format be converted to the new one? PS: The spacing, semicolons, and accurate grouping of all terms is very important.
Things we've tried to far: We have tried using regex expressions, but cannot seem to get the correct grouping of the ENSIP*** with the GO terms.
We also used the code below then went to Excel to use the find and replace tool to remove the duplicates, after they were sorted by the ENSIP*** values. That is the current inefficient solution.
go = read.delim("go_hnh2.txt")
go$combo = paste(go$gene_id, ";", go$go_id)
gi = data.frame(go$gene_id)
gi2 = data.frame(go$go_id)
combo= data.frame (go$combo)
# merge by row names (by=0 or by="row.names")
#combo3=merge(gi, gi2, by="row.name", all=TRUE)
write.csv(combo, file="go_edit.csv")