1

Context: I am working with genes and ontology, but my question concerns R script writing.

I would like to replace the GO:ID in my data frame by their corresponding terms extracted form a database.

So, this is my source data frame. it is a genes list (v1) and associated GO:ID (v2):

>gene_list_and_Go_ID

         V1                                                V2
2563  Gene1    GO:0003871, GO:0008270, GO:0008652, GO:0009086
2580  Gene2    GO:0003871, GO:0008270, GO:0008652, GO:0009086
12686 Gene3    GO:0003871, GO:0008270, GO:0008652, GO:0009086
14523 Gene4                GO:0004489, GO:0006555, GO:0055114

The request to the database looks very simple:

>select(GO.db, my_Go_id, "TERM", "GOID")

I tried the following lines to address manually the database, it worked well:

>my_Go_id = unlist(strsplit("GO:0008270, GO:0008652, GO:0009086", split=", "))
>select(GO.db, my_Go_id, "TERM", "GOID")

    GOID                                     TERM
1 GO:0008270                         zinc ion binding
2 GO:0008652 cellular amino acid biosynthetic process
3 GO:0009086          methionine biosynthetic process

My problem: I cannot make this process automatic! Precisely, for each row, I need to transform each string from column n°2 in my data frame to a vector in order to question the database. And then I need to replace the GO:ID in the data frame by the result of the request.

1/ To start, I tried to put the "unlist" function in a "apply" function to my data frame:

apply(gene_list_and_Go_ID,1,unlist(strsplit(gene_list_and_Go_ID[,2], split=", ")))

I got :

Error in strsplit(ok, split = ", ") : non-character argument

2/ Then, can I add also the request to the database inside the apply function?

3/ Finally, I do not know how to replace column n°2 by the result of the database request.

This is an example of an excepted “ideal” result:

         V1                                                              V2
2563  Gene1        GOID                                                TERM
                   1 GO:0008270                            zinc ion binding
                   2 GO:0008652    cellular amino acid biosynthetic process
                   3 GO:0009086             methionine biosynthetic process

Thanks for your help.

Zampaï
  • 11
  • 6

2 Answers2

2

The proximate issue is that you don't call apply like you did. Instead of writing a function call as you did, you need to provide a function that will take each row/column of the array in turn as input via its first argument, so you want something like (not tested, because you don't need this)

apply(gene_list_and_Go_ID, 1,
      function(x) { unlist(strsplit(x[2], split=", "))})

However, notice that you don't need entire rows of gene_list_and_Go_ID. What you want is to work on the V2 column of gene_list_and_Go_ID. Now also note that strsplit is vectorised, which means if you pass it a vector of length greater than 1 it will work on each element of that vector as if you'd repeatedly called strsplit() on each element of the vector in turn.

Consider the following:

df <- data.frame(V1 = paste0("Gene", 1:4),
                 V2 = c("GO:0003871, GO:0008270, GO:0008652, GO:0009086",
                        "GO:0003871, GO:0008270, GO:0008652, GO:0009086",
                        "GO:0003871, GO:0008270, GO:0008652, GO:0009086",
                        "GO:0004489, GO:0006555, GO:0055114"),
                 stringsAsFactors = FALSE)

Note that V2 needs to be a character vector --- here I used stringsAsFactors = FALSE to stop the automatic coercion character -> factor, but you could also just use as.character(V2) where I have V2 in the code below.

To run strsplit on each element of V2 we could use:

spl <- with(df, strsplit(V2, ", "))

which gets us

> spl
[[1]]
[1] "GO:0003871" "GO:0008270" "GO:0008652" "GO:0009086"

[[2]]
[1] "GO:0003871" "GO:0008270" "GO:0008652" "GO:0009086"

[[3]]
[1] "GO:0003871" "GO:0008270" "GO:0008652" "GO:0009086"

[[4]]
[1] "GO:0004489" "GO:0006555" "GO:0055114"

By the look of the select call, this is a one shot deal - you need to call it for all rows in df (your gene_list_and_Go_ID). If so, just iterate over the elements of the list returned by strsplit():

names(spl) <- with(df, as.character(V1))
term <- lapply(spl, function(x, db) select(db, x, "TERM", "GOID"),
               db = GO.db)

This will return a list where each element is the result of a call to select for a single gene / row of df.

Putting it back together you probably want:

out <- cbind.data.frame(Gene = rep(names(spl), each = lengths(spl)),
                        do.call("rbind", term))

But I can't test the last few parts as I have no idea where select() comes from nor what created GO.db

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
  • Thank you for this highly detailed answer and for your time. Many things are quite new for me (paste0, with, c and rbind, rep, lengths…). I am going to learn more about all these functions, nice! :-) – Zampaï Jul 20 '16 at 07:24
0

Ok, according to Gavin's answer and his kind help, I got the right script. But there was a very important step that blocked me: convert my "gene_list_and_Go_ID" data frame second column from factors to characters. I did this to skip the "non-character argument" error from the "strsplit" function. This post helped me: LINK

So here is my starting data frame:

>gene_list_and_Go_ID

         V1                                                V2
2563  Gene1    GO:0003871, GO:0008270, GO:0008652, GO:0009086
2580  Gene2    GO:0003871, GO:0008270, GO:0008652, GO:0009086
12686 Gene3    GO:0003871, GO:0008270, GO:0008652, GO:0009086
14523 Gene4                GO:0004489, GO:0006555, GO:0055114

Next, the script. The first new line appeared very useful (convert my df from factors to characters):

>gene_list_and_Go_ID <- data.frame(lapply(gene_list_and_Go_ID, as.character), stringsAsFactors=FALSE)

next:

>V_ID <- with(gene_list_and_Go_ID, strsplit(V2, ", "))
>names(V_ID) <- with(gene_list_and_Go_ID, as.character(V1))
>terms <- lapply(V_ID, function(x, db) select(db, x, "TERM", "GOID"), db = GO.db)

Final output is perfect :-) :

> terms
$Gene1
        GOID                        TERM
1 GO:0003871 S-methyltransferase activity
2 GO:0008270 zinc ion binding
3 GO:0008652 cellular amino acid biosynthetic process
4 GO:0009086 methionine biosynthetic process

$Gene2
... etc ...
... etc ...

Note, I skipped the last Gavin's suggestion:

out <- cbind.data.frame(Gene = rep(names(spl), each = lengths(spl)),
                    do.call("rbind", term))

It may be a very elegant script but I have difficulties to understand all what it does, and here is what it generates:

Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 16, 15
In addition: Warning message:
In rep(names(V_ID), each = lengths(V_ID)) :
  first element used of 'each' argument

THX

Community
  • 1
  • 1
Zampaï
  • 11
  • 6