R, how to add an unlist (and other) function inside an apply function?

Question

Context: I am working with genes and ontology, but my question concerns R script writing.

I would like to replace the GO:ID in my data frame by their corresponding terms extracted form a database.

So, this is my source data frame. it is a genes list (v1) and associated GO:ID (v2):

>gene_list_and_Go_ID

         V1                                                V2
2563  Gene1    GO:0003871, GO:0008270, GO:0008652, GO:0009086
2580  Gene2    GO:0003871, GO:0008270, GO:0008652, GO:0009086
12686 Gene3    GO:0003871, GO:0008270, GO:0008652, GO:0009086
14523 Gene4                GO:0004489, GO:0006555, GO:0055114

The request to the database looks very simple:

>select(GO.db, my_Go_id, "TERM", "GOID")

I tried the following lines to address manually the database, it worked well:

>my_Go_id = unlist(strsplit("GO:0008270, GO:0008652, GO:0009086", split=", "))
>select(GO.db, my_Go_id, "TERM", "GOID")

    GOID                                     TERM
1 GO:0008270                         zinc ion binding
2 GO:0008652 cellular amino acid biosynthetic process
3 GO:0009086          methionine biosynthetic process

My problem: I cannot make this process automatic! Precisely, for each row, I need to transform each string from column n°2 in my data frame to a vector in order to question the database. And then I need to replace the GO:ID in the data frame by the result of the request.

1/ To start, I tried to put the "unlist" function in a "apply" function to my data frame:

apply(gene_list_and_Go_ID,1,unlist(strsplit(gene_list_and_Go_ID[,2], split=", ")))

I got :

Error in strsplit(ok, split = ", ") : non-character argument

2/ Then, can I add also the request to the database inside the apply function?

3/ Finally, I do not know how to replace column n°2 by the result of the database request.

This is an example of an excepted “ideal” result:

         V1                                                              V2
2563  Gene1        GOID                                                TERM
                   1 GO:0008270                            zinc ion binding
                   2 GO:0008652    cellular amino acid biosynthetic process
                   3 GO:0009086             methionine biosynthetic process

Thanks for your help.

Gavin Simpson · Answer 1 · 2016-07-19T15:16:44.300

The proximate issue is that you don't call apply like you did. Instead of writing a function call as you did, you need to provide a function that will take each row/column of the array in turn as input via its first argument, so you want something like (not tested, because you don't need this)

apply(gene_list_and_Go_ID, 1,
      function(x) { unlist(strsplit(x[2], split=", "))})

However, notice that you don't need entire rows of gene_list_and_Go_ID. What you want is to work on the V2 column of gene_list_and_Go_ID. Now also note that strsplit is vectorised, which means if you pass it a vector of length greater than 1 it will work on each element of that vector as if you'd repeatedly called strsplit() on each element of the vector in turn.

Consider the following:

df <- data.frame(V1 = paste0("Gene", 1:4),
                 V2 = c("GO:0003871, GO:0008270, GO:0008652, GO:0009086",
                        "GO:0003871, GO:0008270, GO:0008652, GO:0009086",
                        "GO:0003871, GO:0008270, GO:0008652, GO:0009086",
                        "GO:0004489, GO:0006555, GO:0055114"),
                 stringsAsFactors = FALSE)

Note that V2 needs to be a character vector --- here I used stringsAsFactors = FALSE to stop the automatic coercion character -> factor, but you could also just use as.character(V2) where I have V2 in the code below.

To run strsplit on each element of V2 we could use:

spl <- with(df, strsplit(V2, ", "))

which gets us

> spl
[[1]]
[1] "GO:0003871" "GO:0008270" "GO:0008652" "GO:0009086"

[[2]]
[1] "GO:0003871" "GO:0008270" "GO:0008652" "GO:0009086"

[[3]]
[1] "GO:0003871" "GO:0008270" "GO:0008652" "GO:0009086"

[[4]]
[1] "GO:0004489" "GO:0006555" "GO:0055114"

By the look of the select call, this is a one shot deal - you need to call it for all rows in df (your gene_list_and_Go_ID). If so, just iterate over the elements of the list returned by strsplit():

names(spl) <- with(df, as.character(V1))
term <- lapply(spl, function(x, db) select(db, x, "TERM", "GOID"),
               db = GO.db)

This will return a list where each element is the result of a call to select for a single gene / row of df.

Putting it back together you probably want:

out <- cbind.data.frame(Gene = rep(names(spl), each = lengths(spl)),
                        do.call("rbind", term))

But I can't test the last few parts as I have no idea where select() comes from nor what created GO.db

Thank you for this highly detailed answer and for your time. Many things are quite new for me (paste0, with, c and rbind, rep, lengths…). I am going to learn more about all these functions, nice! :-) — Zampaï, Jul 20 '16 at 07:24

score 0 · Answer 2 · edited May 23 '17 at 11:44

Ok, according to Gavin's answer and his kind help, I got the right script. But there was a very important step that blocked me: convert my "gene_list_and_Go_ID" data frame second column from factors to characters. I did this to skip the "non-character argument" error from the "strsplit" function. This post helped me: LINK

So here is my starting data frame:

>gene_list_and_Go_ID

         V1                                                V2
2563  Gene1    GO:0003871, GO:0008270, GO:0008652, GO:0009086
2580  Gene2    GO:0003871, GO:0008270, GO:0008652, GO:0009086
12686 Gene3    GO:0003871, GO:0008270, GO:0008652, GO:0009086
14523 Gene4                GO:0004489, GO:0006555, GO:0055114

Next, the script. The first new line appeared very useful (convert my df from factors to characters):

>gene_list_and_Go_ID <- data.frame(lapply(gene_list_and_Go_ID, as.character), stringsAsFactors=FALSE)

>V_ID <- with(gene_list_and_Go_ID, strsplit(V2, ", "))
>names(V_ID) <- with(gene_list_and_Go_ID, as.character(V1))
>terms <- lapply(V_ID, function(x, db) select(db, x, "TERM", "GOID"), db = GO.db)

Final output is perfect :-) :

> terms
$Gene1
        GOID                        TERM
1 GO:0003871 S-methyltransferase activity
2 GO:0008270 zinc ion binding
3 GO:0008652 cellular amino acid biosynthetic process
4 GO:0009086 methionine biosynthetic process

$Gene2
... etc ...
... etc ...

Note, I skipped the last Gavin's suggestion:

out <- cbind.data.frame(Gene = rep(names(spl), each = lengths(spl)),
                    do.call("rbind", term))

It may be a very elegant script but I have difficulties to understand all what it does, and here is what it generates:

Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 16, 15
In addition: Warning message:
In rep(names(V_ID), each = lengths(V_ID)) :
  first element used of 'each' argument

THX

R, how to add an unlist (and other) function inside an apply function?

2 Answers2