I have a list object called results. It contains two elements, each of which is a data frame with several columns. I would like to convert it to a single data frame that combines the columns from each element. I know that columns of different lengths can't be combined directly, so is there a way to fill in NAs for the missing observations? Here is a small portion of the list object (results):

results         
[[1]]           
         gene_name  x1                         x2
gene34556    gene1             0                0
gene11169    gene2   0.098757012                0
gene11319    gene3             0                0
gene1459     gene3             0                0
gen168232    gene5             0                0
gene2992     gene6   -1.93960816      0.042291503
gene305454   gene7             0                0
gene3280     gene8             0                0

[[2]]           
            gene_name          x1             x2
gene34556   gene1               0              0
gene11169   gene2    -3.785515694              0
gene11319   gene3               0              0
gene1459    gene4               0              0
gene2992    gene5    -2.308363477   -0.267514619
Mark K.
  • Please provide reproducible data and expected output. – MKR May 15 '18 at 18:06
  • Your question is unclear. Can you `dput(results)`? – jdobres May 15 '18 at 18:06
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick May 15 '18 at 18:07

1 Answer

It's not clear how you want to combine the observations, since the same gene names appear in both list elements. Here are a few ways to combine the elements of your list into a single data frame:

library(data.table)

result <- list(data.frame(gene_name=c("a","b","c"),
                          x1 = rnorm(3),
                          x2 = rnorm(3), 
                          row.names=c("gene34556","gene11169","gene11319"),
                          stringsAsFactors = FALSE),
               data.frame(gene_name=c("a","b","c", "x"),
                          x1 = rnorm(4),
                          x2 = rnorm(4),
                          row.names=c("gene34556","gene11169","gene11319","gene3280"),
                          stringsAsFactors = FALSE))


# combine list "vertically"
rbindlist(result)
#    gene_name         x1          x2
# 1:         a  0.3522310 -0.31057642
# 2:         b -0.7110728  1.12948383
# 3:         c -1.6032146 -0.87341353
# 4:         a -0.1599496 -1.03543084
# 5:         b -0.1081441  1.93735177
# 6:         c  0.9923114 -0.02319378
# 7:         x -0.8283895  0.72096001

# merge both data frames within the list by gene name, keeping unmatched rows:
base::merge(result[[1]], result[[2]], by="gene_name", all=TRUE)
#   gene_name       x1.x       x2.x       x1.y        x2.y
# 1         a  0.3522310 -0.3105764 -0.1599496 -1.03543084
# 2         b -0.7110728  1.1294838 -0.1081441  1.93735177
# 3         c -1.6032146 -0.8734135  0.9923114 -0.02319378
# 4         x         NA         NA -0.8283895  0.72096001
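
One thing to watch with the stacked version: rbindlist() does not keep row names, so the gene IDs are lost, and nothing records which list element a row came from. A minimal sketch of one way around this, assuming the row names are copied into a regular column first. The column names `row_name` and `source` and the seeded example data are illustrative choices, not something from the question:

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the example is repeatable
result <- list(
  data.frame(gene_name = c("a", "b", "c"), x1 = rnorm(3), x2 = rnorm(3),
             row.names = c("gene34556", "gene11169", "gene11319"),
             stringsAsFactors = FALSE),
  data.frame(gene_name = c("a", "b", "c", "x"), x1 = rnorm(4), x2 = rnorm(4),
             row.names = c("gene34556", "gene11169", "gene11319", "gene3280"),
             stringsAsFactors = FALSE)
)

# Copy the row names into an ordinary column (hypothetical name: row_name)
with_ids <- lapply(result, function(df) {
  cbind(row_name = rownames(df), df, stringsAsFactors = FALSE)
})

# idcol adds a column holding the position of each element in the list;
# fill = TRUE pads with NA if the elements do not share all columns
stacked <- rbindlist(with_ids, idcol = "source", fill = TRUE)
```

Here `stacked` has one row per original row, with `source` telling the two list elements apart.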

If the data frames in the list need to be merged on their row names, use by=0 (equivalently, by="row.names"):

# merge both data frames within the list by row names:
base::merge(result[[1]], result[[2]], by=0, all=TRUE)
#   Row.names gene_name.x       x1.x       x2.x gene_name.y       x1.y        x2.y
# 1 gene11169           b -0.1694079  2.1168323           b  2.0969813  0.82247288
# 2 gene11319           c  1.5375766 -1.4373368           c  2.0990688 -0.06107935
# 3  gene3280        <NA>         NA         NA           x  0.2528695  1.66448111
# 4 gene34556           a -0.5648451 -0.4891148           a -0.1783414  0.10531560
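
Merging on row names moves them into a `Row.names` column and appends `.x`/`.y` to the clashing column names. If the gene IDs should end up as row names again, and the suffixes should say which list element a column came from, something like the following sketch could work. The suffix strings `.list1` and `.list2` and the seeded example data are assumptions made for illustration; dropping `gene_name` follows the asker's remark that this column is not needed.

```r
set.seed(1)  # arbitrary seed, only so the example is repeatable
result <- list(
  data.frame(gene_name = c("a", "b", "c"), x1 = rnorm(3), x2 = rnorm(3),
             row.names = c("gene34556", "gene11169", "gene11319"),
             stringsAsFactors = FALSE),
  data.frame(gene_name = c("a", "b", "c", "x"), x1 = rnorm(4), x2 = rnorm(4),
             row.names = c("gene34556", "gene11169", "gene11319", "gene3280"),
             stringsAsFactors = FALSE)
)

# Keep only the measurement columns; row names are preserved by the subset
vals <- lapply(result, function(df) df[, c("x1", "x2"), drop = FALSE])

# Full outer join on row names with descriptive suffixes
merged <- merge(vals[[1]], vals[[2]], by = "row.names", all = TRUE,
                suffixes = c(".list1", ".list2"))

# Move the key column back into the row names
rownames(merged) <- as.character(merged$Row.names)
merged$Row.names <- NULL
```

A gene that is missing from one element (here `gene3280` in the first) gets NA in that element's columns.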

EDIT:

In case of multiple dataframes in the list:

result <- list(data.frame(gene_name=c("a","b","c"),
                          x1 = rnorm(3),
                          x2 = rnorm(3), 
                          row.names=c("gene34556","gene11169","gene11319"),
                          stringsAsFactors = FALSE),
               data.frame(gene_name=c("a","b","c", "x"),
                          x1 = rnorm(4),
                          x2 = rnorm(4),
                          row.names=c("gene34556","gene11169","gene11319","gene3280"),
                          stringsAsFactors = FALSE),
               data.frame(gene_name=c("a","c", "x"),
                          x1 = rnorm(3),
                          x2 = rnorm(3),
                          row.names=c("gene34556","gene11319","gene3280"),
                          stringsAsFactors = FALSE))

# add rownames as a column
new.result <- lapply(result, function(x) cbind(row_name = rownames(x), x, stringsAsFactors = FALSE))

# merge using base merge() function (Reduce() applies it pairwise across the list)
Reduce(function(df1, df2) merge(df1, df2, by = 'row_name', all = TRUE), new.result)

# The result is a data frame:
#    row_name gene_name.x        x1.x       x2.x gene_name.y       x1.y       x2.y gene_name         x1         x2
# 1 gene11169           b  0.80895379 0.02031943           b -0.3121325  0.7952539      <NA>         NA         NA
# 2 gene11319           c -1.20666887 1.05976176           c  0.4624013 -0.2617053         c  1.6058288  1.5488336
# 3 gene34556           a -0.01044742 0.11722414           a -0.2593305  1.2252805         a  0.8526598  0.2695985
# 4  gene3280        <NA>          NA         NA           x  1.0222144  1.6846108         x -0.1128416 -0.4463099

# For large datasets, full_join() from the dplyr package might be faster:
library(dplyr)  # provides full_join() and the %>% pipe
new.result %>%
  Reduce(function(df1,df2) full_join(df1,df2, by='row_name'), .)
#    row_name gene_name.x       x1.x        x2.x gene_name.y       x1.y       x2.y gene_name         x1         x2
# 1 gene34556           a  0.8141012 -0.27145107           a -0.1113020 -0.1708712         a -0.4537174 -1.0222622
# 2 gene11169           b -0.2260749  0.09578933           b -1.7803083 -0.9246307      <NA>         NA         NA
# 3 gene11319           c  2.3439445 -1.11945962           c  0.3269329 -1.6452048         c -1.0486770  0.5048081
# 4  gene3280        <NA>         NA          NA           x -1.7521306  0.7690779         x -1.3238697  0.4762742
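
With many list elements (the asker mentions around 100), repeated merging runs into a naming problem: merge() only adds `.x`/`.y` suffixes when names clash, so from the fourth data frame on the result can contain duplicated column names. A sketch of one workaround, assuming it is acceptable to rename the value columns with the position of their list element before merging. The `_1`, `_2`, ... naming scheme and the generated test data are illustrative assumptions:

```r
set.seed(42)  # arbitrary seed, only so the example is repeatable
genes <- c("gene34556", "gene11169", "gene11319", "gene3280")

# Hypothetical input: five data frames, each missing one of the four genes
result <- lapply(1:5, function(i) {
  keep <- genes[-(((i - 1) %% length(genes)) + 1)]
  data.frame(x1 = rnorm(length(keep)), x2 = rnorm(length(keep)),
             row.names = keep)
})

# Tag each value column with the index of its list element, then expose
# the row names as a key column
prepared <- lapply(seq_along(result), function(i) {
  df <- result[[i]]
  names(df) <- paste0(names(df), "_", i)
  cbind(row_name = rownames(df), df, stringsAsFactors = FALSE)
})

# Full outer join across all elements; genes absent from an element get NA
merged <- Reduce(function(a, b) merge(a, b, by = "row_name", all = TRUE),
                 prepared)
```

Here `merged` has one row per gene ID and columns `x1_1`, `x2_1`, ..., `x2_5`, so every column can be traced back to its list element.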
Katia
  • I want to combine both elements in the list the way you did with the merge function. I would also like to use the row names of the list elements as the (unique) key column for merging them. Please see what I adjusted above. Is there any way to keep the column names as they are after merging? – Mark K. May 15 '18 at 19:16
  • @Mark, it's not clear what you want as a result. How do you want to incorporate the row names of the original data into the column names of the output? Also, what should happen when one dataset has 3 rows with gene2 and the other has 1 row with gene2? It would be much easier to understand what you want if you also gave the desired output for the input you provided. – Katia May 15 '18 at 19:24
  • Katia, what I want to do is as follows: I need to merge the two lists inside the main list into one data frame. I want to merge them using the first column (the list row names). The gene name column is not important to me; that's why I asked to merge them by the list row names. – Mark K. May 15 '18 at 19:30
  • OK. Let me do this. – Katia May 15 '18 at 19:36
  • @Mark: I added this to my answer – Katia May 15 '18 at 19:42
  • Katia, it worked well. The other question now: how can we generalize this when there are more than two lists inside the main list (e.g., 100 lists)? – Mark K. May 15 '18 at 19:49
  • @Mark, If this answers your main question on this page, please "accept" it as an answer. I will try to "generalize" it and will send you the directions. – Katia May 15 '18 at 20:00
  • `new.result <- lapply(result, FUN=function(x){y=cbind(row_name=rownames(x),x)})` `new.result %>% Reduce(function(df1,df2) merge(df1,df2, by='row_name', all=TRUE), .)` – Katia May 15 '18 at 20:15
  • There is also the full_join() function from dplyr, which does the same but might be faster for very large datasets. The output will need to be converted back to a data frame if you need a data frame object back. – Katia May 15 '18 at 20:17
  • @Mark. I also updated the answer where I provide both approaches - using merge and full_join. It looks like you are working with genes and I suspect your datasets are large. In this case full_join might work better for you. – Katia May 15 '18 at 20:24