R merging thousands of dataframes not giving expected number of columns

Question

I have a list of thousands of dataframes, where each dataframe has a column called "x" and a column called "go". It looks like this:

> lapply(head(genes2Vecs), head)
$A1BG
                    x         go
GO:0005576 0.13793103 GO:0005576
GO:0005615 0.05172414 GO:0005615
GO:0031982 0.24137931 GO:0031982
GO:0043227 0.25862069 GO:0043227
GO:0043230 0.05172414 GO:0043230
GO:1903561 0.03448276 GO:1903561

$A1CF
                    x         go
GO:0005488 0.11111111 GO:0005488
GO:0097159 0.06944444 GO:0097159
GO:1901363 0.06944444 GO:1901363
GO:0003676 0.05555556 GO:0003676
GO:0003723 0.04166667 GO:0003723
GO:0006139 0.13888889 GO:0006139

$AACS
                    x         go
GO:0008152 0.12500000 GO:0008152
GO:0044238 0.02173913 GO:0044238
GO:0071704 0.07065217 GO:0071704
GO:0003824 0.03804348 GO:0003824
GO:0016405 0.01630435 GO:0016405
GO:0016874 0.03260870 GO:0016874

$AARS2
                    x         go
GO:0000166 0.06930693 GO:0000166
GO:0005488 0.27722772 GO:0005488
GO:0008144 0.01980198 GO:0008144
GO:0017076 0.04950495 GO:0017076
GO:0030554 0.02970297 GO:0030554
GO:0032553 0.03960396 GO:0032553

$AATK
                    x         go
GO:0000166 0.10769231 GO:0000166
GO:0005488 0.27692308 GO:0005488
GO:0008144 0.03076923 GO:0008144
GO:0017076 0.07692308 GO:0017076
GO:0030554 0.04615385 GO:0030554
GO:0032553 0.06153846 GO:0032553

$ABAT
                     x         go
GO:0005488 0.054644809 GO:0005488
GO:0008144 0.008196721 GO:0008144
GO:0019842 0.008196721 GO:0019842
GO:0036094 0.010928962 GO:0036094
GO:0043167 0.013661202 GO:0043167
GO:0043168 0.005464481 GO:0043168

I'd like to merge them into something like this

          go A1BG       A1CF       AACS      AARS2      AATK       ABAT
1 GO:0000003   NA 0.06944444         NA         NA        NA 0.01639344
2 GO:0000049   NA         NA         NA 0.00990099        NA         NA
3 GO:0000166   NA         NA 0.03804348 0.06930693 0.1076923         NA
4 GO:0000959   NA         NA         NA 0.02970297        NA         NA
5 GO:0001101   NA         NA 0.01630435         NA        NA         NA
6 GO:0001505   NA         NA         NA         NA        NA 0.01912568

The list I have (genes2Vecs) has a length of 2584, so I'm expecting the resulting dataframe to have that many columns. I've tried two different techinques to merge this list, but both of them give me

An incorrect number of columns
Columns that aren't the names of the elements of the list

I've tried

genes <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "go", all = TRUE), 
                genes2Vecs)

and

genes <- genes2Vecs %>% reduce(full_join, by = "go")

Both of them give me columns named "x...." and have 2436 columns instead of 2584

I'm not sure why this is happening. The merge works fine for small subsets of the list. But for the 2584 dataframes, it seems to mess up somewhere. Can someone suggest a better way to do this, or give me an idea about why it's not merging the correct number of columns?

Thanks

Since you don't provide a [Minimal Reproducible Example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), I can't give you real code, but you could do this: First loop over your list and add a new column with the list element name. Then use `dplyr::bind_rows` or `data.table::rbindlist` to bind all the datasets. Finally, use `tidyr::pivot_wider` or `data.table::dcast` to convert your data from long to wide. — Vincent, Sep 08 '20 at 18:51

score 1 · Accepted Answer · answered Sep 08 '20 at 20:03

I would suggest an approach working with a list where the first element contains all gene id from all dataframes in your list. With that you can use Reduce() to merge all data. Also, if you want each column identified by the name of the element in your list you should change the names in all dataframes in the object. Here the code to complete the task:

Initially, some data taken from your question:

#Data
genes2Vecs <- list(A1BG = structure(list(x = c(0.13793103, 0.05172414, 0.24137931, 
0.25862069, 0.05172414, 0.03448276), go = c("GO:0005576", "GO:0005615", 
"GO:0031982", "GO:0043227", "GO:0043230", "GO:1903561")), class = "data.frame", row.names = c("GO:0005576", 
"GO:0005615", "GO:0031982", "GO:0043227", "GO:0043230", "GO:1903561"
)), A1CF = structure(list(x = c(0.11111111, 0.06944444, 0.06944444, 
0.05555556, 0.04166667, 0.13888889), go = c("GO:0005488", "GO:0097159", 
"GO:1901363", "GO:0003676", "GO:0003723", "GO:0006139")), class = "data.frame", row.names = c("GO:0005488", 
"GO:0097159", "GO:1901363", "GO:0003676", "GO:0003723", "GO:0006139"
)), AACS = structure(list(x = c(0.125, 0.02173913, 0.07065217, 
0.03804348, 0.01630435, 0.0326087), go = c("GO:0008152", "GO:0044238", 
"GO:0071704", "GO:0003824", "GO:0016405", "GO:0016874")), class = "data.frame", row.names = c("GO:0008152", 
"GO:0044238", "GO:0071704", "GO:0003824", "GO:0016405", "GO:0016874"
)), AARS2 = structure(list(x = c(0.06930693, 0.27722772, 0.01980198, 
0.04950495, 0.02970297, 0.03960396), go = c("GO:0000166", "GO:0005488", 
"GO:0008144", "GO:0017076", "GO:0030554", "GO:0032553")), class = "data.frame", row.names = c("GO:0000166", 
"GO:0005488", "GO:0008144", "GO:0017076", "GO:0030554", "GO:0032553"
)), AATK = structure(list(x = c(0.10769231, 0.27692308, 0.03076923, 
0.07692308, 0.04615385, 0.06153846), go = c("GO:0000166", "GO:0005488", 
"GO:0008144", "GO:0017076", "GO:0030554", "GO:0032553")), class = "data.frame", row.names = c("GO:0000166", 
"GO:0005488", "GO:0008144", "GO:0017076", "GO:0030554", "GO:0032553"
)), ABAT = structure(list(x = c(0.054644809, 0.008196721, 0.008196721, 
0.010928962, 0.013661202, 0.005464481), go = c("GO:0005488", 
"GO:0008144", "GO:0019842", "GO:0036094", "GO:0043167", "GO:0043168"
)), class = "data.frame", row.names = c("GO:0005488", "GO:0008144", 
"GO:0019842", "GO:0036094", "GO:0043167", "GO:0043168")))

We will start creating a dataframe with the gene identifiers across all list:

#Extract unique ids
df <- do.call(rbind,genes2Vecs)
df <- df[!duplicated(df$go),]
df$x <- NULL
rownames(df)<-NULL

Now, we will set the name of the list objects for all x variables. We will use a loop:

#Change the names of x variable
for(i in 1:length(genes2Vecs))
{
  names(genes2Vecs[[i]])[1] <- names(genes2Vecs[i])
}

With that done, we will create a list to allocate the single dataframe with all gene identifiers and then join with the original list:

#Feed unique elements to genes list
single <- list(df)
genes2Vecs2 <- c(single,genes2Vecs)

Finally, we merge all dataframes. The key is to use accumulate=F in order to have only one final dataframe:

#Now Merge
genes <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "go", all.x = TRUE), 
                genes2Vecs2,accumulate = F)

The output:

           go       A1BG       A1CF       AACS      AARS2       AATK        ABAT
1  GO:0000166         NA         NA         NA 0.06930693 0.10769231          NA
2  GO:0003676         NA 0.05555556         NA         NA         NA          NA
3  GO:0003723         NA 0.04166667         NA         NA         NA          NA
4  GO:0003824         NA         NA 0.03804348         NA         NA          NA
5  GO:0005488         NA 0.11111111         NA 0.27722772 0.27692308 0.054644809
6  GO:0005576 0.13793103         NA         NA         NA         NA          NA
7  GO:0005615 0.05172414         NA         NA         NA         NA          NA
8  GO:0006139         NA 0.13888889         NA         NA         NA          NA
9  GO:0008144         NA         NA         NA 0.01980198 0.03076923 0.008196721
10 GO:0008152         NA         NA 0.12500000         NA         NA          NA
11 GO:0016405         NA         NA 0.01630435         NA         NA          NA
12 GO:0016874         NA         NA 0.03260870         NA         NA          NA
13 GO:0017076         NA         NA         NA 0.04950495 0.07692308          NA
14 GO:0019842         NA         NA         NA         NA         NA 0.008196721
15 GO:0030554         NA         NA         NA 0.02970297 0.04615385          NA
16 GO:0031982 0.24137931         NA         NA         NA         NA          NA
17 GO:0032553         NA         NA         NA 0.03960396 0.06153846          NA
18 GO:0036094         NA         NA         NA         NA         NA 0.010928962
19 GO:0043167         NA         NA         NA         NA         NA 0.013661202
20 GO:0043168         NA         NA         NA         NA         NA 0.005464481
21 GO:0043227 0.25862069         NA         NA         NA         NA          NA
22 GO:0043230 0.05172414         NA         NA         NA         NA          NA
23 GO:0044238         NA         NA 0.02173913         NA         NA          NA
24 GO:0071704         NA         NA 0.07065217         NA         NA          NA
25 GO:0097159         NA 0.06944444         NA         NA         NA          NA
26 GO:1901363         NA 0.06944444         NA         NA         NA          NA
27 GO:1903561 0.03448276         NA         NA         NA         NA          NA

score 0 · Answer 2 · answered Sep 08 '20 at 20:17

I suspect something like this will work. Just stack the tables and then pivot them into a wide format.

library(tidyr)
library(dplyr)

genes2Vecs %>% 
  bind_rows(.id = "gene") %>% 
  pivot_wider("go", names_from = "gene", values_from = "x") %>% 
  arrange(go)

R merging thousands of dataframes not giving expected number of columns

2 Answers2