2

I am hoping to determine an efficient way to convert a list of data frames into a single data frame. Below is my reproducible MWE:

set.seed(1)
ABAge = runif(100)
ABPoints = rnorm(100)
ACAge = runif(100)
ACPoints = rnorm(100)
BCAge = runif(100)
BCPoints = rnorm(100)

A_B <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ABAge, Points = ABPoints)
A_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ACAge, Points = ACPoints)
B_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = BCAge, Points = BCPoints)
A_B$ID <- as.character(A_B$ID)
A_C$ID <- as.character(A_C$ID)
B_C$ID <- as.character(B_C$ID)

listFormat <- list("A_B" = A_B, "A_C" = A_C, "B_C" = B_C)

dfFormat <- data.frame(ID = as.character(paste0("ID", 1:100)), A_B.Age = ABAge, A_B.Points = ABPoints, A_C.Age = ACAge, A_C.Points = ACPoints, B_C.Age = BCAge, B_C.Points = BCPoints)
dfFormat$ID = as.character(dfFormat$ID)

This results in a data frame format (dfFormat) that looks like this:

'data.frame':   100 obs. of  7 variables:
 $ ID        : chr  "ID1" "ID2" "ID3" "ID4" ...
 $ A_B.Age   : num  0.266 0.372 0.573 0.908 0.202 ...
 $ A_B.Points: num  0.398 -0.612 0.341 -1.129 1.433 ...
 $ A_C.Age   : num  0.6737 0.0949 0.4926 0.4616 0.3752 ...
 $ A_C.Points: num  0.409 1.689 1.587 -0.331 -2.285 ...
 $ B_C.Age   : num  0.814 0.929 0.147 0.75 0.976 ...
 $ B_C.Points: num  1.474 0.677 0.38 -0.193 1.578 ...

and a list of data frames listFormat that looks like this:

List of 3
 $ A_B:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
  ..$ Points: num [1:100] 0.398 -0.612 0.341 -1.129 1.433 ...
 $ A_C:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.6737 0.0949 0.4926 0.4616 0.3752 ...
  ..$ Points: num [1:100] 0.409 1.689 1.587 -0.331 -2.285 ...
 $ B_C:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.814 0.929 0.147 0.75 0.976 ...
  ..$ Points: num [1:100] 1.474 0.677 0.38 -0.193 1.578 ...

I am hoping to come up with an automated way to convert the dfFormat to listFormat. As can be seen in the above objects there are two main conditions:

1) If there is a common column (name and contents) in each sublist of listFormat (in these examples ID), then they are not repeated in the outputted dfFormat (in this example, it has one final ID column),

2) The rest of the column names in sublists of listFormat become columns in dfFormat and have names such that they retain their sublist name (i.e "A_B") followed by a dot and then their original column name (i.e. Age), so that it becomes (i.e. "A_B.Age") in the dfFormat.

I have tried various unlist() and sapply codes but have been unsuccessful thus far. What is an efficient way to accomplish this?

DJV
  • 4,743
  • 3
  • 19
  • 34
lavenderGem
  • 117
  • 8

4 Answers4

2

You're looking for dplyr::bind_rows:

library(dplyr)    
bind_rows(listFormat, .id = "name")

Output:

  name  ID       Age     Points
1  A_B ID1 0.2655087  0.3981059
2  A_B ID2 0.3721239 -0.6120264
3  A_B ID3 0.5728534  0.3411197
4  A_B ID4 0.9082078 -1.1293631
5  A_B ID5 0.2016819  1.4330237
6  A_B ID6 0.8983897  1.9803999
DJV
  • 4,743
  • 3
  • 19
  • 34
1

Copy listFormat to L in case we need to preserve the input, listFormat. Remove the ID column from each component of L except the first, cbind what we have left together and then fix up the name of the first column. No packages are used.

L <- listFormat
L[-1] <- lapply(L[-1], transform, ID = NULL)
DF <- do.call(cbind, L)
names(DF)[1] <- "ID"

giving:

> str(DF)
'data.frame':   100 obs. of  7 variables:
 $ ID        : chr  "ID1" "ID2" "ID3" "ID4" ...
 $ A_B.Age   : num  0.9932 0.1451 0.6166 0.0372 0.9039 ...
 $ A_B.Points: num  0.4752 0.0288 1.0548 0.6113 0.0651 ...
 $ A_C.Age   : num  0.912 0.761 0.618 0.895 0.507 ...
 $ A_C.Points: num  -0.515 -0.945 0.398 0.502 -1.021 ...
 $ B_C.Age   : num  0.7935 0.2747 0.0487 0.6307 0.3499 ...
 $ B_C.Points: num  -0.963 -1.772 1.716 -0.819 0.577 ...
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
0

One purrr and dplyr option could be:

imap(listFormat, ~ setNames(.x, paste(rep(.y, length(.x)), names(.x), sep = ".")) %>%
      rename_at(vars(ends_with("ID")), ~ "ID")) %>%
 reduce(full_join, by = "ID")

       ID    A_B.Age   A_B.Points    A_C.Age   A_C.Points    B_C.Age   B_C.Points
1     ID1 0.26550866  0.398105880 0.67371223  0.409401840 0.81425175  1.473881181
2     ID2 0.37212390 -0.612026393 0.09485786  1.688873286 0.92877723  0.677268492
3     ID3 0.57285336  0.341119691 0.49259612  1.586588433 0.14748105  0.379962687
4     ID4 0.90820779 -1.129363096 0.46155184 -0.330907801 0.74982166 -0.192798426
5     ID5 0.20168193  1.433023702 0.37521653 -2.285235535 0.97565735  1.577891795
6     ID6 0.89838968  1.980399899 0.99109922  2.497661590 0.97479246  0.596234109
7     ID7 0.94467527 -0.367221476 0.17635071  0.667066167 0.35062557 -1.173576941
8     ID8 0.66079779 -1.044134626 0.81343521  0.541327336 0.39394906 -0.155642535
9     ID9 0.62911404  0.569719627 0.06844664 -0.013399523 0.95095101 -1.918909820
10   ID10 0.06178627 -0.135054604 0.40044975  0.510108423 0.10664832 -0.195258846
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
0

Given that each data.frame has identical ID columns, in base R it's pretty easy.

as.data.frame(listFormat)

#   A_B.ID   A_B.Age A_B.Points A_C.ID    A_C.Age A_C.Points B_C.ID   B_C.Age B_C.Points
# 1    ID1 0.2655087  0.3981059    ID1 0.67371223  0.4094018    ID1 0.8142518  1.4738812
# 2    ID2 0.3721239 -0.6120264    ID2 0.09485786  1.6888733    ID2 0.9287772  0.6772685
# 3    ID3 0.5728534  0.3411197    ID3 0.49259612  1.5865884    ID3 0.1474810  0.3799627
# 4    ID4 0.9082078 -1.1293631    ID4 0.46155184 -0.3309078    ID4 0.7498217 -0.1927984
# 5    ID5 0.2016819  1.4330237    ID5 0.37521653 -2.2852355    ID5 0.9756573  1.5778918
# 6    ID6 0.8983897  1.9803999    ID6 0.99109922  2.4976616    ID6 0.9747925  0.5962341

You get an ID column for each data.frame, but this can then be easily tidied up

In case you need a more general solution for situations where the id columns of each data.frame differ, then you can do the following using library(data.table)

DTFormat = rbindlist(listFormat, idcol = T)
dcast(DTFormat, ID~.id, value.var = c('Age', 'Points'))
dww
  • 30,425
  • 5
  • 68
  • 111
  • 1
    A very easy solution, however, it may get tricky when the dfs are not ordered on the ID column. – tmfmnk Feb 16 '20 at 13:35
  • 1
    @tmfmnk - In OP the data.frames share the same IDs in the same order, so I think this works. Although your point is correct that it does not generalise to other situations which OP did not ask about. – dww Feb 16 '20 at 13:39