5

I have lists of variable length with dataframes. I want to merge the dfs in each list into a single df using a specified column name or index that varies by df. Here's an example with 3 dfs

my.list <- list(
data.frame(a = 1:10, b = letters[1:10], c = 101:110),
data.frame(d = 6:15, e = letters[1:10], f = 1:10),
data.frame(l = 2:11, m = letters[11:20], o = 1:10))

and I want to merge by a specific column of each df mentioned in ids

ids <- c('a', 'f', 'l')

to get something that looks like

id  b   c   d   e   m   o
1   a   101 6   a   NA  NA
2   b   102 7   b   k   1
3   c   103 8   c   l   2
4   d   104 9   d   m   3
5   e   105 10  e   n   4
6   f   106 11  f   o   5
7   g   107 12  g   p   6
8   h   108 13  h   q   7
9   i   109 14  i   r   8
10  j   110 15  j   s   9
11  NA  NA  NA  NA  t   10

I've tried to do this with merge and/or Reduce, but failed to pass on the ids

Lukas
  • 655
  • 8
  • 20
  • 1
    Not clear about the `by` variables. You have column names different in each dataset. Are you saying that alll the dataset should be merged by all the column sin 'ids' – akrun Jul 21 '17 at 12:25
  • yes, I edited the question a bit. Essentially I know the by variables (ids) but they differ among the dfs – Lukas Jul 21 '17 at 12:28
  • Can one dataframe have both a and f columns? – zx8754 Jul 21 '17 at 12:30
  • 1
    yes. that wasn't clear from my question – Lukas Jul 21 '17 at 12:31
  • 2
    and which column shall be used if there are for example a and f present in a dataframe? – talat Jul 21 '17 at 12:33
  • @docendo discimus: the column mentioned in ids – Lukas Jul 21 '17 at 12:49
  • There are some discussions about how to read `ids`. My understanding is that the first element of `ids` specifies the column in the first data.frame to merge on, the second element of `ids` belongs to the second data.frame in the list, and so forth. (Which means that `ids` must have exact the same number of elements as there are data.frames in the list). Is this the correct interpretation? – Uwe Jul 21 '17 at 16:38

5 Answers5

7

We can change the names unique to all the list elements by changing the column name that corresponds to 'ids' with 'id' and then do the Reduce with merge

lst <- Map(function(x, y) {names(x)[match(y, names(x))] <- 'id'; x}, my.list, ids)
Reduce(function(...) merge(..., by = 'id', all = TRUE), lst)
#   id    b   c  d    e    m  o
#1   1    a 101  6    a <NA> NA
#2   2    b 102  7    b    k  1
#3   3    c 103  8    c    l  2
#4   4    d 104  9    d    m  3
#5   5    e 105 10    e    n  4
#6   6    f 106 11    f    o  5
#7   7    g 107 12    g    p  6
#8   8    h 108 13    h    q  7
#9   9    i 109 14    i    r  8
#10 10    j 110 15    j    s  9
#11 11 <NA>  NA NA <NA>    t 10
akrun
  • 874,273
  • 37
  • 540
  • 662
  • dataframes can have more than 1 matching column names, see comments. – zx8754 Jul 21 '17 at 12:39
  • 1
    @zx8754 THe solution is based on the post `and I want to merge by a specific column of each df mentioned in ids` If there are more than 1 matching names, the question and the example should be changed – akrun Jul 21 '17 at 12:39
  • Agree, I had similar solution in mind, but as we can see question is evolving unfortunately. – zx8754 Jul 21 '17 at 12:40
  • @zx8754 If there are more than matches, then 'ids' should be a `list` specific for each `list` element. Not sure how that example turns out – akrun Jul 21 '17 at 12:42
6

Here is a data.table answer with a similar approach as @akrun's answer.

However, instead of renaming the columns, we'll set them as keys. Then we can merge by keys, rather than by name. This preserves the column names.

library(data.table)

funky <- function(x) {
  setDT(my.list[[x]])
  setkeyv(my.list[[x]], ids[x])
  return(NULL)
}

So this function will be passed an index x. First, it will set the data.frame in the xth position of my.list to data.table. Then, it will set the key of this new data.table based on the column name specified at the same position in ids. Finally, since this is all done in place, return NULL to prevent useless printout.

Now apply the function to all of the objects in the list.

a <- lapply(seq_along(ids), funky)
Reduce(function(x, y) merge(x, 
                            y, 
                            by.x = key(x), 
                            by.y = key(y), 
                            all = TRUE), 
       my.list)

Unpacking the Reduce, we can specify the columns to merge by using key(x) and key(y). This is the step that lets us avoid modifying the column names.

#      a  b   c  d  e  m  o
#  1:  1  a 101  6  a NA NA
#  2:  2  b 102  7  b  k  1
#  3:  3  c 103  8  c  l  2
#  4:  4  d 104  9  d  m  3
#  5:  5  e 105 10  e  n  4
#  6:  6  f 106 11  f  o  5
#  7:  7  g 107 12  g  p  6
#  8:  8  h 108 13  h  q  7
#  9:  9  i 109 14  i  r  8
# 10: 10  j 110 15  j  s  9
# 11: 11 NA  NA NA NA  t 10
Eric Watt
  • 3,180
  • 9
  • 21
  • I believe this is the only correct answer so far because it implements the request of the OP _I want to merge by a specific column of each df mentioned in ids_. – Uwe Jul 21 '17 at 13:55
5

An idea could be to convert the columns of interest into rownames and then merge on rownames, i.e.

l1 <- Map(function(x, y) {rownames(x) <- x[[y]]; x}, my.list, ids)
Reduce(function(x, y)merge(x, y, all = TRUE), lapply(l1, function(x) 
                                                 data.frame(x, id = rownames(x))))

#   id  a    b   c  d    e  f  l    m  o
#1   1  1    a 101  6    a  1 NA <NA> NA
#2  10 10    j 110 15    j 10 10    s  9
#3   2  2    b 102  7    b  2  2    k  1
#4   3  3    c 103  8    c  3  3    l  2
#5   4  4    d 104  9    d  4  4    m  3
#6   5  5    e 105 10    e  5  5    n  4
#7   6  6    f 106 11    f  6  6    o  5
#8   7  7    g 107 12    g  7  7    p  6
#9   8  8    h 108 13    h  8  8    q  7
#10  9  9    i 109 14    i  9  9    r  8
#11 11 NA <NA>  NA NA <NA> NA 11    t 10
Sotos
  • 51,121
  • 6
  • 32
  • 66
3

@Frank made a comment which made me think about a simple, straightforward loop:

# initialise result
result <- my.list[[1L]]
# add/merge remaining data.frames from list using the given column in ids to merge on
for (i in tail(seq_along(my.list), -1L)) {
  result <- merge(result, my.list[[i]], by.x = ids[1L], by.y = ids[i], all = TRUE)
}
result
    a    b   c  d    e    m  o
1   1    a 101  6    a <NA> NA
2   2    b 102  7    b    k  1
3   3    c 103  8    c    l  2
4   4    d 104  9    d    m  3
5   5    e 105 10    e    n  4
6   6    f 106 11    f    o  5
7   7    g 107 12    g    p  6
8   8    h 108 13    h    q  7
9   9    i 109 14    i    r  8
10 10    j 110 15    j    s  9
11 11 <NA>  NA NA <NA>    t 10

This approach doesn't require to rename a single column of any of the data.frames in the list before the merges. However, to be consistent with the expected result of the OP, the id column may be renamed afterwards:

tmp <- colnames(result)
colnames(result) <- replace(tmp, tmp == ids[1L], "id")
result
   id    b   c  d    e    m  o
1   1    a 101  6    a <NA> NA
2   2    b 102  7    b    k  1
3   3    c 103  8    c    l  2
4   4    d 104  9    d    m  3
5   5    e 105 10    e    n  4
6   6    f 106 11    f    o  5
7   7    g 107 12    g    p  6
8   8    h 108 13    h    q  7
9   9    i 109 14    i    r  8
10 10    j 110 15    j    s  9
11 11 <NA>  NA NA <NA>    t 10

Note the OP has pointed out several times that the ids vector contains the name of the column to be merged for each of the data.frames:

I want to merge by a specific column of each df mentioned in ids, and Essentially I know the by variables (ids) but they differ among the dfs

Therefore, I'm afraid the answers using match() might be wrong.

Uwe
  • 41,420
  • 11
  • 90
  • 134
-2

to merge something i can advice you to use the command sqldf from the package sqldf and you can do it like this:

A = data.frame(a = 1:10, b = letters[1:10], c = 101:110)
B = data.frame(d = 6:15, e = letters[1:10], f = 1:10)
C = data.frame(l = 2:11, m = letters[11:20], o = 1:10)
joined_df <- sqldf('select A.*,B.*,C.* from A left join B on A.a=B.f left join C on A.a=C.l')
Orhan Yazar
  • 909
  • 7
  • 19