0

I have multiple files with many rows and three columns and need to merge them on the basis of first two columns match. File1

12   13   a
13   15   b
14   17   c
 4    9   d
 .    .   .
 .    .   .
 81   23  h

File 2

12   13   e
3    10   b
14   17   c
 4    9   j
 .    .   .
 .    .   .
 1   2    k

File 3

12   13   m
13   15   k
1     7   x
24    9   d
 .    .   .
 .    .   .
 1   2    h

and so on. I want to merge them to obtain the following result

12   13   a  e  m
13   15   b  k
14   17   c  c
 4    9   d  j
 3   10   b
24    9   d
 .    .   .
 .    .   .
 81   23  h
  1    2  k
  1    7  x
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
user231
  • 31
  • 6

1 Answers1

1

The first thing that usually comes to mind with these types of problems is merge, perhaps in conjunction with a Reduce(function(x, y) merge(x, y, by = "somecols", all = TRUE), yourListOfDataFrames).

However, merge is not always the most efficient function, especially since it looks like you want to "collapse" all the values to fill in the rows from left to right, which would not be the default merge behavior.

Instead, I suggest you stack everything into one long data.frame and reshape it after you have added an index variable.

Here are two approaches:

Option 1: "dplyr" + "tidyr"

  1. Use mget to put all of your data.frames into a list.
  2. Use rbind_all to convert that list into a single data.frame.
  3. Use sequence(n()) in mutate from "dplyr" to group the data and create an index.
  4. Use spread from "tidyr" to transform from a "long" format to a "wide" format.
library(dplyr)
library(tidyr)

combined <- rbind_all(mget(ls(pattern = "^file\\d")))
combined %>%
  group_by(V1, V2) %>% 
  mutate(time = sequence(n())) %>% 
  ungroup() %>% 
  spread(time, V3, fill = "")
# Source: local data frame [7 x 5]
# 
#   V1 V2 1 2 3
# 1  1  7 x    
# 2  3 10 b    
# 3  4  9 d j  
# 4 12 13 a e m
# 5 13 15 b k  
# 6 14 17 c c  
# 7 24  9 d  

Option 2: "data.table"

  1. Use mget to put all of your data.frames into a list.
  2. Use rbindlist to convert that list into a single data.table.
  3. Use sequence(.N) to generate your sequence by your groups.
  4. Use dcast.data.table to convert the "long" data.table into a "wide" one.
library(data.table)

dcast.data.table(
  rbindlist(mget(ls(pattern = "^file\\d")))[, 
    time := sequence(.N), by = list(V1, V2)],
  V1 + V2 ~ time, value.var = "V3", fill = "")
#    V1 V2 1 2 3
# 1:  1  7 x    
# 2:  3 10 b    
# 3:  4  9 d j  
# 4: 12 13 a e m
# 5: 13 15 b k  
# 6: 14 17 c c  
# 7: 24  9 d    

Both of these answers assume we are starting with the following sample data:

file1 <- structure(
  list(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L), 
       V3 = c("a", "b", "c", "d")), .Names = c("V1", "V2", "V3"), 
  class = "data.frame", row.names = c(NA, -4L))
file2 <- structure(
  list(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L), 
       V3 = c("e", "b", "c", "j")), .Names = c("V1", "V2", "V3"), 
  class = "data.frame", row.names = c(NA, -4L))
file3 <- structure(
  list(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L), 
       V3 = c("m", "k", "x", "d")), .Names = c("V1", "V2", "V3"), 
  class = "data.frame", row.names = c(NA, -4L))
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485