2

I have 3 dataframes with unequal rows

df1-
T1      T2     T3
1       Joe    TTT
2       PP     YYY
3       JJ     QQQ
5       UU     OOO
6       OO     GGG

df2
X1      X2 
1       09/20/2017
2       08/02/2015
3       05/02/2000
8       06/03/1999

df3
L1       L2
1        New
6        Notsure
9        Also

The final dataframe should be like a left join of all 3 only retaining rows of df1. The matching rows are T1, X1 and L1 but with different header names. The number of rows are different in each dataframe. I couldn't find a solution for this situation. On SO, what i found was available for 2 dataframes or 3 dataframes with equal rows or same column name

    T1      T2     T3         X2            L2 
    1       Joe    TTT        09/20/2017    New
    2       PP     YYY        08/02/2015    NA
    3       JJ     QQQ        05/02/2000    NA
    5       UU     OOO        NA            NA
    6       OO     GGG        NA            NotSure

I am comparatively new in R, and couldn't find a R code for this

Joe
  • 183
  • 5
  • 16

4 Answers4

4

The idea is to put your data frames in a list, change the name of the first column, and use Reduce to merge, i.e.

Reduce(function(...) merge(..., by = 'Var1', all.x = TRUE), 
    lapply( mget(ls(pattern = 'df[0-9]+')), function(i) {names(i)[1] <- 'Var1'; i}))

which gives,

  Var1  T2  T3         X2      L2
1    1 Joe TTT 09/20/2017     New
2    2  PP YYY 08/02/2015     Old
3    3  JJ QQQ 05/02/2000    <NA>
4    5  UU OOO       <NA>    <NA>
5    6  OO GGG       <NA> Notsure
Sotos
  • 51,121
  • 6
  • 32
  • 66
  • What are the 3 dots? Do i need to put it in R the same syntax you have written?. Is there any other way apart from changing it to list and renaming the column name as I am getting the data from multiple excels, and certain other functions need to be performed on that data apart from this merge. – Joe Oct 06 '17 at 14:06
  • 1
    @sotos Darn. That is pretty close to what I was going to post. The two main differences: *1*. by= isn't necessary in `merge` (for the given example at least). *2*. You could use `setNames` in the `lapply`. Neither are necessarily improvements. – lmo Oct 06 '17 at 14:08
  • 1
    @lmo great minds... :)... good suggestions. I like to make it as readable as possible for new users hence the redundant arguments – Sotos Oct 06 '17 at 14:10
  • @Joe, have a look at [this](https://stackoverflow.com/questions/5890576/usage-of-three-dots-or-dot-dot-dot-in-functions). It will give light to `...` – Sotos Oct 06 '17 at 14:11
  • Names of the variables came out as "Var1.1, Var1.2, Var1.3..." until I changed `names(i)[1] <- 'Var1'` to `names(i)[1] <- names(i)[1]` – CrunchyTopping Feb 27 '19 at 15:24
3

using tidyverse functions, you can try:

df1 %>%
  left_join(df2, by = c("T1" = "X1")) %>%
  left_join(df3, by = c("T1" = "L1"))

which gives:

  T1  T2  T3         X2      L2
1  1 Joe TTT 09/20/2017     New
2  2  PP YYY 08/02/2015    <NA>
3  3  JJ QQQ 05/02/2000    <NA>
4  5  UU OOO       <NA>    <NA>
5  6  OO GGG       <NA> Notsure
Aramis7d
  • 2,444
  • 19
  • 25
1

1) sqldf

library(sqldf)
sqldf("select df1.*, X2, L2 
       from df1 
       left join df2 on T1 = X1 
       left join df3 on T1 = L1")

1a) Although slightly longer this variation can make it easier later when reviewing the code by making it explicit as to which source each column came from. If the data frame names were long you might want to use aliases, e.g. from df1 as a, but here we don't bother since they are short.

sqldf("select df1.*, df2.X2, df3.L2 
       from df1 
       left join df2 on df1.T1 = df2.X1 
       left join df3 on df1.T1 = df3.L1")

2) merge Using repeated merge. No packages used.

Merge <- function(x, y) merge(x, y, by = 1, all.x = TRUE)
Merge(Merge(df1, df2), df3)

2a) This could also be written using a magrittr pipeline like this:

library(magrittr)
df1 %>% Merge(df2) %>% Merge(df3)

2b) Using Reduce we can do the repeated merges like this:

Reduce(Merge, list(df1, df2, df3))

Note: The inputs in reproducible form are:

Lines1 <- "
T1      T2     T3
1       Joe    TTT
2       PP     YYY
3       JJ     QQQ
5       UU     OOO
6       OO     GGG"

Lines2 <- "
X1      X2 
1       09/20/2017
2       08/02/2015
3       05/02/2000
8       06/03/1999"

Lines3 <- "
L1       L2
1        New
6        Notsure
9        Also"

df1 <- read.table(text = Lines1, header = TRUE)
df2 <- read.table(text = Lines2, header = TRUE)
df3 <- read.table(text = Lines3, header = TRUE)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
0

With left_join() it would be something like this

  df1 = data.frame(X = c("a", "b", "c"), var1 = c(1,2, 3))

  df2 = data.frame(V = c("a", "b", "c"), var2 =c(5,NA, NA) )

  df3 = data.frame(Y = c("a", "b", "c"), var3 =c("name", NA, "age") )

# rename   
df2 = df2 %>% rename(X = V)
df3 = df3 %>% rename(X = Y)

df = left_join(df1, df2, by = "X") %>% 
    left_join(., df3, by = "X")

> df
  X var1 var2 var3
1 a    1    5 name
2 b    2   NA <NA>
3 c    3   NA  age
Edu
  • 903
  • 6
  • 17
  • Hi Edu, I don't have same number of rows, and also the header names of columns are different in each dataframe – Joe Oct 06 '17 at 14:09