3

I need to merge two dataframes x and y which have about 50 columns in common and some unique columns, and I need to keep all the rows from x.

It works if I run:

 NewDataframe <- merge(x, y, by=c("ColumnA", "ColumnB", "ColumnC"),all.x=TRUE)

The issue is that there are more than 50 common columns, and I would rather avoid typing the names of all the common columns.

I have tried with:

 NewDataframe <- merge(x, y, all.x=TRUE)

But the following error appears:

 Error in merge.data.table(x, y, all.x = TRUE) :
 Elements listed in `by` must be valid column names in x and y

Is there any way of using by with the common columns without typing all of them, but keeping all the rows from x?

Thank you.

dede
  • 1,129
  • 5
  • 15
  • 35

3 Answers3

6

You want to merge based on all common columns. So first you need to find out which column names are common between the two dataframes.

common_col_names <- intersect(names(x), names(y))

Then you use this character vector as your by parameters in the merge function.

merge(x, y, by=common_col_names, all.x=TRUE)

Edit: after reading @Andrew Gustar's answer, I double checked the documentation for the merge function, and this is exactly the default by parameter:

## S3 method for class 'data.frame'
merge(x, y, by = intersect(names(x), names(y)), # <-- Look here
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"),
      incomparables = NULL, ...)
Community
  • 1
  • 1
zelite
  • 1,478
  • 16
  • 37
  • With all.y=T, I correctly get the values that are in dataframe Y, but it also adds all columns to the final merged dataframe, not just the common columns. All columns are passed. – PM0087 Jan 20 '21 at 17:05
5

I think you have data tables rather than simple dataframes, and merge works slightly differently between the two. You could try forcing it to use the dataframe method by using NewDataframe <- merge.data.frame(x, y, all.x=TRUE) which should by default merge on all shared column names.

Andrew Gustar
  • 17,295
  • 1
  • 22
  • 32
2

The left_join function from dplyr should give you exactly what you want.

First create some dummy data...

shared <- tibble(a=rnorm(327), b=rnorm(327),c=rnorm(327),d=rnorm(327),e=rnorm(327),f=rnorm(327),g=rnorm(327))
x <- cbind(shared, tibble(v =rnorm(327),w=rnorm(327),x=rnorm(327),y=rnorm(327),z=rnorm(327)))
y <- cbind(shared, tibble(h =rnorm(327),i=rnorm(327),j=rnorm(327),k=rnorm(327),l=rnorm(327)))
rm(shared)

Then just run the join...

dplyr::left_join(x,y)
# Joining, by = c("a", "b", "c", "d", "e", "f", "g")

As you can see, the function figures out which columns are shared between the two and automatically uses them in the join.

Andrew Brēza
  • 7,705
  • 3
  • 34
  • 40