merge/join multiple dataframe based on column ID

Question

dataframe 1 :

    ID          species          sites   count
    411     Androsace halleri    A24C     2
    785     Bartsia alpina       A28B     1
    222     Carex cespitosa      C97Z     3
    125     Cicuta viros         D47S     1

dataframe 2 :

    ID          species          YEAR    observer   sites
    411     Androsace halleri    2018     James     A24C     
    785     Bartsia alpina       2019     Nikos     P89I
    222     Carex cespitosa      2018     Enzo      Y54K     
    125     Cicuta viros        2020     David     D47S

wanted dataframe :

    ID          species          YEAR    observer   sites  count  sites.y
    411     Androsace halleri    2018     James     A24C     2 
    785     Bartsia alpina       2019     Nikos     A28B     1     P89I           
    222     Carex cespitosa      2018     Enzo      C97Z     3     Y54K           
    125     Cicuta viros         2020     David     D47S     1

I did this:

wanted_dataset <- dataset_1 %>% 
  full_join(dataset_2, by="ID")

but I get something like this :

ID         species.x       sites.x  count   species.y           YEAR    observer    sites.y
411     Androsace halleri   A24C      2      Androsace halleri  2018    James        A24C
785     Bartsia alpina      A28B      1      Bartsia alpina     2019    Nikos        P89I
222     Carex cespitosa     C97Z      3      Carex cespitosa    2018    Enzo         Y54K
125     Cicuta viros        D47S      1      Cicuta viros       2020    David        D47S

and I don't want the columns "species" to be repeated if all the rows of the dataframe 1 match with all the rows of the dataframe 2.

I'm aware I can do this:

 wanted_dataset <- test1 %>% 
    full_join(test2, by=c("ID", "species"))

but my actual problem concerns 14 dataframes containing each 30 columns where one third of them share the sames names between the dataframes so I was looking for a way to match and merge automatically columns that have the same names and have the same rows.

(tell me if I'm not clear I would try to better explain my problem)

If you omit the `by` argument, the join function will use all columns with the same names in both data.frames for the join... — dario, Oct 20 '21 at 09:05
Relevant post if merging multiple dataframes: https://stackoverflow.com/q/8091303/680068 — zx8754, Oct 20 '21 at 09:09
Note that, in your example case, `sites` will be used as well to join the data.frames, you will hence have 2 rows for any entry with same ID and species but different sites — Cath, Oct 20 '21 at 09:11
@Cath dario's answer enabled me to achieve what I wanted but I'm curious: do you know if there is a way to do what I wanted in my question without getting 2 rows for any entry with same ID and species but different sites ? — zakros, Oct 20 '21 at 09:35
@zakros I posted an answer with a way, but potentially time consuming. I wonder if the best way (more efficient) way to go wouldn't be to merge all by ID and then delete potential duplicated columns, like I suggest at the end of my post. — Cath, Oct 20 '21 at 09:41

Cath · Answer 1 · 2021-10-20T09:46:22.490

One way to go, but which can be very time consuming if you have "too many" data.frame is to create a function to check which columns are totally duplicates and merge by them. Then you can put your data.frames in a list and use your function on all data.frames with Reduce on that list.

mergeColOK <- function(x1, x2){
    # get common names
    colComm <- intersect(names(x1), names(x2))
    # combine all common column names 2 by 2
    combCol <- combn(colComm, 2)
    # for each pair of column names, check if all data are the same (if same value for first column, then same value for second columns
    colOK <- apply(combCol, 2, 
                     function(nomCols)all(x1[x1[[nomCols[1]]] %in% x2[[nomCols[1]]], nomCols]==x2[x2[[nomCols[1]]] %in% x1[[nomCols[1]]], nomCols]))
    # get all columns
      colToUse <- unique(as.vector(combCol[, colOK]))
    # finally, merge your data
      merge(x1, x2, by=colToUse, all=TRUE) # I prefer to use base R but `dplyr` join would work the same
}

the function on your 2 data.frames:

mergeColOK(df1, df2)
#         ID   species sites.x count YEAR observer sites.y
#1 Androsace   halleri    A24C     2 2018    James    A24C
#2   Bartsia    alpina    A28B     1 2019    Nikos    P89I
#3     Carex cespitosa    C97Z     3 2018     Enzo    Y54K
#4    Cicuta     viros    D47S     1 2020    David    D47S

Another option would be to merge only by ID and then check for complete duplicated columns and remove them (there are posts on SO explaining how to do that: Delete Redundant columns in R ; Identifying duplicate columns in a dataframe)

Thanks for the detailled reply ! I have too many dataframes to use your function but it may be useful for other data cleaning opeartions. — zakros, Oct 20 '21 at 09:53
@zakros if you already have a subset of column names to work with (in your example, "ID", "sepcies" and "sites), you might want to include them in the function as the "ensembl" to check from, this might lower the time needed to run on all data.frames — Cath, Oct 20 '21 at 11:22

U13-Forward · Answer 2 · 2021-10-20T09:18:40.983

0

Use setdiff:

wanted_dataset <- dataset_1 %>% 
  full_join(dataset_2, by=setdiff(colnames(.), "sites"))

edited Oct 20 '21 at 09:18

answered Oct 20 '21 at 09:08

U13-Forward

69,221
14
89
114

this won't give desired output – Cath Oct 20 '21 at 09:15
1

still will not fulfill OP needs: they have lots of data.frames and don't want to specify columns to merge by each time (reading carefully the Q before answering may be useful + checking answer: as all columns from first data.frame are not in second data.frame, your code gives an error) – Cath Oct 20 '21 at 09:19

score 0 · Answer 3 · answered Oct 20 '21 at 09:22

0

I assume that the shared column are the same for all the dataframe, you can extract these column names and store them in a variable then do the join and use this variable for the by argument.

For example:

dat1 <- data.frame(A = LETTERS[1:5], B = 1:5, C = 5:9)                
dat2 <- data.frame(A = LETTERS[2:6], B = 2:6, D = 5:9)                

shared <- names(dat1)[names(dat1) %in% names(dat2)]
print(shared)

[1] "A" "B"

#join by A and B
result = inner_join(dat1, dat2, by = shared)

print(result)

	A	B	C	D
1	B	2	6	5
2	C	3	7	6
3	D	4	8	7
4	E	5	9	8

answered Oct 20 '21 at 09:22

Glastos

47
10

This is just useless supplementary steps to do the same as not specifying `by` argument – Cath Oct 20 '21 at 09:26
Thanks for your answer Glastos but it would be far too time consuming for my case. I used dario's answer in the comment :) – zakros Oct 20 '21 at 09:29
I totally agree with you @Cath but this can give you better control if you want to check by which column the jointure will be done. It can be useful if you are dealing with big data as join can take some time in this case. – Glastos Oct 20 '21 at 09:30

merge/join multiple dataframe based on column ID

3 Answers3