Merging multiple dataframes by 3 common columns in R

Question

I have 3 data frames that I am trying to merge/join. I've tried these two solutions: Merge multiple data.frames in R with varying row length, Merge data.frames with duplicates. However, the output datatables aren't what I'm looking for.

Here is a sample code of my dataframes:

df1 <- data.frame(FzL = c(594.4014, 594.4147, 594.4148, 594.4194, 594.3877, 618.8600), task = c("hop", "hop", "hop", "vj", "vj", "vj"), 
                    limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))
df2 <- data.frame(FzR = c(594.2836, 619.1613, 618.8364, 594.4196, 694.3853, 640.2640), task = c("hop", "hop", "hop", "vj", "vj", "vj"), 
                    limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))
df3 <- data.frame(Frame = c(219388, 219389, 219390, 211387, 211388, 211389), Time = c("2020-06-05 13:26:39", "2020-06-05 13:26:39", "2020-06-05 13:26:39",
       "2020-06-05 13:26:39", "2020-06-05 13:26:39", "2020-06-05 13:26:39"),
       task = c("hop", "hop", "hop", "vj", "vj", "vj"), limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))

When trying to merge with this code:

 JOIN <- merge(df3, merge(df1, df2, by = c("task", "limb", "trial"), all = TRUE), by = c("task", "limb", "trial"), all = TRUE)

I get a table that duplicates the rows multiple times. I've also tried this code:

run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))

L <- list(df1, df2, df3)
L2 <- lapply(L, function(x) cbind(x, run.seq = run.seq(x$limb)))

out <- Reduce(function(...) merge(..., all = TRUE), L2)

However, it only gives me the first 3 rows and doesn't run through the whole dataset.

My final data table should have 7 columns: task, limb, trial, FzL, FzR, Frame, Time.

Any help would be much appreciated! Thank you.

When I try: out <-Reduce(merge, L) instead of L2 it gives me a table that duplicates rows as well. — mpvalenc, Jun 25 '20 at 04:47
Why do you need `L2` ? What is the expected output that you are looking for? — Ronak Shah, Jun 25 '20 at 04:59
@RonakShah I responded to your answer on my other post https://stackoverflow.com/questions/62482411/joining-data-in-lists-of-list, same issue I just tried do it another way this time — mpvalenc, Jun 25 '20 at 05:29

Chris · Accepted Answer · 2020-06-25T19:29:58.643

2

In a merge the function does not know which FzL value corresponds to which FzR value. As a result it will create each possible combination.

If the dataframes are in the same exact order across all three (i.e. the first row of df1 for FzL of 594.4014 corresponds to the first row of df2 for FzR of 594.2836), then you can instead bind columns to join them together (only if you are certain that each row corresponds to the same rows in the other dataframes).

In that case a column bind may be what you are looking for, given there are the same number of rows and identifiers in each dataframe in this example.

# Base R
df <- cbind(df1,
            subset(df2, select = c("FzR")),
            subset(df3, select = c("Frame", "Time")))

# Tidyverse
library(dplyr)
df <- df1 %>% 
  bind_cols(df2 %>% select(FzR)) %>% 
  bind_cols(df3 %>% select(Frame, Time))

Update after comment that df3 has different number of rows:

Another option is to still merge, but if the order is the same across all dataframes can use the row number to show which row corresponds to which. This is an easier route with one dataframe that has fewer rows.

library(dplyr)

df1 <- df1 %>% 
  mutate(id = row_number())
df2 <- df2 %>% 
  mutate(id = row_number())
df3 <- df3 %>% 
  mutate(id = row_number())

df <- df1 %>% 
  full_join(df2) %>% 
  full_join(df3)

edited Jun 25 '20 at 19:29

answered Jun 25 '20 at 07:15

Chris

266
1
6

in my actual dataset, Frame and Time columns don't have as many rows as FzR and FzL but each row does correspond to the other corresponding row in another dataframe/column. Would this solution still work? Also, can you explain the difference between the cbind versus bind_cols solutions? I got the same output for each. Thanks! – mpvalenc Jun 25 '20 at 17:14
Yes I think would still work as the corresponding order is still correct. Will adjust the comment on the two solutions - it's just two ways of doing the same thing (base R and tidyverse way)... – Chris Jun 25 '20 at 18:06
I received an error: "Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 18400, 911" because the Frame and Time columns don't have as many rows in my real dataset. Is there a solution to work around this issue? Thanks! – mpvalenc Jun 25 '20 at 18:17
You would have to expand that dataframe to then match the categories as df1 and df2, but with NAs for Time and Frame. Getting a little hacky, but again if previous rows correspond then should work - will edit the answer with an attempt – Chris Jun 25 '20 at 18:43
I'd like NA's to be filled in the Frame and Time column until the rows match again at vj task, R limb, and trial2 trial columns – mpvalenc Jun 25 '20 at 18:57
basically I'd have to insert NA's into the Frame and Time column to match the exact number of rows as the FzR and FzL columns? Fairly new to R, I appreciate the help! – mpvalenc Jun 25 '20 at 19:05
Yes that's exactly what you'd have to do - as you've seen from the last error need to have the same number of rows so if you don't have the data could fill with NAs and should work – Chris Jun 25 '20 at 19:08

score 0 · Answer 2 · answered Jun 25 '20 at 08:13

Here is a slightly longer solution such that each value of the FzL and FzR variable corresponds to the given row number and there are no repetition of values. It has been done using the dplyr package.

library(dplyr)
df1 <- data.frame(FzL = c(594.4014, 594.4147, 594.4148, 594.4194, 594.3877, 618.8600), task = c("hop", "hop", "hop", "vj", "vj", "vj"), 
                  limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))
df2 <- data.frame(FzR = c(594.2836, 619.1613, 618.8364, 594.4196, 694.3853, 640.2640), task = c("hop", "hop", "hop", "vj", "vj", "vj"), 
                  limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))
df3 <- data.frame(Frame = c(219388, 219389, 219390, 211387, 211388, 211389), Time = c("2020-06-05 13:26:39", "2020-06-05 13:26:39", "2020-06-05 13:26:39",
                                                                                      "2020-06-05 13:26:39", "2020-06-05 13:26:39", "2020-06-05 13:26:39"),
                  task = c("hop", "hop", "hop", "vj", "vj", "vj"), limb = c("L", "L", "L", "R", "R", "R"), trial = c("trial1", "trial1", "trial1", "trial2", "trial2", "trial2"))

df4 <- df1 %>% 
    left_join(df2, by = c("FzL" = "FzR"))
df4 <- df4[,-c(5:7)]
df4 <- df4 %>% 
    mutate(FzR = df2[ ,1])

df5 <- df4 %>% 
    left_join(df3, by = c("FzL" = "Frame"))
df5 <- df5[,-c(6:9)]
df5 <- df5 %>% 
    mutate(Frame = df3[ ,c(1)],
           Time = df3[ ,c(2)])
df5 <- df5 %>% 
    rename(task = task.x, limb = limb.x, trial = trial.x,) %>% 
    select(task, limb, trial, FzL, FzR, Frame, Time)
df5

The output is as follows:-

task   limb  trial      FzL      FzR  Frame                Time
1  hop    L trial1 594.4014 594.2836 219388 2020-06-05 13:26:39
2  hop    L trial1 594.4147 619.1613 219389 2020-06-05 13:26:39
3  hop    L trial1 594.4148 618.8364 219390 2020-06-05 13:26:39
4   vj    R trial2 594.4194 594.4196 211387 2020-06-05 13:26:39
5   vj    R trial2 594.3877 694.3853 211388 2020-06-05 13:26:39
6   vj    R trial2 618.8600 640.2640 211389 2020-06-05 13:26:39

Merging multiple dataframes by 3 common columns in R

2 Answers2