
I need to merge two dataframes by two columns (participant and time), where the time values may be near matches rather than exact (within plus/minus 0.001). Below are two sample dataframes where the first three times are near matches and the rest are exact. fuzzy_join on its own doesn't work for me because these are large dataframes...

df1 <- data.frame("participant" = c("1", "1", "1", "1", "1", "1", "1", "1",
                                    "2", "2", "2", "2", "2", "2", "2", "2"),
                  "item" = c("a", "b", "c", "d", "e", "f", "g", "h",
                             "i", "j", "k", "l", "m", "n", "o", "p"),
                  "time" = c("43.565", "54.125", "65.923", "73.858", "111.123", "143.124", "255.500", "255.502",
                             "300.595", "350.252", "400.600", "511.122", "525.887", "577.752", "599.129", "601.992"))

df2 <- data.frame("participant" = c("1", "1", "1", "1", "1", "1", "1", "1",
                                    "2", "2", "2", "2", "2", "2", "2", "2"),
                  "value" = c("xyz", "hlm", "test", "nop", "test", "nop", "hlm", "test",
                              "hlm", "test", "xyz", "xyz", "test", "xyz", "nop", "xyz"),
                  "time" = c("43.566", "54.124", "65.922", "73.858", "111.123", "143.124", "255.500", "255.502",
                             "300.595", "350.252", "400.600", "511.122", "525.887", "577.752", "599.129", "601.992"))
user9974638
  • In this specific case where they are near but not exact with +/- 0.001, would it be possible to simply join on a rounded value? –  Jan 27 '22 at 16:58
  • No, because if we round to two digits, say, then the following values get mismatched: 1.005 rounds to 1.01 while 1.004 rounds to 1.00. – user9974638 Jan 27 '22 at 19:13
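
A minimal base-R sketch of that rounding pitfall (the times here are made up, not taken from df1/df2): two values can sit inside the tolerance yet land in different rounding bins, so a join on rounded values would miss them.

t1 <- 1.0049
t2 <- 1.0051
abs(t1 - t2) < 0.002           # TRUE: within the tolerance, so they should match
round(t1, 2) == round(t2, 2)   # FALSE: rounding to 2 digits puts them in different bins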

2 Answers


Why not use fuzzy_join? It looks pretty fast, and I can't think of anything faster. I had to convert the character variable "time" to numeric before the fuzzy_join call.

library(dplyr)      # for %>%
library(fuzzyjoin)

# convert the character "time" columns to numeric
df1 <- df1 %>%
    readr::type_convert()
df2 <- df2 %>%
    readr::type_convert()

fuzzy_join(df1, df2, by = "time",
           match_fun = ~ abs(.x - .y) < 0.002)

   participant.x item  time.x participant.y value  time.y
1              1    a  43.565             1   xyz  43.566
2              1    b  54.125             1   hlm  54.124
3              1    c  65.923             1  test  65.922
4              1    d  73.858             1   nop  73.858
5              1    e 111.123             1  test 111.123
6              1    f 143.124             1   nop 143.124
7              1    g 255.500             1   hlm 255.500
8              1    h 255.502             1  test 255.502
9              2    i 300.595             2   hlm 300.595
10             2    j 350.252             2  test 350.252
11             2    k 400.600             2   xyz 400.600
12             2    l 511.122             2   xyz 511.122
13             2    m 525.887             2  test 525.887
14             2    n 577.752             2   xyz 577.752
15             2    o 599.129             2   nop 599.129
16             2    p 601.992             2   xyz 601.992

EDIT

The OP asked for matching on multiple columns. For multiple pairs of columns, we can pass a vector of columns to `by` and a list of matching functions to `match_fun`, as in:

fuzzy_join(df1, df2, by = c("participant", "time"),
           match_fun = list(`==`,                          # exact match on participant
                            \(x, y) abs(x - y) < 0.002)    # near match on time
           )

   participant.x item  time.x participant.y value  time.y
1              1    a  43.565             1   xyz  43.566
2              1    b  54.125             1   hlm  54.124
3              1    c  65.923             1  test  65.922
4              1    d  73.858             1   nop  73.858
5              1    e 111.123             1  test 111.123
6              1    f 143.124             1   nop 143.124
7              1    g 255.500             1   hlm 255.500
8              1    h 255.502             1  test 255.502
9              2    i 300.595             2   hlm 300.595
10             2    j 350.252             2  test 350.252
11             2    k 400.600             2   xyz 400.600
12             2    l 511.122             2   xyz 511.122
13             2    m 525.887             2  test 525.887
14             2    n 577.752             2   xyz 577.752
15             2    o 599.129             2   nop 599.129
16             2    p 601.992             2   xyz 601.992
GuedesBF
  • Can you edit this to match by both participant and time? But the problem I've had before is that it runs out of memory. Perhaps a for loop, looping over participants? I'm bad at loops though and haven't been able to solve it. – user9974638 Jan 27 '22 at 19:14
  • Please see my edited answer for multiple columns matching. – GuedesBF Jan 27 '22 at 19:29
  • I obviously cannot reproduce the memory issues, so I am not sure how much more I can help. – GuedesBF Jan 27 '22 at 19:30
  • I appreciate your help. This would in principle work, but yeah, this dataset has 152,700 rows, so it comes up with "Error: vector memory exhausted (limit reached?)". Do you think looping the function by participant would work? Each participant only has maximally 1,500 rows. I'm not sure how to make the loop though. – user9974638 Jan 27 '22 at 19:40
  • 150k rows is definitely not too large a dataframe. I wonder why you are facing such issues, but I am not much of a big data expert. Maybe this helps: https://stackoverflow.com/questions/51295402/r-on-macos-error-vector-memory-exhausted-limit-reached – GuedesBF Jan 27 '22 at 20:11

This may be a less elegant solution, but hopefully it meets your needs and is understandable enough for you to adapt as needed.

df1$time <- as.numeric(df1$time)
df2$time <- as.numeric(df2$time)
df1$value <- NA

for(i in seq_len(nrow(df1))){
  # subset df2 to only the rows with a matching participant id
  df2_participant <- df2[df2$participant == df1$participant[i], ]
  # seq_len() guards against a participant that has no rows in df2
  for(j in seq_len(nrow(df2_participant))){
    if(abs(df2_participant$time[j] - df1$time[i]) < 0.002){
      df1$value[i] <- df2_participant$value[j]
      break  # stop at the first time within the tolerance
    }
  }
}

This uses df1 as the "master" dataframe and adds any corresponding value from df2. Once it finds a matching participant/time, it moves on to the next row of df1.
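
As a quick sanity check on the sample data above, the value column that the loop adds to df1 should line up with the fuzzy_join output from the other answer (roughly, assuming you ran the loop on the question's df1/df2):

head(df1, 3)
#   participant item   time value
# 1           1    a 43.565   xyz
# 2           1    b 54.125   hlm
# 3           1    c 65.923  test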

EDIT/UPDATE

Using a combination of my previous suggestion and the other suggestion from GuedesBF, you might get the best of both worlds. Since you have so many additional columns, I would whittle them down to only the necessary ones if you run into memory issues; you should be able to merge them back into the result easily afterwards (a sketch of that follows the code below).

library(fuzzyjoin)
library(tidyverse)

# convert the character "time" columns to numeric
df1 <- df1 %>%
  readr::type_convert()
df2 <- df2 %>%
  readr::type_convert()

participants <- unique(df1$participant)

# initialize the result with a single row of NAs (dropped at the end)
result <- data.frame('participant.x' = NA, 'item' = NA, 'time.x' = NA,
                     'participant.y' = NA, 'value' = NA, 'time.y' = NA)

# fuzzy-join one participant at a time to keep memory use down
for(i in seq_along(participants)){
  sub_df1 <- df1[df1$participant == participants[i], ]
  sub_df2 <- df2[df2$participant == participants[i], ]

  # mode = 'left' keeps unmatched rows of sub_df1, with NAs in the df2 columns
  sub_result <- fuzzy_join(sub_df1, sub_df2, by = "time",
                           match_fun = ~ abs(.x - .y) < 0.002, mode = 'left')

  result <- rbind(result, sub_result)
}

# drop the initial NA row
final_result <- result[-1, ]
final_result
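
If memory is still tight with the real data, here is a sketch of the column-trimming idea mentioned above: run the join on a slimmed-down df1 and re-attach the remaining columns afterwards by the exact keys. The names df1_small and final_with_extras are just placeholders for illustration, and this assumes participant plus (exact) time uniquely identify a row of df1.

# keep only the columns the fuzzy join actually needs
df1_small <- df1 %>% select(participant, item, time)

# ... run the participant-wise loop above on df1_small to produce final_result ...

# then re-attach the remaining df1 columns using the exact keys
final_with_extras <- final_result %>%
  left_join(select(df1, -item),
            by = c("participant.x" = "participant", "time.x" = "time"))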
ErrorJordan
  • Thanks for this, ErrorJordan. I'm sure it would work, but it's also been chugging away for hours all night and still not completed (on a mac with 15 GB RAM), and I don't understand what I should do then... This code only evaluates participant, time, and then attaches value, right? The actual dataframe I'm working with has 48 columns. Would it matter if I remove the other columns and then paste them back together later? Otherwise I'm beginning to think I need to make like 20 sub-df's splitting by groups of participants and run them all separately or something, I don't know... :-( – user9974638 Jan 28 '22 at 06:29
  • I've added an updated approach that may help with speed? I'm not an expert in optimizing speed, but this does get rid of the second for loop which should help. – ErrorJordan Jan 28 '22 at 14:58
  • This works great! But for error checking, could it be possible to insert NAs for unmatched rows? Now I seem to lose rows where I guess there isn't a match. Do I just take away na.omit() then? – user9974638 Jan 30 '22 at 19:20
  • It is hard to believe this approach would be any faster than going straight for the fuzzy join. It is just joining by participant with a for loop before the fuzzy join call, which ought to be slower. Have you checked with the actual data, @user9974638? – GuedesBF Jan 31 '22 at 12:31
  • Just published my final update. Switched the fuzzy_join to left join instead of an inner join and removed the first row (the initialized row of NAs). Since GuedesBF and I have answered the question originally asked, we should probably be done updating these answers. If you have more questions about it, you could ask a new question or probably find the answers elsewhere. Good luck! – ErrorJordan Jan 31 '22 at 16:56
  • I have tested on the actual data, and ErrorJordan's loop is quite fast. Although I haven't tested whether this is because of the loop or simply because of dropping the many columns (but since there's a by-condition on the matching, I don't see why having 70 columns would matter, since they shouldn't be evaluated anyway?). Thanks so much for the help! – user9974638 Feb 07 '22 at 14:39