Inexact matching by number between two data frames in R

Question

With the following code I intend to match two files containing phone calls. One file would belong to the client and the other to the provider. The match is made by the number calling (ANUM), the number receiving the call (BNUM) and the moment in time when the call starts (SECONDS). The variable SECONDS represents a specific point in time measured in seconds. That is, seconds+minutes60+hour60*60, and may also include the day if several different days are analysed.

The variable SECONDS can differ between the customer and the supplier by a maximum of X, e.g. 5 seconds. That is, a call from a number A to a number B that the customer has at a point in time that differs from the provider by a maximum of 5 seconds should be considered the same call and matched.

The output files I want to obtain are:

Matcheted calls with the columns of the client file.
Matcheted calls with the columns of the supplier's file. The objective is to join this file with the previous one (paste it in excel one next to the other, for example) and once the match is done, compare the duration of the calls and see where there is a difference in duration. That is, this file 2 has to have the same dimension as file 1.
Calls that the client has but not the provider.
Calls that the provider has but the client does not have.

The code I am presenting must have an error somewhere because when I try with examples I don't get the same dimension in file 1 and 2.

library("readxl")
library("dplyr")
library("writexl")

datavendor <- read_excel("CDRS_VENDOR_1MARCH.xlsx")
dataclient <- read_excel("CDRS_CLIENT_1MARCH.xlsx")


# Perform type conversions
dataclient$ANUM <- as.numeric(dataclient$ANUM)
datavendor$ANUM <- as.numeric(datavendor$ANUM)

dataclient$BNUM <- as.numeric(dataclient$BNUM)
datavendor$BNUM <- as.numeric(datavendor$BNUM)

dataclient$SECONDS <- as.numeric(dataclient$SECONDS)
datavendor$SECONDS <- as.numeric(datavendor$SECONDS)


# Define the function to perform join and assignment operations
match_and_assign <- function(data1, data2, time_offset) {
  for (i in 0:time_offset) {
    temp_data2 <- data2 %>%
      mutate(SECONDS = SECONDS + i)

    ok_data <- semi_join(data1, temp_data2, by = c("ANUM", "BNUM", "SECONDS"))
    anti_data1 <- anti_join(data1, temp_data2, by = c("ANUM", "BNUM", "SECONDS"))

    data1 <- anti_data1
    if (i == 0) {
      match_data <- ok_data
    } else {
      match_data <- rbind(match_data, ok_data)
    }
  }

  return(list(match_data = match_data, unmatched_data = data1))
}

# Define the maximum value of time deviation
max_time_offset <- 5

# Perform union and assignment operations with a definite function
result_vendor <- match_and_assign(datavendor, dataclient, max_time_offset)
result_client <- match_and_assign(dataclient, datavendor, max_time_offset)

# Accessing the desired results
okvendor <- result_vendor$match_data
okclient <- result_client$match_data
unmatched_vendor <- result_vendor$unmatched_data
unmatched_client <- result_client$unmatched_data

# Write the files
write_xlsx(okclient,"/Users/Downloads/Matched_client.xlsx")
write_xlsx(okvendor,"/Users/Downloads/Matched_vendor.xlsx")
write_xlsx(unmatched_client,"/Users/Downloads/Unmatched_client.xlsx")
write_xlsx(unmatched_vendor,"/Users/Downloads/Unmatched_vendor.xlsx")




dput(head(dataclient))
structure(list(`Event time` = structure(c(1677628826, 1677628847, 1677628909, 1677628880, 1677628849, 1677628874), tzone = "UTC", class = c("POSIXct", "POSIXt")), ANUM = c(20795883477, 22765981865, 21785847031, 22774133419, 22784534311, 25784234311), BNUM = c(26774521101, 20785899196, 22734269069, 22795950021, 28794493816, 22715594275), SECONDS = c(3244, 3265, 3327, 3298, 3267, 3292), HORA = c(0, 0, 0, 0, 0, 0), MINUTO = c(0, 0, 1, 1, 0, 1), SEGUNDO = c(26, 47, 49, 20, 49, 14)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))



dput(head(datavendor))
structure(list(BNUM = c(21795985477, 21785961865, 22735847031, 22774137419, 25788234311, 22724234311), ANUM = c(22776521101, 22787899196, 23764269869, 21793650021, 22794443816, 22745594275), connecttime = c("2023-03-01 00:00:37.084+00", "2023-03-01 00:00:45.091+00", "2023-03-01 00:01:03.677+00", "2023-03-01 00:01:08.593+00", "2023-03-01 00:01:15.838+00", "2023-03-01 00:01:18.033+00"), SECONDS = c(37, 45, 63, 68, 75, 78), HORA = c("00", "00", "00", "00", "00", "00"), MINUTO = c("00", "00", "01", "01", "01", "01"), SEGUNDO = c("37", "45", "03", "08", "15", "18")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

Without example data, you are calling on us to be very imaginative and perceptive about your data. Could you run `dput(head(dataclient))` and `dput(head(datavendor))` and put the output into your question? Feel free to edit the content (but please preserve the syntax) if you need to anonymize. — Jon Spring, Aug 22 '23 at 14:52
It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. It's much harder to help debug code we can't actually run. — MrFlick, Aug 22 '23 at 14:55
I imagine this can be done more simply with a non-equi join, which has been part of dplyr since v1.1.0. Something like `dataclient |> mutate(SECONDS_high = SECONDS + 5) |> left_join(datavendor, join_by(ANUM, BNUM, SECONDS <= SECONDS, SECONDS_high >= SECONDS))` — Jon Spring, Aug 22 '23 at 14:56
Hi Jon. Thank you very much for your response. I have put the output of dput into my question — CLARA, Aug 22 '23 at 16:09
@CLARA you may want to look at the data again. there's no overlap between ANUM and BNUM in the two datasets, and also SECONDS in the dataclient df appears to be wrong — Mark, Aug 22 '23 at 16:26

score 0 · Answer 1 · answered Aug 22 '23 at 16:37

This feels like a really good question for the fuzzyjoin package, which does exactly the kind of inexact merging you're interested in performing. With some demo data (as noted in the comments, the example data wasn't comprehensive enough to demonstrate the joins):

library(fuzzyjoin)
dataclient <- data.frame(SECONDS=(1:15)*10, 
                         ANUM=trunc(runif(15)*1e9), 
                         BNUM=trunc(runif(15)*1e9))
datavendor <- data.frame(SECONDS=round((0:10)*10+runif(11)*5), 
                         ANUM=trunc(runif(11)*1e9), 
                         BNUM=trunc(runif(11)*1e9))

difference_full_join(dataclient, datavendor, by="SECONDS", max_dist=5)

where we've set up the SECONDS columns to have some variance of about 5 but then merge using a max_dist of 5 so the connections are recovered:

   SECONDS.x    ANUM.x    BNUM.x SECONDS.y    ANUM.y    BNUM.y
1         10 800645539 634349609        13 193615068 900241998
2         20 995521628 722356639        25 785895029 360213103
3         30 114639543 797989587        25 785895029 360213103
4         30 114639543 797989587        35 237932417 214477707
5         40 667134090 312174350        35 237932417 214477707
6         40 667134090 312174350        45 195937151 966350245
7         50 405507121 995470385        45 195937151 966350245
8         50 405507121 995470385        53  83245561 373993374
9         60 901356423 243821657        63 293882356 178900313
10        70   1462034 824282414        72 857817749  10435715
11        80 272650450 845019055        82 917233256 214198129
12        90 461754051  36351310        92 592704539 518927375
13       100  15183958 886851646       101 438603267 884772811
14       110 490163355 483237234        NA        NA        NA
15       120 743737922 537729894        NA        NA        NA
16       130 625128402 294578106        NA        NA        NA
17       140 608192171 635717906        NA        NA        NA
18       150 262778125 993660240        NA        NA        NA
19        NA        NA        NA         4 872109025 453920802

In this case, calls that the client has but the vendor does not will appear as NA in the .y columns, while calls that the vendor has but the client does not will be NA in the .x columns.

Inexact matching by number between two data frames in R

1 Answers1