With the following code I intend to match two files containing phone calls. One file would belong to the client and the other to the provider. The match is made by the number calling (ANUM), the number receiving the call (BNUM) and the moment in time when the call starts (SECONDS). The variable SECONDS represents a specific point in time measured in seconds. That is, seconds+minutes60+hour60*60, and may also include the day if several different days are analysed.
The variable SECONDS can differ between the customer and the supplier by a maximum of X, e.g. 5 seconds. That is, a call from a number A to a number B that the customer has at a point in time that differs from the provider by a maximum of 5 seconds should be considered the same call and matched.
The output files I want to obtain are:
Matcheted calls with the columns of the client file.
Matcheted calls with the columns of the supplier's file. The objective is to join this file with the previous one (paste it in excel one next to the other, for example) and once the match is done, compare the duration of the calls and see where there is a difference in duration. That is, this file 2 has to have the same dimension as file 1.
Calls that the client has but not the provider.
Calls that the provider has but the client does not have.
The code I am presenting must have an error somewhere because when I try with examples I don't get the same dimension in file 1 and 2.
library("readxl")
library("dplyr")
library("writexl")
datavendor <- read_excel("CDRS_VENDOR_1MARCH.xlsx")
dataclient <- read_excel("CDRS_CLIENT_1MARCH.xlsx")
# Perform type conversions
dataclient$ANUM <- as.numeric(dataclient$ANUM)
datavendor$ANUM <- as.numeric(datavendor$ANUM)
dataclient$BNUM <- as.numeric(dataclient$BNUM)
datavendor$BNUM <- as.numeric(datavendor$BNUM)
dataclient$SECONDS <- as.numeric(dataclient$SECONDS)
datavendor$SECONDS <- as.numeric(datavendor$SECONDS)
# Define the function to perform join and assignment operations
match_and_assign <- function(data1, data2, time_offset) {
for (i in 0:time_offset) {
temp_data2 <- data2 %>%
mutate(SECONDS = SECONDS + i)
ok_data <- semi_join(data1, temp_data2, by = c("ANUM", "BNUM", "SECONDS"))
anti_data1 <- anti_join(data1, temp_data2, by = c("ANUM", "BNUM", "SECONDS"))
data1 <- anti_data1
if (i == 0) {
match_data <- ok_data
} else {
match_data <- rbind(match_data, ok_data)
}
}
return(list(match_data = match_data, unmatched_data = data1))
}
# Define the maximum value of time deviation
max_time_offset <- 5
# Perform union and assignment operations with a definite function
result_vendor <- match_and_assign(datavendor, dataclient, max_time_offset)
result_client <- match_and_assign(dataclient, datavendor, max_time_offset)
# Accessing the desired results
okvendor <- result_vendor$match_data
okclient <- result_client$match_data
unmatched_vendor <- result_vendor$unmatched_data
unmatched_client <- result_client$unmatched_data
# Write the files
write_xlsx(okclient,"/Users/Downloads/Matched_client.xlsx")
write_xlsx(okvendor,"/Users/Downloads/Matched_vendor.xlsx")
write_xlsx(unmatched_client,"/Users/Downloads/Unmatched_client.xlsx")
write_xlsx(unmatched_vendor,"/Users/Downloads/Unmatched_vendor.xlsx")
dput(head(dataclient))
structure(list(`Event time` = structure(c(1677628826, 1677628847, 1677628909, 1677628880, 1677628849, 1677628874), tzone = "UTC", class = c("POSIXct", "POSIXt")), ANUM = c(20795883477, 22765981865, 21785847031, 22774133419, 22784534311, 25784234311), BNUM = c(26774521101, 20785899196, 22734269069, 22795950021, 28794493816, 22715594275), SECONDS = c(3244, 3265, 3327, 3298, 3267, 3292), HORA = c(0, 0, 0, 0, 0, 0), MINUTO = c(0, 0, 1, 1, 0, 1), SEGUNDO = c(26, 47, 49, 20, 49, 14)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
dput(head(datavendor))
structure(list(BNUM = c(21795985477, 21785961865, 22735847031, 22774137419, 25788234311, 22724234311), ANUM = c(22776521101, 22787899196, 23764269869, 21793650021, 22794443816, 22745594275), connecttime = c("2023-03-01 00:00:37.084+00", "2023-03-01 00:00:45.091+00", "2023-03-01 00:01:03.677+00", "2023-03-01 00:01:08.593+00", "2023-03-01 00:01:15.838+00", "2023-03-01 00:01:18.033+00"), SECONDS = c(37, 45, 63, 68, 75, 78), HORA = c("00", "00", "00", "00", "00", "00"), MINUTO = c("00", "00", "01", "01", "01", "01"), SEGUNDO = c("37", "45", "03", "08", "15", "18")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))