0

Note: I have working code, just wondering if there are better methods to implement. My goal to to try to minimize run time if possible. Code is in R currently, would swapping to Python or C++ help speed things up in this case?

I have two tables: one is a table of payments and the other a table of contracts. Each table, at minimum, shares: account numbers, company codes, start dates and end dates (payment period dates in the payment table and contract period dates in the contract table). My objective is to create a third table of payments that cannot be matched/allocated to a contract. A payment matches/allocates to a contract if all three of the following conditions are true:

  1. Both payment and contract have matching account number
  2. Both payment and contract have matching company codes
  3. Payment start date falls within contract start and end dates

Payment Table

ID AccountNumber CompanyCode StartDate EndDate
1 123 35 1Jan22 1Feb22
2 123 37 1Mar19 1May19
3 123 35 1Apr23 1Jun23
4 123 35 1Mar22 1Apr22

Contract Table

ID Account Number Company Code StartDate EndDate
1 123 35 1Dec21 1Dec22
2 123 37 1Dec18 1Dec19

Sample data for the purpose of finding answers:

# Create Payment Table
PaymentTable <- data.frame(
  ID = c(1, 2, 3, 4),
  AccountNumber = c(123, 123, 123, 123),
  CompanyCode = c(35, 37, 35, 35),
  StartDate = dmy(c("1Jan22", "1Mar19", "1Apr23", "1Mar22")),
  EndDate = dmy(c("1Feb22", "1May19", "1Jun23", "1Apr22"))
)

# Create Contract Table
ContractTable <- data.frame(
  ID = c(1, 2),
  AccountNumber = c(123, 123),
  CompanyCode = c(35, 37),
  StartDate = dmy(c("1Dec21", "1Dec18")),
  EndDate = dmy(c("1Dec22", "1Dec19"))
)

In the above example data set, Payment IDs 1 and 4 match to Contract ID 1. Payment ID 3 matches to Contract ID 2. Payment ID 3 does not match to any contract.

So the conditions are:

  1. Check if for any ID of the payment dates are in between contract ID's dates
  2. Check if IDs, AccountNumber, and CompanyCode and are matching in both the tables

When above two conditions are met, merge the table.

I have simple working R code via what I call the naïve method of nested for loops. The outer for loop iterating thru every line of payment and the inner for loop iterating thru every contract. It takes about 10ish minutes for the code to complete 1/5 of all the lines of payments I have. Are there any cleaver/advanced algorithms that can be applied in this situation to improve computation speed?

Manoj Kumar
  • 5,273
  • 1
  • 26
  • 33
Choy
  • 13
  • 2
  • 2
    Sounds like this could be achieved using `merge` or `dplyr` joins, followed by a filter for the start date condition. But it will be easier to help if you [make this question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including small representative datasets and example output in a plain text format, as well as all the relevant code. – neilfws Apr 13 '23 at 04:23
  • 1
    Sounds like you want an overlap join. See `?dplyr::join_by`. – Ritchie Sacramento Apr 13 '23 at 04:24
  • Also see https://stackoverflow.com/q/1299871/17303805 for an overview of merge operations in R. – zephryl Apr 13 '23 at 04:29
  • I think people who closed this post did not understand the exact requirements... Choy asked for certain conditions to be checked before merging the two tables.. – Manoj Kumar Apr 13 '23 at 05:30
  • @ManojKumar - I understood the requirements. They are met via the join specifications of an overlap join. No need to 'pre-check' the data. – Ritchie Sacramento Apr 13 '23 at 05:55
  • `PaymentTable %>% inner_join(ContractTable, join_by(ID, AccountNumber, CompanyCode, StartDate >= StartDate, StartDate <= EndDate))` – Jon Spring Apr 13 '23 at 07:06

0 Answers0