-1

From looking through Stackoverflow, and other sources, I believe that changing my dataframes to data.tables and using setkey, or similar, will give what I want. But as of yet I have been unable to get a working Syntax.

I have two data frames, one containing 26000 rows and the other containing 6410 rows.

The first dataframe contains the following columns:

Customer name, Base_Code, Idenity_Number, Financials

The second dataframe holds the following:

Customer name, Base_Code, Idenity_Number, Financials, Lapse

Both sets of data have identical formatting.

My goal is to join the Lapse column in the second dataframe to first dataframe. The issue I have is that the numeric value in Financials does not match between the two datasets and I only want the closest match in DF1 to have the value in the Lapse column in DF2 against it.

There will be examples where there are multiple entries for the same customer ID and Base Code in each dataframe, so I need to merge the two based on Idenity_Number and Base_Code (which is exact) and then match against the nearest financial numeric match for each entry only.

There will never be more entries in the DF2 then held within DF1 for each Customer and Base_Code.

Here is an example of DF1:

DF1

Here is an example of DF2:

DF2

And finally, here is what I want end up with:

Desired Outcome

If we use Jessica Rabbit as the example we have a match against DF1 and DF2, the financial value of 1240 from DF1 was matched against 1058 in DF2 as that was the closest match.

Frank
  • 66,179
  • 8
  • 96
  • 180
Saarek
  • 119
  • 2
  • 11
  • 3
    Maybe read `?data.table` re the `roll="nearest"` and `mult=` options. If your join conditions are sufficiently messy, maybe look into "fuzzy joins" (I'm not sure what R tools are best for those...). Without a concrete example illustrating the set of issues you're facing, it will be hard to give an answer that includes code. – Frank Dec 04 '18 at 15:47
  • @Frank Thanks for taking a look Frank, I'm new to R and this is the first time I have attempted something like this. I've added examples to my original post, sorry it would not let me paste it in but insisted that they had to be pictures. Hopefully it will give you any clarificaiton that you need. – Saarek Dec 05 '18 at 09:40
  • Please don't paste pictures of data. Please read https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and make it clear how you want to handle the case where two lapse flags are closest to the same record in df1. – C8H10N4O2 Dec 05 '18 at 18:39
  • @C8H10N4O2 Sorry, I tried to paste it as code or text, but stackoverflow insisted that it had to be a picture. I've read the guidelines you have sent and will look to follow the create dataframe rule in the future. – Saarek Dec 06 '18 at 08:58
  • It would be nice if anyone down voting could give a reason as to why they have down voted. If the answer is so simple that you feel the need to down vote how about supplying the answer to someone who is new to R. – Saarek Dec 06 '18 at 08:59

1 Answers1

0

I could not work out how to get a working solution using data.table, so I re-thought my approach and have come up with a solution.

First of all I merged the two datasets, and then removed any entries that did not have a stauts of "LAP", this gave me all of the NON Lapsed entries:

NON_LAP <- merge(x=Merged,y=LapsesMonth,by=c("POLICY_NO","LOB_BASE"),all.x=TRUE)

NON_LAP <- NON_LAP [!grepl("LAP", NON_LAP$Status, ignore.case=FALSE),]

Next I merged again, this time looking specifically for the lapsed cases. To work out which was the cloest match I used the abs function, then I ordered by the lowest difference to get the closest matches in order. Finally I removed duplicates to show the closest matches and then also kept duplicates and stripped out the "LAP" status to ensure those that were not the closest match remained in the data.

Finally I merged them all together giving me the required outcome.

FIND_LAP  <- merge(x=Merged,y=LapsesMonth,by=c("POLICY_NO","LOB_BASE"),all.y=FALSE)

FIND_LAP$Difference <- abs(FIND_LAP$GWP - FIND_LAP$ACTUAL_PRICE)

FIND_LAP  <- FIND_LAP[order( FIND_LAP[,27] ),]

FOUND_LAP <- FIND_LAP [!duplicated(FIND_LAP[c("POLICY_NO","LOB_BASE")]),]

NOT_LAP <- FIND_LAP [duplicated(FIND_LAP[c("POLICY_NO","LOB_BASE")]),]

Hopefully this will help someone else who might be new to R and encounters the same issue.

Saarek
  • 119
  • 2
  • 11