Add Column Based on Matching Multiple Conditionals and Date Ranges from Multiple Data Sets

Question

I've been struggling with trying to find the best way of solving this problem.

To generalize this problem and help others who may find themselves needing to perform similar tasks, I am trying to find the best way to add columns to one data set from third, that is based on matching in an intermediary data set, AND belonging in a date range of a third dataset. The end result would be returning matching values from the third data set into the first.

Here are the heads of sample data frames to add a bit of clarity:

> head(SalesData, 10)
   sale_id sale_amt int_rate  sale_date sale_status
1        1     7000    10.71 2008-05-01  Fully Paid
2        2    10800    13.57 2009-11-01  Fully Paid
3        3     7500    10.08 2008-04-01  Fully Paid
4        4     3000    14.26 2009-09-01  Fully Paid
5        5     5600    14.96 2010-02-01 Charged Off
6        6     2800    11.49 2010-08-01  Fully Paid
7        7    10000     8.59 2009-10-01  Fully Paid
8        8    18000    10.39 2008-03-01  Fully Paid
9        9     5000    15.13 2008-04-01  Fully Paid
10      10     9600    12.29 2008-03-01  Fully Paid

> head(EmployeeSales, 10)
   sale_id empl_name empl_num
1        1    Dakota        4
2        2    Dakota        4
3        3      Kami        9
4        4      Adel        1
5        5      Adel        1
6        6     Farah        6
7        7      Kami        9
8        8      Kami        9
9        9       Ida        7
10      10      Kami        9

> head(EmployeeMap, 10)
   empl_num empl_name skill_lvl team start_date   end_date
1         1      Adel       Beg  Red 2007-06-01 2008-05-31
2         1      Adel       Int  Red 2008-06-01 2010-10-31
3         1      Adel       Adv  Red 2010-11-01 2999-12-12
4         2    Bailey       Beg Blue 2010-08-01 2011-04-30
5         2    Bailey       Beg  Red 2011-05-01 2999-12-12
6         3     Casey       Beg Blue 2010-08-01 2010-12-31
7         3     Casey       Int Blue 2011-01-01 2999-12-12
8         4    Dakota       Beg  Red 2007-06-01 2009-08-30
9         4    Dakota       Int  Red 2009-09-01 2010-08-30
10        4    Dakota       Adv  Red 2010-09-01 2011-08-30

The desired output would add the empl_num, sales_team and skill_level from EmployeeMap to the SalesData for every sale_id.

In trying to conceptualize the steps, this is what I am thinking, but perhaps there is a better way: Take the sale_id from SalesData, match it to the sale_id in Employee Sales and get the empl_num. Take the empl_num and match it to empl_num in Employee Map. Now we need to take the sale_date from SalesData and find which range of "start_date, end_date" it falls into. Then we would take the team and skill level that matches, and add that to the SalesData.

See table below:

 > head(df2,10)
    sale_id sale_amt int_rate  sale_date sale_status empl_num  team skill_lvl
 1        1     7000    10.71 2008-05-01  Fully Paid        4   Red       Beg
 2        2    10800    13.57 2009-11-01  Fully Paid        4   Red       Int
 3        3     7500    10.08 2008-04-01  Fully Paid        9  Blue       Beg
 4        4     3000    14.26 2009-09-01  Fully Paid        1   Red       Int
 5        5     5600    14.96 2010-02-01 Charged Off        1   Red       Int
 6        6     2800    11.49 2010-08-01  Fully Paid        6   Red       Beg
 7        7    10000     8.59 2009-10-01  Fully Paid        9  Blue       Int
 8        8    18000    10.39 2008-03-01  Fully Paid        9  Blue       Beg
 9        9     5000    15.13 2008-04-01  Fully Paid        7  Blue       Beg
 10      10     9600    12.29 2008-03-01  Fully Paid        9  Blue       Int

What is complicating this for me is that in the EmployeeMap, the start_date and end_date tell us the date that each employee started and ended belonging to a particular skill level and team. But each employee has changed skill level and/or team so every employee has multiple rows.

For example, in EmployeeMap for empl_id 1, we can see 3 rows telling us their start_date and end_date while they had a skill_level Beg, Int, Adv all on Red Team. But some, like empl_id 2 change team while staying in the same skill level. And others change skill level and team.

I would appreciate any insight you may have into the best way to solve this problem.

score 1 · Answer 1 · answered May 06 '18 at 00:49

Perhaps the easiest way to accomplish this is with two SQL-like joins (I suggest you give something like this a read if you're not familiar with joins/relational algebra).

Many joins can be performed with the merge function in base R, and many other popular packages (dplyr, data.table, sqldf, to name a few) offer alternative syntax or extended functionality in join operations.

The first of your two joins (between SalesData and EmployeeSales) can easily be accomplished with merge:

merge(SalesData, EmployeeSales, by = "sale_id")

#    sale_id sale_amt int_rate  sale_date sale_status empl_name empl_num
# 1        1     7000    10.71 2008-05-01  Fully Paid    Dakota        4
# 2        2    10800    13.57 2009-11-01  Fully Paid    Dakota        4
# 3        3     7500    10.08 2008-04-01  Fully Paid      Kami        9
# ...

The second join, however, is more complicated, since it's not a typical equi-join. Instead, the join logic needs to find rows in EmployeeMap where start_date is less than sale_date and end date is greater than it (in addition to the equality condition on empl_num).

Fortunately, the aforementioned data.table package provides the ability to apply said logic.

library(data.table)

# convert all three dataframes to data.table objects
setDT(SalesData) ; setDT(EmployeeSales) ; setDT(EmployeeMap)

EmployeeMap[SalesData[EmployeeSales[, c("sale_id","empl_num")],
                      on = "sale_id"], 
            on = .(empl_num, start_date <= sale_date, end_date >= sale_date)]

#    empl_num empl_name skill_lvl team start_date   end_date sale_id sale_amt int_rate sale_status
# 1:        4    Dakota       Beg  Red 2008-05-01 2008-05-01       1     7000    10.71  Fully Paid
# 2:        4    Dakota       Int  Red 2009-11-01 2009-11-01       2    10800    13.57  Fully Paid
# 3:        9        NA        NA   NA 2008-04-01 2008-04-01       3     7500    10.08  Fully Paid
# ...

Note that all three date columns should be Date types, and not strings, for the comparison to work. Also note that the NA values in the output above are a result of the snapshot of EmployeeMap provided in the question, which only maps empl_num 1-4.

I would also suggest reading the answers from this question for more context on how to join on a date range.

score 0 · Accepted Answer · answered May 06 '18 at 01:35

Consider running merge twice then subset by dates. Below nests the calls together in a long one-liner but can be broken apart in separate lines. Also, output is smaller than your desired result since your posted data are sample rows.

# MERGE TWICE AND SUBSET BY DATE
finaldf <- subset(merge(merge(SalesData, EmployeeSales, by="sale_id"), 
                        EmployeeMap, "empl_num", suffixes=c('', '_')),
                  sale_date >= start_date & sale_date <= end_date)

# SELECT NEEDED COLUMNS
finaldf <- finaldf[c("sale_id", "sale_amt", "int_rate", "sale_date", 
                     "sale_status", "empl_num", "team", "skill_lvl")]

# RE-ORDER BY SALE_ID AND RESET ROW NAMES
finaldf <- with(finaldf, finaldf[order(sale_id),])
row.names(finaldf) <- NULL

finaldf
#   sale_id sale_amt int_rate  sale_date sale_status empl_num team skill_lvl
# 1       1     7000    10.71 2008-05-01  Fully Paid        4  Red       Beg
# 2       2    10800    13.57 2009-11-01  Fully Paid        4  Red       Int
# 3       4     3000    14.26 2009-09-01  Fully Paid        1  Red       Int
# 4       5     5600    14.96 2010-02-01 Charged Off        1  Red       Int

G. Grothendieck · Answer 3 · 2018-05-06T12:19:04.527

In SQL terms this is a 3-way join. It can be done in a single SQL select like this:

library(sqldf)

sqldf("
  select s.*, es.empl_num, em.team, em.skill_lvl
  from SalesData s
  left join EmployeeSales es 
    using (sale_id)
  left join EmployeeMap em
    on es.empl_num = em.empl_num and s.sale_date between em.start_date and em.end_date
")

Using the data in the Note at the end (based on data shown in question) we get the following. Only the first 4 employee numbers exist in the EmployeeMap data shown in the question and the left joins ensure we get NA values for the team and skill level of the others instead of those SalesData rows being dropped due to non-matching.

   sale_id sale_amt int_rate  sale_date sale_status empl_num team skill_lvl
1        1     7000    10.71 2008-05-01  Fully Paid        4  Red       Beg
2        2    10800    13.57 2009-11-01  Fully Paid        4  Red       Int
3        3     7500    10.08 2008-04-01  Fully Paid        9 <NA>      <NA>
4        4     3000    14.26 2009-09-01  Fully Paid        1  Red       Int
5        5     5600    14.96 2010-02-01 Charged Off        1  Red       Int
6        6     2800    11.49 2010-08-01  Fully Paid        6 <NA>      <NA>
7        7    10000     8.59 2009-10-01  Fully Paid        9 <NA>      <NA>
8        8    18000    10.39 2008-03-01  Fully Paid        9 <NA>      <NA>
9        9     5000    15.13 2008-04-01  Fully Paid        7 <NA>      <NA>
10      10     9600    12.29 2008-03-01  Fully Paid        9 <NA>      <NA>

Note

Input data in reproducible form:

SalesData <- structure(list(sale_id = 1:10, sale_amt = c(7000L, 10800L, 7500L, 
3000L, 5600L, 2800L, 10000L, 18000L, 5000L, 9600L), int_rate = c(10.71, 
13.57, 10.08, 14.26, 14.96, 11.49, 8.59, 10.39, 15.13, 12.29), 
    sale_date = structure(c(3L, 6L, 2L, 4L, 7L, 8L, 5L, 1L, 2L, 
    1L), .Label = c("2008-03-01", "2008-04-01", "2008-05-01", 
    "2009-09-01", "2009-10-01", "2009-11-01", "2010-02-01", "2010-08-01"
    ), class = "factor"), sale_status = structure(c(2L, 2L, 2L, 
    2L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("Charged Off", "Fully Paid"
    ), class = "factor")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))

EmployeeSales <-
structure(list(sale_id = 1:10, empl_name = structure(c(2L, 2L, 
5L, 1L, 1L, 3L, 5L, 5L, 4L, 5L), .Label = c("Adel", "Dakota", 
"Farah", "Ida", "Kami"), class = "factor"), empl_num = c(4L, 
4L, 9L, 1L, 1L, 6L, 9L, 9L, 7L, 9L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))

EmployeeMap <- structure(list(empl_num = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 
4L), empl_name = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 
4L, 4L), .Label = c("Adel", "Bailey", "Casey", "Dakota"), class = "factor"), 
    skill_lvl = structure(c(2L, 3L, 1L, 2L, 2L, 2L, 3L, 2L, 3L, 
    1L), .Label = c("Adv", "Beg", "Int"), class = "factor"), 
    team = structure(c(2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L
    ), .Label = c("Blue", "Red"), class = "factor"), start_date = structure(c(1L, 
    2L, 6L, 4L, 8L, 4L, 7L, 1L, 3L, 5L), .Label = c("2007-06-01", 
    "2008-06-01", "2009-09-01", "2010-08-01", "2010-09-01", "2010-11-01", 
    "2011-01-01", "2011-05-01"), class = "factor"), end_date = structure(c(1L, 
    4L, 8L, 6L, 8L, 5L, 8L, 2L, 3L, 7L), .Label = c("2008-05-31", 
    "2009-08-30", "2010-08-30", "2010-10-31", "2010-12-31", "2011-04-30", 
    "2011-08-30", "2999-12-12"), class = "factor")), class = "data.frame", 
    row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Add Column Based on Matching Multiple Conditionals and Date Ranges from Multiple Data Sets

3 Answers3

Note

Linked