0

I have two tables: One with participants and one with an encoding of scores based on birth dates. The score table looks like this:

score_table

Key        |  Value
--------------------
01/01/1900 |  15
01/01/1940 |  25
01/01/1950 |  30

All participants with birth dates between 01/01/1900 and 01/01/1940 should get a score of 15. Participants born between 01/01/1940 and 01/01/1950 should get a score of 25, etc.

My participants' table looks like this:

participant_table

BirthDate    |   Gender
-----------------------
05/05/1930   |   M
02/07/1954   |   V
01/11/1941   |   U

I would like to add a score to get the output table:

BirthDate    |   Gender   |   Score
------------------------------------
05/05/1930   |   M        |   15
02/07/1954   |   V        |   30
01/11/1941   |   U        |   25

I built several solutions for similar problems when the exact values are in the score table (using dplyr::left_join or base::match) or for numbers which can be rounded to another value. Here, the intervals are irregular and the dates arbitrary.

I know I can build a solution by iterating through the score table, using this method:

as.Date("05/05/1930", format="%d/%m/%Y) < as.Date("01/01/1900", format="%d/%m/%Y)

Which returns a Boolean and thus allows me to walk through the scores until I find a date which is bigger and then use the last score. However, there must be a better way to do this.

Maybe I can create some sort of bins from the dataframe, as such:

Bin 1           | Bin 2           | Bin 3
Date 1 : Date 2 | Date 2 : Date 3 | Date 3 : inf

But I don't yet see how. Does anyone see an efficient way to create such bins from a dataframe, so that I can efficiently retrieve scores from this table?

MRE:

Score table:

structure(list(key=c("1/1/1900", "2/1/2013", "2/1/2014","2/1/2015", "4/1/2016", "4/1/2017"), value=c(65,65,67,67,67,68)), row.names=1:6, class="data.frame")

Participant File:

structure(list(birthDate=c("10/10/1968", "6/5/2015","10/10/2017"), Gender=c("M", "U", "F")), row.names=1:3, class="data.frame")

Goal File:

structure(list(birthDate=c("10/10/1968", "6/5/2015","10/10/2017"), Gender=c("M", "U", "F"), Score = c(65,67,68)), row.names=1:3, class="data.frame")

MilanV
  • 72
  • 6
  • It looks like the scores are associated with decades?! Can't we just add a new variable, denoting the decade (e.g 1930, 1940) and use that to join the two data sets? – dario Mar 09 '20 at 15:59
  • If you add a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) I'll give it a try. (A MRE makes it easier for others to find and test a answer to your question) – dario Mar 09 '20 at 16:00
  • I added a simple MRE – MilanV Mar 13 '20 at 10:42

2 Answers2

0

Here is an approach using lag() along with sqldf:

score_table$Key2 <- as.Date(lead(score_table$Key), format="%d/%m/%Y")
score_table$Key <- as.Date(score_table$Key, format="%d/%m/%Y")
names(score_table) <- c("Date1", "Value", "Date2")
participant_table$BirthDate <- as.Date(participant_table$BirthDate, format="%d/%m/%Y")
sql <- "SELECT p.BirthDate, p.Gender, s.Value AS Score
        FROM participant_table p
        INNER JOIN score_table s
            ON (p.BirthDate >= s.Date1 OR s.Date1 IS NULL) AND
               (p.BirthDate < s.Date2 OR s.Date2 IS NULL)"
participant_table <- sqldf(sql)

The logic here is to join the participant to the score table using a range of matching dates in the latter. For the edge cases of the first and last rows of the score table, we allow a missing date in either column to represent any date whatsoever. For example, in the last row of the score table, the only requirement for a match is that a date be greater than the lower portion of the range.

I actually do not have R running locally at the moment, but here is a demo link to SQLite showing that the SQL logic works correctly:

screen capture of demo link below

Demo

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • Consider defining `Key2` in the SQL steatement: `participant_table$BirthDate <- as.Date(participant_table$BirthDate, format = "%d/%m/%Y"); score_table$Key <- as.Date(score_table$Key, format = "%d/%m/%Y"); sql <- "with s2 as ( select *, lag(Key, -1) over() as Key2 from score_table ) SELECT p.BirthDate, p.Gender, s.Value AS Score FROM participant_table p INNER JOIN s2 s ON (p.BirthDate >= s.Key OR s.Key IS NULL) AND (p.BirthDate < s.Key2 OR s.Key2 IS NULL)"; sqldf(sql)` – G. Grothendieck Mar 11 '20 at 12:19
  • The Logic is exactly what I am looking for! However, The Minicran Snapshot which I have to use does not include sqldf... I am going to try and get that organised though, as I don't see a clear R-way of solving the problem. – MilanV Mar 13 '20 at 11:39
0

I have found a very simple solution using only arithmetic.

In order to retrieve a score, I check how many numbers are superseded by the input date:

rownum <- sum(as.Date(input_date, format="%d/%m/%Y") > 
              as.Date(score_table$Key, format="%d/%m/%Y"))

Then, the corresponding key can be found using:

score <- score_table[["Value"]][rownum]

Thus, the spacing of the dates becomes irrelevant and it works quite fast. I thought I'd share my solution in case it might be of use. Thanks everyone for the effort and responses!

MilanV
  • 72
  • 6