0

I have two data tables in R; Table 1 is smaller and has about 4,000 rows and 4 cols. These cols are ID, Report_Date, Start_date, and End_date. The other table, Table 2, is a larger table which has thousands of rows and 8 cols. Both tables have the ID col but this ID col can have multiple rows in both tables. There is a date column in Table 2 as well and a col called "Value". The rest of the columns in Table 2 are not of use for this task.

For each row in Table 1, I need to calculate standard deviation of the col "Value" from Table 2 based on 2 conditions; 1. ID has to match, 2. The date range is within the interval defined by Start_Date and End_Date in Table 1. How can I do this in R? I have attached a small sample of both tables below. In the example below, I would need the standard deviation of Value for A1 for the date range in Start_Date and End_Date so STD_Value(A1, report date of 2008/10/02)= sd(10,11,11).

Sample Table 1 looks like this:

ID Report_Date Start_Date End_Date
A1 2008/10/02 2008/09/27 2008/09/30
A1 2008/11/02 2008/10/27 2008/10/30
A2 2008/02/02 2008/01/15 2008/01/17

Sample Table 2 looks like this:

ID Date Value
A1 2008/09/27 10
A1 2008/09/28 11
A1 2008/09/30 11
A1 2008/10/01 12
A1 2008/10/02 13
A2 2008/01/14 5
A2 2008/01/15 4
A2 2008/01/16 3
A2 2008/01/17 5
A2 2008/01/18 5
  • Perhaps you might want to use `sqldf`, `data.table`, or `fuzzy_join` packages and merge the two tables, then calculate `sd` by group based on `ID` and `Report_Date`. You can see some examples of merging based on ranges of dates [here](https://stackoverflow.com/questions/49079412/r-merge-by-id-and-date-between-two-dates) or [here](https://stackoverflow.com/questions/23958975/join-tables-by-date-range). – Ben Aug 20 '21 at 17:59

1 Answers1

0

Loading required packages

library(dplyr)
library(tidyverse)
library(fuzzyjoin)

Loading our two tables

first.table <- read.table('Table_1.csv', header = TRUE,  sep = ',',  stringsAsFactors = FALSE)
second.table <- read.table('Table_2.csv', header = TRUE,  sep = ',',  stringsAsFactors = FALSE)

Formatting the dates to work with

first.table$Start_Date <- as.Date(first.table$Start_Date, "%m/%d/%y")
first.table$End_Date <- as.Date(first.table$End_Date, "%m/%d/%y")
first.table$Report_Date <- as.Date(first.table$Report_Date, "%m/%d/%y")
second.table$Date <- as.Date(second.table$Date, "%m/%d/%y")

Fuzzy merging the two tables since inexact columns exist

Source: https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf

merged.table = fuzzy_left_join(
first.table, second.table,
by = c(
    "ID" = "ID",
    "Start_Date" = "Date",
    "End_Date" = "Date"
),
match_fun = list(`==`, `<=`, `>=`))

Outputting the result

final.table = merged.table %>% group_by(ID.y) %>% mutate(Standard_Deviation = sd(Value, na.rm = TRUE))
write_csv(final.table, "final_table.csv")

Sample output:

See csv screenshot sample here