Approach
Here's a solution with data.table
, as preferred:
I would prefer a solution with data.table but any solutions at all are much appreciated!
While dplyr
and fuzzyjoin
might appear more elegant, they might also prove less efficient with sufficiently large datasets.
Credit goes to ThomasIsCoding for beating me to the punch on this other question, with an answer that harnesses igraph
to index networks in graphs. Here, the networks are the separate "chains" (Wanted
groups) comprised of "links" (data.frame
rows), which are joined by their "closeness" (between their Start_Date
s and End_Date
s). Such an approach seemed necessary to model the transitive relationship ℛ requested here
I am trying to create the chain of "close" links so that I can map A's movements over time.
with care to also preserve the symmetry of ℛ (see Further Reading).
Per that same request
So I would ideally like to flag situations where one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31) and vice versa.
and your further clarification
...I would want another column that indicates that [flag].
I have also included a Flag
column, to flag each row whose Start_Date
is matched by the End_Date
s of at least flag_at
other rows; or vice versa.
Solution
Using your sample data.frame
, reproduced here as my_data_frame
# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
class = "Date"),
End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
class = "Date")),
row.names = c(NA, -7L),
class = "data.frame")
we apply data.table
and igraph
(among other packages) as follows:
library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)
# ...
# Code to generate your data.frame 'my_data_frame'.
# ...
# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()
# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)
# Set the minimum number of matches for an row to be flagged: 2.
flag_at <- 2
#####################################
# BEGIN: Start Indexing the Groups. #
#####################################
# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
########################################################
# STEP 1: Link the Rows That Are "Close" to Each Other #
########################################################
# Prepare data.table for JOIN, by adding appropriate helper columns.
.[, `:=`(# Uniquely identify each row (by row number).
ID = .I,
# Boundary columns for tolerance threshold.
End_Low = End_Date - tolerance,
End_High = End_Date + tolerance)] %>%
# JOIN rows to each other, to obtain pairings.
.[my_data_table,
# Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
# close enough to (within the boundary columns for) the 'End_Date' of y.
.(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
# JOIN criteria:
on = .(# Only pair rows having the same name.
Name,
# Only pair rows whose start and end dates are within the tolerance
# threshold of each other.
End_Low <= Start_Date,
End_High >= Start_Date),
# Make it an OUTER JOIN, to include those rows without a match.
nomatch = NA] %>%
# Prepare pairings for network analysis.
.[# Ensure no row is reflexively paired with itself.
# NOTE: This keeps the graph clean by trimming extraneous loops, and it
# prevents an "orphan" row from contributing to its own tally of matches.
!(x.ID == y.ID) %in% TRUE,
# !(x.ID == y.ID) %in% TRUE,
# Simplify the dataset to only the pairings (by ID) of linked rows.
.(from = x.ID, to = y.ID)]
#############################
# PAUSE: Count the Matches. #
#############################
# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
# Include again the missing IDs for y that were never matched by the JOIN.
.[my_data_table[, .(ID)], on = .(to = ID)] %>%
# For each row y, count every other row x where x R y.
.[, .(Matches = sum(!is.na(from))), by = to] %>%
# Extract the count column.
.$Matches
# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%
# For each row x, count every other row y where x R y.
.[, .(Matches = sum(!is.na(to))), by = from] %>%
# Extract the count column.
.$Matches
#########################################
# RESUME: Continue Indexing the Groups. #
#########################################
# Resume indexing:
output <- output %>%
# Ignore nonmatches (NAs) which are annoying to process into a graph.
.[from != to, ] %>%
###############################################################
# STEP 2: Index the Separate "Chains" Formed By Those "Links" #
###############################################################
# Convert pairings (by ID) of linked rows into an undirected graph.
igraph::graph_from_data_frame(directed = FALSE) %>%
# Find all groups (subgraphs) of transitively linked IDs.
igraph::components() %>%
# Pair each ID with its group index.
igraph::membership() %>%
# Tabulate those pairings...
utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
# ...in a properly named data.table.
data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
#####################################################
# STEP 3: Match the Original Rows to their "Chains" #
#####################################################
# LEFT JOIN (on ID) to match each original row to its group index (if any).
.[my_data_table, on = .(ID)] %>%
# Transform output into final form.
.[# Sort into original order.
order(ID),
.(# Select existing columns.
Name, Start_Date, End_Date,
# Rename column having the group indices.
Wanted = Group_Index,
# Calculate column(s) to flag rows with sufficient matches.
Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]
# View results.
output
Result
The resulting output
is the following data.table
:
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
Keep in mind that the Flag
s are all FALSE
simply because your data lacks any Start_Date
matched by (at least) two End_Date
s; along with any End_Date
matched by (at least) two Start_Date
s.
Hypothetically, if we lowered flag_at
to 1
, then the output
would Flag
every row with even a single match (in either direction):
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 TRUE
4: A 2012-04-16 2015-01-09 1 TRUE
5: A 2002-06-01 2006-02-01 2 TRUE
6: A 2005-12-24 <NA> 2 TRUE
7: B 2018-01-23 <NA> NA FALSE
Warning
Because some data.table
operations modify by reference (or "in-place"), the value of my_data_table
changes throughout the workflow. After Step 1, my_data_table
becomes
Name Start_Date End_Date ID End_Low End_High
1: A 2015-01-01 2019-12-29 1 2018-12-29 2020-12-29
2: A 2017-03-25 <NA> 2 <NA> <NA>
3: A 2019-10-17 <NA> 3 <NA> <NA>
4: A 2012-04-16 2015-01-09 4 2014-01-09 2016-01-09
5: A 2002-06-01 2006-02-01 5 2005-02-01 2007-02-01
6: A 2005-12-24 <NA> 6 <NA> <NA>
7: B 2018-01-23 <NA> 7 <NA> <NA>
a structural departure from the my_data_frame
it initially copied.
Since dplyr
(among other packages) assigns by value rather than by reference, a dplyr
solution would sidestep this issue entirely.
As it is, however, you must take care when modifying the workflow, because the version of my_data_table
available before Step 1 cannot be recovered afterwards.
Further Reading
Although the JOIN
ing of data.table
s is explicitly directional — with a "right" side and a "left" side — this model manages to preserve the relational symmetry you described here
if...[either] one's 'Start_Date' is +- 1 year within the other observation's 'End_Date', they are classified as being in the same group.
via the use of an undirected graph.
When the JOIN
relates the 1st row (having a Start_Date
of 2015-01-01
) to the 4th row (having an End_Date
of 2015-01-09
), we gather that the Start_Date
of is "sufficiently close" to (within 1 year of) the End_Date
of . So we say mathematically that ℛ , or
"is in the same group as" .
However, the converse ℛ will not necessarily appear in the JOIN
ed data, because the Start_Date
of might not land so conveniently near the End_Date
of . That is, the JOIN
ed data will not necessarily indicate that
"is in the same group as" .
In the latter case, a strictly directed graph ("digraph") would not capture the common membership of and in the same group. You can observe this jarring difference by setting directed = TRUE
in the first line of Step 2
igraph::graph_from_data_frame(directed = TRUE) %>%
and also setting mode = "strong"
in the very next line
igraph::components(mode = "strong") %>%
to yield these disassociated results:
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 4 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 3 FALSE
4: A 2012-04-16 2015-01-09 5 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 1 FALSE
7: B 2018-01-23 <NA> NA FALSE
By contrast, the rows can be properly grouped via the use of an undirected graph (directed = FALSE
); or via more lenient criteria (mode = "weak"
). Either of these approaches will effectively simulate the presence of ℛ whenever ℛ is present in the JOIN
ed data.
This symmetric property is particularly important when modeling the behavior you describe here:
...one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31)...
In this situation, you want the model to recognize that any two rows and must be in the same group ( ℛ ), whenever their End_Date
s match the same Start_Date
of some other row : ℛ and ℛ .
So suppose we know that ℛ and ℛ . Because our model has preserved symmetry, we can say from ℛ that ℛ too. Since we now know that ℛ and ℛ , transitivity implies that ℛ . Thus, our model recognizes that ℛ whenever ℛ and ℛ ! Similar logic will suffice for "vice versa".
We can verify this outcome by using
my_data_frame <- my_data_frame %>%
rbind(list(Name = "A",
Start_Date = as.Date("2010-01-01"),
End_Date = as.Date("2015-01-05")))
to append an 8th row to my_data_frame
, prior to the workflow:
Name Start_Date End_Date
1 A 2015-01-01 2019-12-29
# ⋮ ⋮ ⋮ ⋮
4 A 2012-04-16 2015-01-09
# ⋮ ⋮ ⋮ ⋮
8 A 2010-01-01 2015-01-05
This 8th row serves as our , where is the 1st row and is the 4th row, as before. Indeed, the output
properly classifies and and as belonging to the same group 1
: ℛ .
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
8: A 2010-01-01 2015-01-05 1 FALSE
Likewise, the output
properly Flag
s the 1st row, whose Start_Date
is now matched by two End_Date
s: in the 4th and 8th rows.
Cheers!