Create group based on fuzzy criteria

Question

I have a data frame that looks like this:

Name   Start_Date   End_Date
A      2015-01-01   2019-12-29
A      2017-03-25   NA
A      2019-10-17   NA
A      2012-04-16   2015-01-09
A      2002-06-01   2006-02-01
A      2005-12-24   NA
B      2018-01-23   NA

I want to create a column such that, if two observations have the same Name, and one's Start_Date is ±1 year within the other observation's End_Date, they are classified as being in the same group.

Desired output:

Name   Start_Date   End_Date    Wanted
A      2015-01-01   2019-12-29  1
A      2017-03-25   NA          NA
A      2019-10-17   NA          1
A      2012-04-16   2015-01-09  1
A      2002-06-01   2006-02-01  2
A      2005-12-24   NA          2
B      2018-01-23   NA          NA

I am searching for a solution with data table but solving my problem would be enough.

Added: Row-by-row explanation
Row:

Start date is 8 days (< 1 year) before end date for row 4. It is in the same group as row 4.
Start date is 2+ years after row 1's end date. Is not in the same group as row 1. Same with row 4, 5. It is not in the same group as those two either.
Start date is 2 months (< 1 year) before end date for row 1. It is in the same group as row 1.
See row 1.
See below.
Start date is 3 months ( < 1 year) before end date for row 5. It is in the same group as row 5.
No other name B to compare to. It is in its own group.

Therefore, rows 1, 3 and 4 are in the same group. Row 5 and 6 are in the same group. Row 2 and 7 do not have groups.

EDIT: I have updated my code to have consistent Wanted category when an observation does not get matched with another.

the end date is NA so how do you compare the start date to end date? — Onyambu, Jul 12 '21 at 21:08
@Onyambu Is there a more efficient approach than a fuzzy or non-equi `join`, followed by generating a network index with `igraph`, and then grouping the dataset by that index? — Greg, Jul 12 '21 at 21:11
@Onyambu I only want to compare to end dates which are not NA. So I would be comparing the 2nd, 3rd and 6th observations to the 1st, 4th and 5th observations only. I would ` not be comparing the 7th observation to any others since there are no other observations with 'Name' == 'B'. In that case, I would want to automatically classify it as its own group. — EconNoobie, Jul 12 '21 at 21:19
@EconNoobie Should this relationship be transitive? That is, if `"A"` | `2009-01-01` | `2015-01-02` | `...` is "fuzzily grouped" with `"A"` | `2016-01-01` | `NA` | `...`, which is "fuzzily grouped" with `"A"` | `2016-12-31` | `2017-01-01` | `...` in turn, then should they **all** be part of the same group? In theory, this could lead to a "chain" of "close" "links", where those small differences add up from "link" to "link", and ultimately span a decade or more between the first and last "links"—even though they are all part of the same "fuzzy group". — Greg, Jul 12 '21 at 21:37
@Greg That's a good question. I am trying to create the chain of "close" links so that I can map A's movements over time. So I would ideally like to flag situations where one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31) and vice versa. — EconNoobie, Jul 12 '21 at 22:09
@Onyambu the 'End_Date' in row 4 is 9 days over the 'Start_Date' of row 1. Aka, row 1's start date is +- 1 year within row 4's end date. Row 1 and 4 fit the criteria and should be part of the same group. — EconNoobie, Jul 12 '21 at 22:11
Are you comparing Start_Date with ranges of Start_Dates +- 1yr? Or are you comparing End_Date with ranges of Start_Dates +- 1yr? Or are you comparing End_Date with the provided Start_Date to End_Date ranges? Or are you comparing Start_Date with Start_Date +- 1yr till the End_Date? Maybe you can explain the Wanted Column row by row in the OP. — chinsoon12, Jul 12 '21 at 23:41
@chinsoon12 I have added an edit to include row-by-row explanations. Please let me know if it helps. — EconNoobie, Jul 13 '21 at 01:52
@EconNoobie I just about have a `dplyr` solution to your problem, but I've got one last question: what is the convention for handling rows that are in no "fuzzy group" with other rows? Should they be left with `NA` (as you show in row 2), or should they be given a unique group number (as you show in row 7)? Currently, I have `NA`s in all such cases... — Greg, Jul 13 '21 at 15:20
@Greg I have updated my OP to be consistent and left them as NAs. Thanks for pointing that out! — EconNoobie, Jul 14 '21 at 14:49

Greg · Accepted Answer · 2021-07-14T20:28:39.710

Approach

Here's a solution with data.table, as preferred:

I would prefer a solution with data.table but any solutions at all are much appreciated!

While dplyr and fuzzyjoin might appear more elegant, they might also prove less efficient with sufficiently large datasets.

Credit goes to ThomasIsCoding for beating me to the punch on this other question, with an answer that harnesses igraph to index networks in graphs. Here, the networks are the separate "chains" (Wanted groups) comprised of "links" (data.frame rows), which are joined by their "closeness" (between their Start_Dates and End_Dates). Such an approach seemed necessary to model the transitive relationship ℛ requested here

I am trying to create the chain of "close" links so that I can map A's movements over time.

with care to also preserve the symmetry of ℛ (see Further Reading).

Per that same request

So I would ideally like to flag situations where one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31) and vice versa.

and your further clarification

...I would want another column that indicates that [flag].

I have also included a Flag column, to flag each row whose Start_Date is matched by the End_Dates of at least flag_at other rows; or vice versa.

Solution

Using your sample data.frame, reproduced here as my_data_frame

# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
                                Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
                                                       class = "Date"),
                                End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
                                                     class = "Date")),
                           row.names = c(NA, -7L),
                           class = "data.frame")

we apply data.table and igraph (among other packages) as follows:

library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)



# ...
# Code to generate your data.frame 'my_data_frame'.
# ...



# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()


# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)

# Set the minimum number of matches for an row to be flagged: 2.
flag_at <- 2



#####################################
# BEGIN: Start Indexing the Groups. #
#####################################

# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
  
  ########################################################
  # STEP 1: Link the Rows That Are "Close" to Each Other #
  ########################################################
  
  # Prepare data.table for JOIN, by adding appropriate helper columns.
  .[, `:=`(# Uniquely identify each row (by row number).
           ID = .I,
           # Boundary columns for tolerance threshold.
           End_Low = End_Date - tolerance,
           End_High = End_Date + tolerance)] %>%
    
  # JOIN rows to each other, to obtain pairings.
  .[my_data_table,
    # Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
    # close enough to (within the boundary columns for) the 'End_Date' of y.
    .(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
      y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
    # JOIN criteria:
    on = .(# Only pair rows having the same name.
           Name,
           # Only pair rows whose start and end dates are within the tolerance
           # threshold of each other.
           End_Low <= Start_Date,
           End_High >= Start_Date),
    # Make it an OUTER JOIN, to include those rows without a match.
    nomatch = NA] %>%
  
  # Prepare pairings for network analysis.
  .[# Ensure no row is reflexively paired with itself.
    #   NOTE: This keeps the graph clean by trimming extraneous loops, and it
    #   prevents an "orphan" row from contributing to its own tally of matches.
    !(x.ID == y.ID) %in% TRUE,
    # !(x.ID == y.ID) %in% TRUE,
    # Simplify the dataset to only the pairings (by ID) of linked rows.
    .(from = x.ID, to = y.ID)]



#############################
# PAUSE: Count the Matches. #
#############################

# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
  
  # Include again the missing IDs for y that were never matched by the JOIN.
  .[my_data_table[, .(ID)], on = .(to = ID)] %>%
  
  # For each row y, count every other row x where x R y.
  .[, .(Matches = sum(!is.na(from))), by = to] %>%
  
  # Extract the count column.
  .$Matches


# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%
  
  # For each row x, count every other row y where x R y.
  .[, .(Matches = sum(!is.na(to))), by = from] %>%
  
  # Extract the count column.
  .$Matches



#########################################
# RESUME: Continue Indexing the Groups. #
#########################################

# Resume indexing:
output <- output %>%
  
  # Ignore nonmatches (NAs) which are annoying to process into a graph.
  .[from != to, ] %>%
  
  ###############################################################
  # STEP 2: Index the Separate "Chains" Formed By Those "Links" #
  ###############################################################
  
  # Convert pairings (by ID) of linked rows into an undirected graph.
  igraph::graph_from_data_frame(directed = FALSE) %>%
  
  # Find all groups (subgraphs) of transitively linked IDs.
  igraph::components() %>%
  
  # Pair each ID with its group index.
  igraph::membership() %>%
  
  # Tabulate those pairings...
  utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
  
  # ...in a properly named data.table.
  data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
  
  
  
  #####################################################
  # STEP 3: Match the Original Rows to their "Chains" #
  #####################################################
  
  # LEFT JOIN (on ID) to match each original row to its group index (if any).
  .[my_data_table, on = .(ID)] %>%
  
  # Transform output into final form.
  .[# Sort into original order.
    order(ID),
    .(# Select existing columns.
      Name, Start_Date, End_Date,
      # Rename column having the group indices.
      Wanted = Group_Index,
      # Calculate column(s) to flag rows with sufficient matches.
      Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]



# View results.
output

Result

The resulting output is the following data.table:

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

Keep in mind that the Flags are all FALSE simply because your data lacks any Start_Date matched by (at least) two End_Dates; along with any End_Date matched by (at least) two Start_Dates.

Hypothetically, if we lowered flag_at to 1, then the output would Flag every row with even a single match (in either direction):

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1  TRUE
4:    A 2012-04-16 2015-01-09      1  TRUE
5:    A 2002-06-01 2006-02-01      2  TRUE
6:    A 2005-12-24       <NA>      2  TRUE
7:    B 2018-01-23       <NA>     NA FALSE

Warning

Because some data.table operations modify by reference (or "in-place"), the value of my_data_table changes throughout the workflow. After Step 1, my_data_table becomes

   Name Start_Date   End_Date ID    End_Low   End_High
1:    A 2015-01-01 2019-12-29  1 2018-12-29 2020-12-29
2:    A 2017-03-25       <NA>  2       <NA>       <NA>
3:    A 2019-10-17       <NA>  3       <NA>       <NA>
4:    A 2012-04-16 2015-01-09  4 2014-01-09 2016-01-09
5:    A 2002-06-01 2006-02-01  5 2005-02-01 2007-02-01
6:    A 2005-12-24       <NA>  6       <NA>       <NA>
7:    B 2018-01-23       <NA>  7       <NA>       <NA>

a structural departure from the my_data_frame it initially copied.

Since dplyr (among other packages) assigns by value rather than by reference, a dplyr solution would sidestep this issue entirely.

As it is, however, you must take care when modifying the workflow, because the version of my_data_table available before Step 1 cannot be recovered afterwards.

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      4 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      3 FALSE
4:    A 2012-04-16 2015-01-09      5 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      1 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

By contrast, the rows can be properly grouped via the use of an undirected graph (directed = FALSE); or via more lenient criteria (mode = "weak"). Either of these approaches will effectively simulate the presence of ℛ whenever ℛ is present in the JOINed data.

This symmetric property is particularly important when modeling the behavior you describe here:

...one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31)...

In this situation, you want the model to recognize that any two rows and must be in the same group ( ℛ ), whenever their End_Dates match the same Start_Date of some other row : ℛ and ℛ .

So suppose we know that ℛ and ℛ . Because our model has preserved symmetry, we can say from ℛ that ℛ too. Since we now know that ℛ and ℛ , transitivity implies that ℛ . Thus, our model recognizes that ℛ whenever ℛ and ℛ ! Similar logic will suffice for "vice versa".

We can verify this outcome by using

my_data_frame <- my_data_frame %>%
  rbind(list(Name = "A",
             Start_Date = as.Date("2010-01-01"),
             End_Date = as.Date("2015-01-05")))

to append an 8th row to my_data_frame, prior to the workflow:

    Name Start_Date   End_Date
  1    A 2015-01-01 2019-12-29
# ⋮    ⋮      ⋮           ⋮
  4    A 2012-04-16 2015-01-09
# ⋮    ⋮      ⋮           ⋮
  8    A 2010-01-01 2015-01-05

This 8th row serves as our , where is the 1st row and is the 4th row, as before. Indeed, the output properly classifies and and as belonging to the same group 1: ℛ .

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE
8:    A 2010-01-01 2015-01-05      1 FALSE

Likewise, the output properly Flags the 1st row, whose Start_Date is now matched by two End_Dates: in the 4th and 8th rows.

Cheers!

Thank you so much @Greg! This is very helpful and gets most of my problem solved! To clarify, when I said that I wanted to flag the cases where one observation's start date gets matched to two+ observations' end dates, I meant that I would want another column that indicates that. But that was not reflected in the OP and I wasn't clear, so I definitely understand! — EconNoobie, Jul 14 '21 at 14:48
thank you! I just have one follow-up question. The last one I promise. How should I go about flagging those cases where End_Date is matched by two+ Start_Dates as well? I'm trying to get those cases captured under the 'Flag' variable. — EconNoobie, Jul 14 '21 at 17:47
@EconNoobie I just figured it out, though it's a bit of a pain in the workflow. — Greg, Jul 14 '21 at 18:44
@EconNoobie Did you want a **separate flags** for (1) multiple matches to `Start_Date`, and for (2) multiple matches to `End_Date`? Or do you want a **single flag**? If you want a single flag, should it be "raised" when `Start_Date` or (`|`) `End_Date` has multiple matches? — Greg, Jul 14 '21 at 18:52
I am indifferent between the two. Whichever is easier for you/ to execute. — EconNoobie, Jul 14 '21 at 19:02
@EconNoobie I'll leave it up to you at the end of the workflow. In the last operation of **Step 3**, you'll be able to compute whatever `Flag` column(s) work(s) best for you, based off the `Start_Matched` and `End_Matched` columns that I will provide. `Start_Matched` counts how many times a row's `Start_Date` matched another row's `End_Date`, and `End_Matched` counts how many times a row's `End_Date` matched another row's `Start_Date`. — Greg, Jul 14 '21 at 19:07

Create group based on fuzzy criteria

1 Answers1

Approach

Solution

Result

Warning

Further Reading

Cheers!

Linked

Related