As the title suggested, my goal is to do partial match between two factors (with comma seperated values) in seperate data frames that have at least two matching elements
I have two dataframes like this:
df1
structure(list(ID = c(55, 153, 274, 380, 34, 156), value = c("30002, 10057, 10012, 30045, 10065, 10207, 10013, 20056, 20024, 13026, 10032, 10031",
"10026, 10051, 10010, 10302", "10004, 10133, 10103", "10009, 10035",
"10003, 10202, 10319, 10421, 10025, 10033, 10045, 10036, 10049, 10055, 10062, 10069, 10083, 10086, 10089, 10090, 10099, 10100, 10102, 10103, 10112, 10114, 10120, 10125, 10126, 10128, 10144, 10148, 10149, 10150, 10158, 10159, 11330, 10035, 13508, 12003, 10124, 100266, 11302, 15305, 10240, 25024, 23003, 25204, 25343, 23058, 22007, 25278, 25204, 30117, 25346, 22324, 25325, 25133, 25229",
"11002, 11107, 13340, 10344")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -6L))
df2
```r
structure(list(ID = c(75, 412, 289, 214, 48, 222), value = c("30002, 10041, 10031, 20024, 13026",
"10026, 10040", "10004, 10133", "10023, 10025, 10314, 10143",
"10001, 10125, 10126, 10128",
"10012, 10020, 10344")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))
As shown above: 1) the ID is not in order and I sorted them by column "value", 2) number of elements in each row can be different, 3) element inside a "list" can be out of order 4) The original dataframe is large, so I think there might be multiple match, so I would like to output match count as well
***NOTE that here I ordered the dataset by "value" so it look like they are matched line by line but it's actually not the case if you look at the whole dataset, the goal is really to look for matches item by item.
I want my resulting df to return both ID of the dataframe and matching element if any two lists have at least two common elements: For example ID 55 in df1 and ID 75 in df1- would return something like
ID_1 ID_2 Matched_element Match_count
75 55 30002,20024,13026 3
I tried to use string split to make the variable value "list", but this still doesn't help with my partial match.
df1<-str_split(df1$value, ",")
df2<-str_split(df2$value, ",")
Number of Matches Between Two Comma Separated Factors in a Data Frame This question is very similar to the quetsion I'm asking, but can't solve my problem because it is matching on a row-by-row basis.