1

As the title suggested, my goal is to do partial match between two factors (with comma seperated values) in seperate data frames that have at least two matching elements

I have two dataframes like this:

df1

structure(list(ID = c(55, 153, 274, 380, 34, 156), value = c("30002, 10057, 10012, 30045, 10065, 10207, 10013, 20056, 20024, 13026, 10032, 10031", 
"10026, 10051, 10010, 10302", "10004, 10133, 10103", "10009, 10035", 
"10003, 10202, 10319, 10421, 10025, 10033, 10045, 10036, 10049, 10055, 10062, 10069, 10083, 10086, 10089, 10090, 10099, 10100, 10102, 10103, 10112, 10114, 10120, 10125, 10126, 10128, 10144, 10148, 10149, 10150, 10158, 10159, 11330, 10035, 13508, 12003, 10124, 100266, 11302, 15305, 10240, 25024, 23003, 25204, 25343, 23058, 22007, 25278, 25204, 30117, 25346, 22324, 25325, 25133, 25229", 
"11002, 11107, 13340, 10344")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -6L))

df2
```r
structure(list(ID = c(75, 412, 289, 214, 48, 222), value = c("30002, 10041, 10031, 20024, 13026", 
"10026, 10040", "10004, 10133", "10023, 10025, 10314, 10143", 
"10001, 10125, 10126, 10128", 
"10012, 10020, 10344")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))

As shown above: 1) the ID is not in order and I sorted them by column "value", 2) number of elements in each row can be different, 3) element inside a "list" can be out of order 4) The original dataframe is large, so I think there might be multiple match, so I would like to output match count as well

***NOTE that here I ordered the dataset by "value" so it look like they are matched line by line but it's actually not the case if you look at the whole dataset, the goal is really to look for matches item by item.

I want my resulting df to return both ID of the dataframe and matching element if any two lists have at least two common elements: For example ID 55 in df1 and ID 75 in df1- would return something like

ID_1 ID_2  Matched_element     Match_count
75   55    30002,20024,13026   3

I tried to use string split to make the variable value "list", but this still doesn't help with my partial match.

df1<-str_split(df1$value, ",")
df2<-str_split(df2$value, ",")

Number of Matches Between Two Comma Separated Factors in a Data Frame This question is very similar to the quetsion I'm asking, but can't solve my problem because it is matching on a row-by-row basis.

dangodango
  • 29
  • 2
  • 3
    please, output the dataframes using `dput` – Omar Abd El-Naser Jul 27 '19 at 00:45
  • *"As shown above: 1) the ID is completely scrambled"* I'm not sure I understand what you mean by scrambled. It's also not clear to me what the `"..."` denote in your second dataset. Do you actually have rows that contain only a set of dots? I second the previous commenter's request to post minimal data in a reproducible and unambiguous way using `dput`. If unsure, take a look at how to provide a [minimal reproducible example/attempt](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Maurits Evers Jul 27 '19 at 03:09
  • Hi @djangodjango, welcome to Stack! As other folks have mentioned, if you can do ```dput(df1)``` and post that in your question (use the "edit" option), that will greatly help others try to answer your question. Trying to recreate your first one, I'd start with something like this ```ID <- c(1,2,3)```, ```Value <- c("27003, 47013, 17010, 17129, 75339, 17140", "10325, 10049, 10002, 10009, 10010", "19007, 19000")```, ```df1 <- data.frame(ID, Value)``` but does your dataframe have factors, etc? – Russ Thomas Jul 27 '19 at 03:40
  • And if you have a large dataframe and just want to post a certain number of lines of it, you can use ```head()``` as well. So to provide the first 3 lines of your dataframe you would use ```dput(head(df1,3))``` – Russ Thomas Jul 27 '19 at 03:48
  • @RussThomas It doesn't have factors. I have edited the part, Thanks for your suggestion! – dangodango Jul 27 '19 at 18:03
  • @MauritsEvers I have rephrase the part. it simply means that I can't match by ID because the data is not in order in any way. – dangodango Jul 27 '19 at 18:04
  • @dangodango is for loop good with you ? – Omar Abd El-Naser Jul 27 '19 at 22:49
  • @OmarAbdEl-Naser I still want to try fuzzy-join, but I don't think computing time is an issue for me for now-- so for loop is fine! – dangodango Jul 27 '19 at 23:06
  • @dangodango i'll try to remove the for loop with another functions, it's just trial to prove the idea – Omar Abd El-Naser Jul 27 '19 at 23:08
  • @dangodango There are inconsistencies in your example output; that aside, this seems to be in essence a straightforward inner join. Please see my post below. – Maurits Evers Jul 28 '19 at 10:31

2 Answers2

2

I don't quite get your expected output. There seem to be inconsistencies/typos

  1. You say that " For example ID 55 in df1 and ID 75 in df1- ", yet in your expected output ID_2 = 55 and ID_1 = 75. Shouldn't it be ID_1 = 55 (since it's coming from df1) and ID_2 = 75 (from df2)?
  2. Why is there no 10031 value in the example output you give? value = 10031 is in both df1 and df2 for ID_1 = 55 and ID_2 = 75.

The inconsistencies aside, this seems to be a fairly straightforward inner_join:

library(tidyverse)
inner_join(
    df1 %>% separate_rows(value),
    df2 %>% separate_rows(value),
    by = "value", suffix = c("_1", "_2")) %>%
    group_by(ID_1, ID_2) %>%
    summarise(value = toString(value))
## A tibble: 7 x 3
## Groups:   ID_1 [5]
#   ID_1  ID_2 value
#  <dbl> <dbl> <chr>
#1    34    48 10125, 10126, 10128
#2    34   214 10025
#3    55    75 30002, 20024, 13026, 10031
#4    55   222 10012
#5   153   412 10026
#6   156   222 10344
#7   274   289 10004, 10133

Explanation: We use separate_rows to separate comma-separated entries in column value, then perform an inner_join on value; we then group_by ID_1 and ID_2 and summarise entries by comma-concatenating values using toString.


To select only those rows where you have at least 3 matching values you can do

inner_join(
    df1 %>% separate_rows(value),
    df2 %>% separate_rows(value),
    by = "value", suffix = c("_1", "_2")) %>%
    group_by(ID_1, ID_2) %>%
    filter(length(value) > 2) %>%
    summarise(
        matched_element = toString(value),
        match_count = length(value))
## A tibble: 2 x 4
## Groups:   ID_1 [2]
#   ID_1  ID_2 matched_element            match_count
#  <dbl> <dbl> <chr>                            <int>
#1    34    48 10125, 10126, 10128                  3
#2    55    75 30002, 20024, 13026, 10031           4
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Thank you so much! I didn't know inner join can work :) Yes, you were right, the result has some typo, but this work like a charm – dangodango Jul 28 '19 at 14:25
  • 1
    @dangodango please thump the answer up and make it as answer – Omar Abd El-Naser Jul 28 '19 at 20:28
  • @dangodango As Omar Abd El-Naser suggests, check out the StackOverflow help center article on [What should I do when someone answers my question](https://stackoverflow.com/help/someone-answers). – Russ Thomas Jul 28 '19 at 23:28
0

I'm using nested map

library(stringr)
df1 <- structure(list(ID = c(55, 153, 274, 380, 34, 156), value = c("30002, 10057, 10012, 30045, 10065, 10207, 10013, 20056, 20024, 13026, 10032, 10031", 
                                                             "10026, 10051, 10010, 10302", "10004, 10133, 10103", "10009, 10035", 
                                                             "10003, 10202, 10319, 10421, 10025, 10033, 10045, 10036, 10049, 10055, 10062, 10069, 10083, 10086, 10089, 10090, 10099, 10100, 10102, 10103, 10112, 10114, 10120, 10125, 10126, 10128, 10144, 10148, 10149, 10150, 10158, 10159, 11330, 10035, 13508, 12003, 10124, 100266, 11302, 15305, 10240, 25024, 23003, 25204, 25343, 23058, 22007, 25278, 25204, 30117, 25346, 22324, 25325, 25133, 25229", 
                                                             "11002, 11107, 13340, 10344")), class = c("tbl_df", "tbl", 
                                                                                                       "data.frame"), row.names = c(NA, -6L))

df2 <- structure(list(ID1 = c(75, 412, 289, 214, 48, 222), value1 = c("30002, 10041, 10031, 20024, 13026", 
                                                             "10026, 10040", "10004, 10133", "10023, 10025, 10314, 10143", 
                                                             "10001, 10125, 10126, 10128", 
                                                             "10012, 10020, 10344")), class = c("tbl_df", "tbl", "data.frame"
                                                             ), row.names = c(NA, -6L))
#Change value column into a list of numeric values
df1 <-  df1 %>% mutate(x = map(value,function(x) (as.numeric(unlist(str_split(x,","))))))
df2 <-  df2 %>% mutate(x1 = map(value1,function(x) (as.numeric(unlist(str_split(x,","))))))

#Combine dataframes
df <- cbind(df1,df2)

Final_Data <- enframe(map(df$x,~ map(df$x1,~.y[.y %in% .x],.y = .x))) %>% unnest() %>%  
        mutate(ID_1 = rep(df$ID,each = 6),ID_2 = rep(df$ID1,times=6),
               Length = lengths(value) ) %>% filter(Length > 2 )