How to find common strings across several files

Question

I have data like this:

df1<- structure(list(test = c("SNTM1", "STTTT2", "STOLA", "STOMQ", 
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3", 
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1")), class = "data.frame", row.names = c(NA, 
-17L))

df2<-structure(list(test = c("SNTLK", "STTTFSG", "STOIU", "STOMQ", 
"STR25", "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1")), class = "data.frame", row.names = c(NA, 
-10L))

and

df3<- structure(list(test = c("SNTLKM", "STTTFSGTT", "GFD", "STOMQ", 
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA, 
-10L))

I want to know how many strings are common across all these 3 data frames.

If it was two, I could do it with match and join function but I want to know how many are shared between df1 and df2 and df3 or a combination.

"similar" like in fuzzy matching or identical? – May 18 '22 at 22:14 — , May 18 '22 at 22:14

score 2 · Answer 1 · answered May 18 '22 at 22:33

2

example (if only identical strings count for duplicates):

library(dplyr)

df1 <- data.frame(test = c("A", "B", "C", "C"))
df2 <- data.frame(test = c("B", "C", "D"))
df3 <- data.frame(test = c("C", "D", "E"))

bind_rows(df1, df2, df3, .id = "origin") %>%
  group_by(origin) %>%
  distinct(test) %>% ## remove within-dataframe duplicates
  group_by(test) %>%
  summarise(replicates = n()) %>%
  filter(replicates > 1)

answered May 18 '22 at 22:33

why dont you use my data? – nik May 18 '22 at 22:35
This is a good answer. But I think the trick is to identify not identical -> similar strings?! – TarJae May 18 '22 at 22:35
@nik: to avoid distraction from the underlying question: https://stackoverflow.com/help/minimal-reproducible-example :-) – May 18 '22 at 22:38
1

@TarJae: I checked with nik in the comments that (only) identity was required (upvoted your extra mile though, nice one!) – May 18 '22 at 22:42

TarJae · Answer 2 · 2022-05-18T22:40:42.373

Here is an update in case only identical strings are wished:

library(dplyr)

bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>% 
  filter(duplicated(test) | duplicated(test, fromLast=TRUE))

   id   test
1 df1  STOMQ
2 df1 TMEIL1
3 df1  YIPF1
4 df2  STOMQ
5 df2 TMEIL1
6 df2  YIPF1
7 df3  STOMQ
8 df3  YIPF1

First answer: Here is a suggestion:

First bring all dataframes in a list of dataframes with an identifier and arrange by the the string. Now you could check visually:

library(dplyr)
x <- bind_rows(list(df1 = df1, df2 = df2, df3 = df3), .id = 'id') %>% 
  arrange(test)

To automate the process you have to use a kind of string distance, there are some different out there and I can't tell which one is better or more appropriate. One example is Jaccard_index https://en.wikipedia.org/wiki/Jaccard_index

Here we use the Jaro-Winkler distance: Learned here: How to group similar strings together in a database in R

in the group column you could find the similar strings:

You can define what does similar mean, by changing the value of "jw". Try and change it from 0.4 to 0.1 then you will see that the groups change:

library(tidyverse)
library(stringdist)

map_dfr(x$test, ~ {
  i <- which(stringdist(., x$test, "jw") < 0.40)
  tibble(index = i, title = x$test[i])
}, .id = "group") %>%
  distinct(index, .keep_all = T) %>% 
  mutate(group = as.integer(group)) +
  bind_cols(df_id = x$id)

  group index title  df_id
   <int> <int> <chr>  <chr>
 1     1     1 BRsts  df3  
 2     2     2 GFD    df3  
 3     3     3 RSEST  df3  
 4     3    31 TRS    df2  
 5     3    32 TRSF   df3  
 6     4     4 SNTLK  df1  
 7     4     5 SNTLKM df2  
 8     4     6 SNTM1  df1  
 9     4     8 STOLA  df1  
10     4    12 STR2   df2  
# ... with 27 more rows

how can I see which ones are similar in df1 and df2, which ones in df1, and df3 and which ones in d2 and df3 and which ones in df1,df2 and df3? similar means " identical " exact — nik, May 18 '22 at 22:20
Just add `bind_cols(df_id = x$id)` to the code. It will get the id that we created prior to distance measuring. I will update my code! — TarJae, May 18 '22 at 22:27
still it does not give me the exactly what I am looking for. Lets say df1 and df2 have several strings that are identical. I want to be able to know which strings are identical between df1 and df2. Then df2 and df3 and then df1 and df2 and df3. the one you showed does not reflect what I want — nik, May 19 '22 at 00:55

How to find common strings across several files

2 Answers2