
For the purposes of the reprex I've generated a tibble called random_DNA_tbl that holds a random selection of 10 DNA sequences (100 bases each). I've got a separate tibble called subseq_tbl with 3 shorter sequences, each of which matches 100% to one of the sequences in random_DNA_tbl, but I'd also like to fuzzy match sequences from subseq_tbl to other sequences in random_DNA_tbl. I was hoping to be able to use the stringdist_XX_join functions from the fuzzyjoin package; however, these don't seem to work, even though the subseq sequences are actually perfect matches and do work with other matching functions, e.g. regex_XX_join.

library(tidyverse)
library(fuzzyjoin)
random_DNA_tbl <- structure(list(random_name = c("random_seq_1", "random_seq_2", 
                               "random_seq_3", "random_seq_4", "random_seq_5", "random_seq_6", 
                               "random_seq_7", "random_seq_8", "random_seq_9", "random_seq_10"
), random_seq = c("CTCCAGTATTAGTCAATGATAAGGGCGAAGGAGCAGTTCTGATATCTCTGTGAAGTAGCATGCGTCTGACTCTCGGGCGCGGCGGAAGACCGAGGAGCGC", 
                  "TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA", 
                  "GAATAGTGTACCAGGTCTTATAGTATGTTCATTCGTACAAAAGGATCCAAAACCAATAGGAACCGCTTCTCCCAACAAGCCTGCTCCTTGCAGAGTGAGT", 
                  "GTGACGCCAGATTCTTGACCTGAACCCAGTTCTACCCCCCCAAAACGATCTGGCTTCCGCTCTCTAATGACAGCTATATTGCTTGATAGAGATCGGTAGG", 
                  "ACCGCCTTCCGTAGGTGAACAACCAGCCTCCTGCGGCCAGGGAAGAAGTCGTGGCCTTGGTTAATTTTGGGTTACTAAACGGACACCCACCGTGGCTCAC", 
                  "ACGACTATCAAGACAACTTGTCTCAGAGCTTCACGCACCAACCCCTAACCCAGCAACTCCAGGGCATTGCCACTCTATGATTCGGCGCGGGTGCGCCCTC", 
                  "GGTAGCACTGAGATCAGCCACTATCAAGGTGCTCCTCACTTCTGGTTCTCAGGTTGCGGGCCGATCATTTTTCTCCGAATTAGCGGTCTTTCACGTCAGA", 
                  "CACTGAATAGTCAGCGTAAAGGCGTCAATCTGTCAGCTCGACGGCAGAAGATGTCCAGCGTGCAGTTTCATAGGCGCCCCGGGGAACCTTCTGTGAGAAT", 
                  "GCCTCTTAATTCTTGAACCGCGAGAGGACACAGTGAGATCTGTTCCATTTCCCCCGTTGCCCGCATGGATCGCCCAGACTCTAGACTTAGTGTGACCTTT", 
                  "CGGTATCGGATTGGTCTACGAATCCGCGACCCTCAAGGTTATTTCTGGATGGAGTTCCGTGCTCGCCTGGATGCACTGCCCAAGCAATTAGGACGAAGTA"
)), .Names = c("random_name", "random_seq"), row.names = c(NA, 
                                                           -10L), class = c("tbl_df", "tbl", "data.frame"))

subseq_tbl <- structure(list(subseq_name = c("subseq1", "subseq2", "subseq3"
), subseq = c("TCAACCATCAAGATAAC", "TAGCGGTCTTTCACGT", "AAGGATCC"
)), .Names = c("subseq_name", "subseq"), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))

Doesn't work:

stringdist_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))

Does work:

regex_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))

I've tried tweaking the max_dist parameter in stringdist but to no avail. Can anyone shed any light on the problem please?

biomiha
  • Have you checked out `?stringdist::stringdist`, and reviewed what all the different methods are? The default, 'osa', is not really appropriate for matching subsequences -- unlike regex. – David Klotz Feb 15 '18 at 18:35
  • Also take a look at this (much more informative than my comment): https://stackoverflow.com/questions/32914357/dplyr-inner-join-with-a-partial-string-match . The important takeaway is you can do this as a fuzzy alternative: `fuzzy_left_join(random_DNA_tbl, subseq_tbl, by = c('random_seq' = "subseq"), match_fun = stringr::str_detect)`. Of course `str_detect` uses regex, so not sure if there's any advantage there. – David Klotz Feb 15 '18 at 18:39
  • Thanks for alerting me to the SO question. It's strikingly similar (not sure how I missed it, to be honest), with one important distinction, however. The OP in that one is looking for a partial 100% match, whereas I need fuzzy matching. Can you pass arguments to match_fun? – biomiha Feb 15 '18 at 21:36
  • If you know how specifically you want to define your fuzzy match, you can write a custom function. In this format, you just need to create a function that returns a logical: TRUE (fuzzy match works), FALSE (fuzzy match fails) – David Klotz Feb 15 '18 at 21:39
  • Yeah, I figured. Still not sure why stringdist doesn't work out of the box. My problem should fit the described use case in the vignette. – biomiha Feb 15 '18 at 21:41
  • Try your code, but set `max_dist = 85`, then `max_dist = 80`. By definition, the 'osa' string distance will be large if the substring is much smaller than the string. Other methods might get you better results (e.g. 'jaccard', 'jw'), or maybe not since you're only dealing with four characters. – David Klotz Feb 15 '18 at 21:48
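
To make the last comment concrete, here is a minimal sketch of why a whole-string distance is large for a perfect substring match. It uses base R's `adist` (generalized Levenshtein) as a stand-in for stringdist's 'osa' method, which is an assumption on my part; for a pure substring both metrics should reduce to the difference in lengths, which works out to 83 here if the sequence really is 100 bases as stated. The variable names are illustrative only.

```r
# Sequence and subsequence copied from the reprex (random_seq_2 and subseq1).
full_seq <- "TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA"
sub_seq  <- "TCAACCATCAAGATAAC"

# Whole-string edit distance: every base outside the matching region
# has to be deleted, so the distance equals the length difference.
whole_dist <- drop(adist(sub_seq, full_seq))

# Substring-aware distance: sub_seq only has to match somewhere inside full_seq.
partial_dist <- drop(adist(sub_seq, full_seq, partial = TRUE))

whole_dist == nchar(full_seq) - nchar(sub_seq)  # TRUE
partial_dist == 0                               # TRUE
```

This is consistent with the suggestion that `max_dist = 85` produces matches while `max_dist = 80` does not: a threshold below the length difference can never be met by a whole-string metric, however good the local match is.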

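The custom-function route suggested in the comments could look roughly like the sketch below. `fuzzy_contains` and `max_edits` are hypothetical names, not part of fuzzyjoin; the helper relies on base R's `adist(..., partial = TRUE)` to allow a limited number of edits anywhere inside the longer sequence. The join itself is guarded so the block still runs when fuzzyjoin or the reprex tibbles are not available.

```r
# Hypothetical helper (not part of fuzzyjoin): TRUE when `pattern` occurs
# somewhere inside `seq` with at most `max_edits` edits.
fuzzy_contains <- function(seq, pattern, max_edits = 1) {
  mapply(
    function(s, p) drop(adist(p, s, partial = TRUE)) <= max_edits,
    seq, pattern,
    USE.NAMES = FALSE
  )
}

# Exact hit, one substitution, and no plausible hit.
fuzzy_contains(c("AAACGTAAA", "AAACCTAAA", "TTTT"), "ACGT")
# TRUE TRUE FALSE

# Assumes the tibbles from the question exist in the session.
if (requireNamespace("fuzzyjoin", quietly = TRUE) &&
    exists("random_DNA_tbl") && exists("subseq_tbl")) {
  fuzzyjoin::fuzzy_left_join(
    random_DNA_tbl, subseq_tbl,
    by = c(random_seq = "subseq"),
    match_fun = fuzzy_contains
  )
}
```

To use a different tolerance, one option is to wrap the helper, e.g. `match_fun = function(x, y) fuzzy_contains(x, y, max_edits = 2)`, which fixes the extra argument without depending on how the join forwards parameters.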