
I need to calculate the distance between two strings in R using sparklyr. Is there a way to use stringdist or another package? I want to use cosine distance, which is available as a method of the stringdist function.

Thanks in advance.

  • Are you referring to a Hamming distance? If so, you want to use the stringdist package. – C-x C-c Mar 02 '18 at 20:52
  • I was thinking of cosine distance. Either way, I really need to use the stringdist package, but it doesn't seem to work in sparklyr. I'm looking for a way to use it, or for a substitute for this package. – Daniel Limaviegas Mar 02 '18 at 21:00
  • Can you reproduce the attempt that isn't working? – C-x C-c Mar 02 '18 at 21:06

1 Answer

You can use Spark's built-in levenshtein function, which can be called directly inside mutate on a Spark table:

library(sparklyr)
library(dplyr)

# sc is an existing Spark connection, e.g. sc <- spark_connect(master = "local")
df <- copy_to(sc, data.frame(a=c("This is it", "Foo"), b=c("This is", "foobar")))

df %>% mutate(dist = levenshtein(a, b))
# # Source:   lazy query [?? x 3]
# # Database: spark_connection
#   a          b        dist
#   <chr>      <chr>   <int>
# 1 This is it This is     3
# 2 Foo        foobar      4
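The built-in function covers Levenshtein distance only. For metrics Spark does not ship, such as the cosine distance asked about, one possible route is `spark_apply()`, which runs an R function on each partition and so can call `stringdist` directly. The following is a minimal sketch, not a tested recipe for any particular cluster: it assumes a local Spark installation, that R and the stringdist package are available to the workers, and the names `local_pairs`, `string_pairs` and `df_cos` are made up for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark install; replace with your cluster's master URL.
sc <- spark_connect(master = "local")

# Hypothetical example data, mirroring the strings used above.
local_pairs <- data.frame(
  a = c("This is it", "Foo"),
  b = c("This is", "foobar"),
  stringsAsFactors = FALSE
)
df <- copy_to(sc, local_pairs, "string_pairs", overwrite = TRUE)

# Each partition arrives in the function as an ordinary R data.frame,
# so stringdist can be applied row-wise to the two columns.
df_cos <- spark_apply(df, function(part) {
  part$dist <- stringdist::stringdist(part$a, part$b, method = "cosine", q = 2)
  part
})

# Bring the (small) result back to the R session for inspection.
res <- collect(df_cos)
print(res)

spark_disconnect(sc)
```

Because the data is serialized to R processes on the workers, this is likely to be much slower than a native Spark function, so it is probably best kept for metrics that have no built-in equivalent.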
user8954262
  • Is there a way to use a non-built-in string distance metric with `sparklyr`? Such as Jaro-Winkler, available in this package: https://github.com/MrPowers/spark-stringmetric. – jfeigenbaum Jul 07 '19 at 14:28
  • @jfeigenbaum have you found a way to use a non-built-in string distance metric? – johnckane Apr 10 '20 at 19:46
  • @johnckane I didn't spend a lot of time on this but no... I never figured this out – jfeigenbaum Apr 11 '20 at 14:31
  • @jfeigenbaum if you're interested, I answered here how I ultimately did it in pyspark: https://stackoverflow.com/questions/57706352/use-external-library-in-pandas-udf-in-pyspark/61149452#61149452 – johnckane Apr 11 '20 at 20:41