Is there a package that contains Levenshtein distance counting function which is implemented as a C or Fortran code? I have many strings to compare and stringMatch from MiscPsycho is too slow for this.

smci
mbq

4 Answers

stringdist in the stringdist package does this too, and under certain conditions it is even faster than levenshteinDist (1).
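For reference, here is a minimal sketch of how this might look, assuming the stringdist package is installed from CRAN. The sample words are made up for illustration; `method = "lv"` selects plain Levenshtein distance, and `stringdistmatrix` returns all pairwise distances between two vectors.

```r
library(stringdist)

# Made-up sample strings for illustration
a <- c("kitten", "flaw", "saturday")
b <- c("sitting", "lawn", "sunday")

# Element-wise Levenshtein distance: a[i] is compared with b[i]
d_pairs <- stringdist(a, b, method = "lv")

# All-against-all distances: rows follow a, columns follow b
d_all <- stringdistmatrix(a, b, method = "lv")

print(d_pairs)
print(d_all)
```

Both functions also take an `nthread` argument, which may help when there are many strings to compare.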

Ben
  • stringdist has sped up significantly since that blog you link to: it now uses multiple cores. – Feb 26 '16 at 17:02
levenshteinDist (from the RecordLinkage package) calls compiled C code. Give it a try.
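A minimal sketch of a call, assuming the RecordLinkage package is installed; the sample words are made up for illustration.

```r
library(RecordLinkage)

# Made-up sample strings for illustration
a <- c("kitten", "flaw")
b <- c("sitting", "lawn")

# Vectors are compared element-wise: a[i] against b[i]
d <- levenshteinDist(a, b)
print(d)
```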

MichaelChirico
gd047
  • Just noting the RecordLinkage package is apparently no longer maintained and has been pulled from CRAN. The `stringdist` package is the solution now. – Brian Stamper Feb 27 '20 at 17:42
  • Just noting the RecordLinkage package has *not* been pulled from CRAN; it is still available: https://cran.r-project.org/web/packages/RecordLinkage/ – MS Berends Aug 12 '22 at 19:41
You could try stringDist from Biostrings as well.
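A minimal sketch, assuming Biostrings is installed from Bioconductor (it is not on CRAN); the sample words are made up for illustration. Unlike the pairwise functions in the other answers, `stringDist` takes a single vector and returns the distances between all of its elements.

```r
# Biostrings is distributed through Bioconductor:
# install.packages("BiocManager"); BiocManager::install("Biostrings")
library(Biostrings)

# Made-up sample strings for illustration
words <- c("kitten", "sitting", "mitten")

# Returns a "dist" object holding all pairwise distances within the vector
d <- stringDist(words, method = "levenshtein")

# Convert to a full symmetric matrix for easier indexing
m <- as.matrix(d)
print(m)
```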

MichaelChirico
Aaron Statham
You could also use levenshtein_distance() from the textTinyR package. I got 'calloc' memory errors with all other packages when it came to larger character vectors of around 30k characters. Only textTinyR worked for me!

interrobang