Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of Soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
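As a rough illustration of the distance families listed above, the following is a minimal sketch that assumes the stringdist package is installed. The input strings and the variable names (`words`, `reference`, `d_lv`, and so on) are made up for this example and do not come from any of the questions below.

```r
library(stringdist)

# Edit-based: Levenshtein distance (one insertion turns "abc abc" into "abcd abc").
d_lv <- stringdist("abc abc", "abcd abc", method = "lv")

# Edit-based: Hamming distance counts differing positions in equal-length strings.
d_hamming <- stringdist("karolin", "kathrin", method = "hamming")

# q-gram based: Jaccard distance on 2-grams.
d_jaccard <- stringdist("stringdist", "stringdits", method = "jaccard", q = 2)

# Heuristic: Jaro-Winkler distance (p is the prefix weight).
d_jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# All pairwise distances between two character vectors (hypothetical data).
words <- c("foo", "bar")
reference <- c("foo", "baz")
d_matrix <- stringdistmatrix(words, reference, method = "lv")

# Approximate counterpart of match(): index of the closest entry within maxDist.
idx <- amatch("leia", c("uhura", "leela"), maxDist = 5)

# Soundex codes; names that sound alike should map to the same code.
codes <- phonetic(c("Robert", "Rupert"))
```

Here `stringdist()` returns one distance per pair of inputs, `stringdistmatrix()` returns a matrix with one row per element of `words` and one column per element of `reference`, and `amatch()` returns a position in the lookup vector (or `NA` if nothing is within `maxDist`).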

162 questions
34 votes, 4 answers

Fast Levenshtein distance in R?

Is there a package that contains a Levenshtein distance function implemented in C or Fortran? I have many strings to compare, and stringMatch from MiscPsycho is too slow for this.
mbq
10 votes, 4 answers

How to know the operations made to calculate the Levenshtein distance between strings?

With the function stringdist, I can calculate the Levenshtein distance between strings: it counts the number of deletions, insertions and substitutions necessary to turn one string into another. For instance, stringdist("abc abc","abcd abc") = 1…
yaki
10 votes, 0 answers

Fuzzy merging in R - seeking help to improve my code

Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself that combines exact and fuzzy (by string distances) matching. The merging job I have to do is quite big (resulting in multiple string distance…
9 votes, 2 answers

R: producing a list of near matches with stringdist and stringdistmatrix

I discovered the excellent package "stringdist" and now want to use it to compute string distances. In particular, I have a set of words, and I want to print out near-matches, where a "near match" is determined by some algorithm like the Levenshtein…
vielmetti
8 votes, 2 answers

R: Group Similar Addresses Together

I have a 400,000-row file with manually entered addresses which need to be geocoded. There are many variations of the same addresses in the file, so it seems wasteful to use API calls for the same address multiple times. To cut down…
rsylatian
6 votes, 3 answers

How to use custom SQL function in dbplyr?

I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package. But my data is very large and I'd like to filter on…
jfeigenbaum
5 votes, 0 answers

Using stringdist_left_join to join by multiple columns, but not all of them fuzzy

I have a 1.3 million-row dataset of publications and, for each record, I want to retrieve a paper_id from a second dataset with 8.6 million rows. The idea is to use multiple columns from both tables to find matches for dataset1 in dataset2 as shown…
4 votes, 2 answers

Matching strings with abbreviations; fuzzy matching

I am having trouble matching character strings. Most of the difficulty centers on abbreviations. I have two character vectors. I am trying to match words in vector A (typos) to the closest match in vector B. vec.a <- c("ce", "amer", "principl") vec.b…
YouLocalRUser
4 votes, 1 answer

How to calculate distance between strings using sparklyr?

I need to calculate the distance between two strings in R using sparklyr. Is there a way of using stringdist or any other package? I want to use cosine distance, which is available as a method of the stringdist function. Thanks in advance.
4 votes, 0 answers

Reduce memory usage with stringdistmatrix

I have a data.table dt of 9k rows (see sample below). I need to compare each rname of dt to each cname of a reference data.table dt.ref. By comparing, I mean computing the Levenshtein ratio. Then, I take the maximum and get my output (see…
user2590177
4 votes, 1 answer

Calculate Jaccard similarity between each pair of words in 2 vectors

I need to calculate the Jaccard similarity between each pair of words in 2 vectors, word by word, and extract the most similar word. Here is my bad, slow code: txt1 <- c('The quick brown fox jumps over the lazy dog') txt2 <- c('Te quick foks jump ovar…
Dennix
4 votes, 0 answers

How to deal with duplicated characters in common strings when applying the Jaro string similarity algorithm

I am struggling with the definition of a common string between two strings when applying the Jaro string similarity algorithm. Say we have s1 = 'profjohndoe' and s2 = 'drjohndoe'. By Jaro similarity, the half length is floor(11/2) - 1 = 4, defined by the…
3 votes, 2 answers

How to get nearest matching string along with score from column from another table?

I am trying to get the nearest matching string along with the score by using the "stringdist" package with method = "jw" (Jaro-Winkler). The first data frame (df_1) consists of 2 columns, and I want to get the nearest string from str_2 from df_2 and score for that…
san1
3 votes, 1 answer

Efficient way to handle string similarity?

I got stuck on some string similarity issues. This is what my data looks like (the original data is huge): SerialNumber SubSerialID Date AGCC0775CFNDA1040TMT775 AVCC0775CFNDA1040 2018/01/08 AGCC0775CFNDA1040 …
3 votes, 0 answers

Standardize strings in R

I have a data set with some brand names. By preprocessing (e.g., lowercasing, removing stopwords, trimming whitespace, string splitting, etc.) I got from 320 distinct cases in the beginning to 114. However, there is still some room for improvement.…
Banjo