Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
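
As a rough illustration of the three metric families and the approximate `match` described above, here is a minimal sketch. It assumes the stringdist package is installed from CRAN; the example strings are arbitrary and chosen only for illustration.

```r
# Minimal sketch; assumes the stringdist package is installed from CRAN.
library(stringdist)

# Edit-based: one insertion turns "abc abc" into "abcd abc".
d_lv <- stringdist("abc abc", "abcd abc", method = "lv")

# Heuristic: Jaro-Winkler (p is the prefix weight; p = 0 gives plain Jaro).
d_jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# q-gram based: Jaccard distance between the sets of 2-grams.
d_jac <- stringdist("night", "nacht", method = "jaccard", q = 2)

# amatch() is the approximate counterpart of match(): it returns the index
# of the closest table entry within maxDist, or NA if there is none.
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)

# Soundex codes via phonetic().
codes <- phonetic(c("Robert", "Rupert"))

print(c(lv = d_lv, jw = d_jw, jaccard = d_jac))
print(idx)
print(codes)
```

Each call is vectorised, so the same functions work on whole character vectors; for all pairwise distances, `stringdistmatrix()` is the usual choice.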

162 questions

4 answers

Fast Levenshtein distance in R?

Is there a package that contains a Levenshtein distance function implemented in C or Fortran? I have many strings to compare, and stringMatch from MiscPsycho is too slow for this.

asked Jul 05 '10 at 20:50

mbq

4 answers

How to know the operations made to calculate the Levenshtein distance between strings?

With the function stringdist, I can calculate the Levenshtein distance between strings: it counts the number of deletions, insertions and substitutions necessary to turn one string into another. For instance, stringdist("abc abc","abcd abc") = 1…

r string levenshtein-distance stringdist

asked Jun 30 '19 at 20:17

yaki

0 answers

Fuzzy merging in R - seeking help to improve my code

Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself which combines exact and fuzzy (by string distances) matching. The merging job I have to do is quite big (resulting in multiple string distance…

r parallel-processing data.table fuzzy-comparison stringdist

asked Apr 04 '15 at 17:38

chameau13

2 answers

R: producing a list of near matches with stringdist and stringdistmatrix

I discovered the excellent package "stringdist" and now want to use it to compute string distances. In particular I have a set of words, and I want to print out near-matches, where "near match" is through some algorithm like the Levenshtein…

r string matrix stringdist

asked Jul 18 '15 at 01:34

vielmetti

2 answers

R: Group Similar Addresses Together

I have a 400,000 row file with manually entered addresses which need to be geocoded. There's a lot of different variations of the same addresses in the file, so it seems wasteful to be using API calls for the same address multiple times. To cut down…

r dplyr tidyverse stringdist qdap

asked Sep 10 '20 at 19:26

rsylatian

3 answers

How to use custom SQL function in dbplyr?

I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package. But my data is very large and I'd like to filter on…

r stringdist dbplyr

asked Jun 02 '18 at 22:57

jfeigenbaum

0 answers

Using stringdist_left_join to join by multiple columns, but not all of them fuzzy

I have a 1.3 million-row dataset of publications and, for each record, I want to retrieve a paper_id from a second dataset with 8.6 million rows. The idea is to use multiple columns from both tables to find matches for dataset1 in dataset2 as shown…

r stringdist fuzzyjoin

asked Feb 23 '21 at 11:58

André Brasil

2 answers

Matching strings with abbreviations; fuzzy matching

I am having trouble matching character strings. Most of the difficulty centers on abbreviations. I have two character vectors. I am trying to match words in vector A (typos) to the closest match in vector B. vec.a <- c("ce", "amer", "principl") vec.b…

r string stringr fuzzy stringdist

asked Mar 07 '22 at 16:01

YouLocalRUser

1 answer

How to calculate distance between strings using sparklyr?

I need to calculate the distance between two strings in R using sparklyr. Is there a way of using stringdist or any other package? I want to use the cosine distance, which is available as a method of the stringdist function. Thanks in advance.

r sparklyr stringdist

asked Mar 02 '18 at 20:49

Daniel Limaviegas

0 answers

Reduce memory usage with stringdistmatrix

I have a data.table dt of 9k rows (see sample below). I need to compare each rname of dt to each cname of a reference data.table dt.ref. By comparing, I mean computing the Levenshtein ratio. Then, I take the maximum and get my output (see…

r matrix data.table out-of-memory stringdist

asked Aug 09 '17 at 11:06

user2590177

1 answer

Calculate Jaccard similarity between each words in 2 vectors

I need to calculate the Jaccard similarity between each word in 2 vectors, each word by each word, and extract the most similar word. Here is my bad, slow code: txt1 <- c('The quick brown fox jumps over the lazy dog') txt2 <- c('Te quick foks jump ovar…

r stringdist

asked Nov 25 '16 at 11:17

Dennix

0 answers

how to deal duplicated chars in common strings when applying Jaro String Similarity algorithm

I am struggling with the definition of a common string between two strings when applying the Jaro string similarity algorithm. Say we have s1 = 'profjohndoe' and s2 = 'drjohndoe'. By Jaro similarity, the half length is floor(11/2) - 1 = 4, defined by the…

r string algorithm jaro-winkler stringdist

asked Sep 23 '14 at 09:01

Haochuan Zhou

2 answers

How to get nearest matching string along with score from column from another table?

I am trying to get the nearest matching string along with the score by using the "stringdist" package with method = "jw" (Jaro-Winkler). The first data frame (df_1) consists of 2 columns, and I want to get the nearest string from str_2 from df_2 and the score for that…

r stringdist

asked Aug 04 '21 at 09:49

san1

1 answer

Efficient way to handle string similarity?

I got stuck on some string similarity issues. This is what my data looks like (the original data is huge): SerialNumber SubSerialID Date AGCC0775CFNDA1040TMT775 AVCC0775CFNDA1040 2018/01/08 AGCC0775CFNDA1040 …

r stringdist

asked Feb 17 '20 at 23:27

mimibao1009

0 answers

Standardize strings in R

I have a data set with some brand names. By preprocessing (e.g., lowercasing, removing stopwords, trimming whitespace, string splitting, etc.) I got from 320 distinct cases in the beginning down to 114. However, there is still some room for improvement.…

r stringdist

asked Dec 16 '19 at 14:25

Banjo
