An algorithm for measuring the similarity of two strings, often used for duplicate detection.
Questions tagged [jaro-winkler]
78 questions
110
votes
1 answer
Difference between Jaro-Winkler and Levenshtein distance?
I want to do fuzzy matching of millions of records from multiple files. I identified two algorithms for that: Jaro-Winkler and Levenshtein edit distance.
I was not able to understand what the difference is between the two. It seems Levenshtein gives…

Bhavesh Shah
- 3,299
- 11
- 49
- 73
43
votes
2 answers
Compare similarity algorithms
I want to use string similarity functions to find corrupted data in my database.
I came upon several of them:
Jaro,
Jaro-Winkler,
Levenshtein,
Euclidean and
Q-gram,
I wanted to know what is the difference between them and in what situations…

Ali
- 808
- 2
- 11
- 20
24
votes
4 answers
Jaro–Winkler distance algorithm in C#
How would the Jaro–Winkler distance string comparison algorithm be implemented in C#?

leebickmtu
- 1,565
- 2
- 12
- 20
15
votes
2 answers
jellyfish vs pyjarowinkler
I am trying to use the Jaro-Winkler similarity distance to see if two strings are similar. I tried using both the libraries to compare the words carol and elephant. The results are not similar:
import…

turtle_in_mind
- 986
- 1
- 18
- 36
14
votes
6 answers
Optimizing Jaro-Winkler algorithm
I have this code for Jaro-Winkler algorithm taken from this website. I need to run 150,000 times to get distance between differences. It takes a long time, as I run on an Android mobile device.
Can it be optimized more?
public class Jaro {
/**
…

Pentium10
- 204,586
- 122
- 423
- 502
10
votes
4 answers
String Distance Matrix in Python using pdist
How to calculate Jaro Winkler distance matrix of strings in Python?
I have a large array of hand-entered strings (names and record numbers) and I'm trying to find duplicates in the list, including duplicates that may have slight variations in…

Mark W
- 103
- 1
- 6
7
votes
2 answers
Abbreviation similarity between strings
I have a use case in my project where I need to compare a key-string with a lot many strings for similarity. If this value is greater than a certain threshold, I consider those strings "similar" to my key and based on that list, I do some further…

vish4071
- 5,135
- 4
- 35
- 65
6
votes
3 answers
What string distance algorithm is best for measuring typing accuracy?
I'm trying to write a function that detects how accurate the user typed a particular phrase/sentence/word/words. My objective is to build an app to train the user's typing accuracy of certain phrases.
My initial instinct is to use the basic…

adrianmcli
- 1,956
- 3
- 21
- 49
5
votes
2 answers
Jaro-Winkler Distance Algorithm in .NET
Is there any LGPL or commercial-friendly licensed implementation of Jaro-Winkler distance in .NET?

dr. evil
- 26,944
- 33
- 131
- 201
4
votes
1 answer
How to group similar strings together in a database in R
I have a tibble of just 1 column called 'title'.
> dat
# A tibble: 13 x 1
title
1 lymphoedema clinic
2 zostavax shingles…

Yeshyyy
- 669
- 6
- 21
4
votes
3 answers
Choosing Levenshtein vs Jaro Winkler?
I'm doing an application that computers a large list of brands/domains and detects variations from pre-determined keywords.
Examples:
facebook vs facebo0k.com
linkedIn vs linkedln.com
stackoverflow vs stckoverflow
I'm wondering if for the simply…

Andre
- 598
- 1
- 7
- 18
4
votes
2 answers
How to get an accurate JOIN using Fuzzy matching in Oracle
I'm trying to join a set of county names from one table with county names in another table. The issue here is that, the county names in both tables are not normalized. They are not same in count; also, they may not be appearing in similar pattern…

Dav KR
- 51
- 1
- 4
4
votes
3 answers
Python performance improvement request for winkler
I'm a python n00b and I'd like some suggestions on how to improve the algorithm to improve the performance of this method to compute the Jaro-Winkler distance of two names.
def winklerCompareP(str1, str2):
"""Return approximate string comparator…

Martlark
- 14,208
- 13
- 83
- 99
4
votes
0 answers
how to deal duplicated chars in common strings when applying Jaro String Similarity algorithm
I am struggling the definition of common string between two strings when applying Jaro string similarity algorithm.
say we have
s1 = 'profjohndoe'
s2 = 'drjohndoe'
BY Jaro similarity, the half length is floor(11/2) - 1 = 4, defined by the…

Haochuan Zhou
- 41
- 2
3
votes
0 answers
Jarowinkler as a loadable extension to SQLite
I was wondering if anyone has implemented the Jarowinkler function as a loadable extension to SQLite.
I am looking for an equivalent to the " SQLite-Levenshtein". A great implementation of the levenstehein distance as an SQLite loadable extenstion…

Mike
- 31
- 2