Questions tagged [fuzzy-comparison]

Fuzzy comparison is the colloquial name for approximate string matching, the technique of finding strings that match a pattern approximately rather than exactly. The problem is typically divided into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that approximately match the pattern.
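As a rough illustration of the two sub-problems, here is a minimal Python sketch. It is only one possible approach, not a reference implementation: the dictionary lookup relies on the standard library's `difflib`, and the substring search is a plain dynamic-programming edit distance in which a match may start anywhere in the text. The function names `similarity`, `min_substring_distance` and `closest_dictionary_words` are made up for this example.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0.0 and 1.0 (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def min_substring_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`.

    Sub-problem 1: approximate substring matching. This is the usual
    edit-distance table, except that the cost of starting a match is zero
    at every position of `text`.
    """
    # prev[i] = cost of matching pattern[:i] against a substring that ends
    # at the previous text position (initially: the empty substring).
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        curr = [0]  # an empty pattern prefix matches anywhere for free
        for i, p in enumerate(pattern, start=1):
            curr.append(min(
                prev[i] + 1,              # extra character in the text
                curr[i - 1] + 1,          # pattern character left unmatched
                prev[i - 1] + (p != ch),  # match or substitution
            ))
        best = min(best, curr[-1])
        prev = curr
    return best


def closest_dictionary_words(pattern: str, dictionary: list[str], n: int = 3) -> list[str]:
    """Sub-problem 2: dictionary strings that approximately match `pattern`."""
    return get_close_matches(pattern, dictionary, n=n, cutoff=0.6)


if __name__ == "__main__":
    print(similarity("Some Company Limited", "SOME COMPANY LTD!"))
    print(min_substring_distance("yrk", "new york city"))
    print(closest_dictionary_words("appel", ["ape", "apple", "peach", "puppy"]))
```

The `cutoff=0.6` threshold is the `difflib` default and is an arbitrary choice here; real data usually needs tuning, and dedicated libraries may be faster on large inputs.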


361 questions
232 votes, 12 answers

Good Python modules for fuzzy string comparison?

I'm looking for a Python module that can do simple fuzzy string comparisons. Specifically, I'd like a percentage of how similar the strings are. I know this is potentially subjective so I was hoping to find a library that can do positional…
Soviut
90 votes, 5 answers

Fuzzy String Comparison

What I am striving to complete is a program which reads in a file and compares each sentence against the original sentence. The sentence which is a perfect match to the original will receive a score of 1 and a sentence which is the total…
jacksonstephenc
52 votes, 6 answers

Fuzzy Regular Expressions

In my work I have used approximate string matching algorithms such as Damerau–Levenshtein distance, with great results, to make my code less vulnerable to spelling mistakes. Now I need to match strings against simple regular expressions such as TV…
Thomas Ahle
50 votes, 4 answers

Techniques for finding near duplicate records

I'm attempting to clean up a database that, over the years, has acquired many duplicate records with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!". My plan was to…
Richie Cotton
42 votes, 7 answers

How can I match fuzzy match strings from two datasets?

I've been working on a way to join two datasets based on an imperfect string, such as the name of a company. In the past I had to match two very dirty lists: one list had names and financial information, the other had names and addresses. Neither had…
A L
29 votes, 10 answers

Fuzzy regular expressions

I am looking for a way to do a fuzzy match using regular expressions. I'd like to use Perl, but if someone can recommend any way to do this that would be helpful. As an example, I want to match a string on the words "New York" preceded by a 2-digit…
itzy
21 votes, 5 answers

How can I recognize slightly modified images?

I have a very large database of jpeg images, about 2 million. I would like to do a fuzzy search for duplicates among those images. Duplicate images are two images that have many (around half) of their pixels with identical values and the rest are…
Eyal
20 votes, 2 answers

How to apply machine learning to fuzzy matching

Let's say that I have an MDM system (Master Data Management), whose primary application is to detect and prevent duplication of records. Every time a sales rep enters a new customer in the system, my MDM platform performs a check on existing…
blackgreen
18 votes, 1 answer

elasticsearch fuzzy matching max_expansions & min_similarity

I'm using fuzzy matching in my project mainly to find misspellings and different spellings of the same names. I need to understand exactly how fuzzy matching in Elasticsearch works and how it uses the two parameters mentioned in the title. As I…
14 votes, 1 answer

fuzzy matching in R

I am trying to detect matches between an open text field (read: messy!) with a vector of names. I created a silly fruit example that highlights my main challenges. df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6), entry = c("Apple", …
Eric Green
13 votes, 2 answers

Partitioning data on a variable to speed up "fuzzy match" using stringdist

I am building on an answer provided to a previous question about fuzzy matching using stringdist. I have two large datasets (~30k rows) with long strings (consumer product names) that I want to fuzzy match by generating a distance score. There is…
roody
12 votes, 1 answer

Fuzzy Matching Numbers

I've been working with Double Metaphone and Caverphone2 for string comparisons and they work well on things like names, addresses, etc. (Caverphone2 is working best for me). However, they produce way too many false positives when you get to numeric…
ElJeffe
12 votes, 1 answer

Joining two datasets using fuzzy logic

I’m trying to do a fuzzy logic join in R between two datasets: the first data set has the name of a location and a column called config; the second data set has the name of a location and two additional attributes that need to be summarized before they are…
steppermotor
12 votes, 7 answers

Using pen strokes with fuzzy tolerance algorithm as encryption key

How can I encrypt/decrypt with fuzzy tolerance? I want to be able to use a Stroke on an InkCanvas as the key for my encryption, but when decrypting, the user should not have to draw the exact same symbol, only a similar one. Can this be done in .NET…
Andreas Zita
12 votes, 3 answers

How to merge two pandas DataFrames based on a similarity function?

Given dataset 1 name,x,y st. peter,1,2 big university portland,3,4 and dataset 2 name,x,y saint peter3,4 uni portland,5,6 The goal is to merge on d1.merge(d2, on="name", how="left") There are no exact matches on name though. So I'm looking to do…
PascalVKooten