Questions tagged [similarity]

Similarity measures quantify how alike two objects (e.g. documents, feature vectors) are.

In information retrieval, similarity is used to describe the relevance between document vectors. The measurement is then used to rank search results.
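
As a rough illustration of ranking by similarity between document vectors, here is a minimal sketch that uses cosine similarity over raw term counts. The function names `cosine_similarity` and `rank` are hypothetical, and whitespace tokenization is an assumption made for brevity; real retrieval systems typically add weighting such as tf-idf.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    query_vec = Counter(query.lower().split())
    scored = [
        (cosine_similarity(query_vec, Counter(doc.lower().split())), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = ["math overflow", "stack overflow", "apple pie"]
    for score, doc in rank("stack overflow", docs):
        print(f"{score:.2f}  {doc}")
```

Scores fall between 0 (no shared terms) and 1 (same term proportions), so sorting by score in descending order gives the ranking.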

1866 questions
471
votes
16 answers

Find the similarity metric between two strings

How do I get the probability of a string being similar to another string in Python? I want to get a decimal value like 0.9 (meaning 90%) etc. Preferably with standard Python and library. e.g. similar("Apple","Appel") #would have a high…
tenstar
  • 9,816
  • 9
  • 24
  • 45
196
votes
6 answers

Checking images for similarity with OpenCV

Does OpenCV support the comparison of two images, returning some value (maybe a percentage) that indicates how similar these images are? E.g. 100% would be returned if the same image was passed twice, 0% would be returned if the images were totally…
Boris
  • 8,551
  • 25
  • 67
  • 120
161
votes
25 answers

A better similarity ranking algorithm for variable length strings

I'm looking for a string similarity algorithm that yields better results on variable length strings than the ones that are usually suggested (Levenshtein distance, Soundex, etc.). For example, given string A: "Robert", then string B: "Amy…
marzagao
  • 3,756
  • 4
  • 19
  • 14
89
votes
10 answers

What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times. Say the input matrix is: A= [0 1 0 0 1 0 0 1 1 1 1 1 0 1…
zbinsd
  • 4,084
  • 6
  • 33
  • 40
86
votes
8 answers

Calculate cosine similarity given 2 sentence strings

From Python: tf-idf-cosine: to find document similarity, it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are there any ways to calculate cosine similarity between 2 strings? s1 = "This is a…
alvas
  • 115,346
  • 109
  • 446
  • 738
81
votes
7 answers

How to calculate distance similarity measure of given 2 strings?

I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example: The real word: hospital. Mistaken word: haspita. Now my aim is to determine how many characters I need to modify the mistaken word to…
Furkan Gözükara
  • 22,964
  • 77
  • 205
  • 342
80
votes
3 answers

How to find similar results and sort by similarity?

How do I query for records ordered by similarity? E.g. searching for "Stock Overflow" would return: Stack Overflow, SharePoint Overflow, Math Overflow, Politic Overflow, VFX Overflow. E.g. searching for "LO" would return: pabLO…
Robin Rodricks
  • 110,798
  • 141
  • 398
  • 607
71
votes
6 answers

Comparing strings with tolerance

I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on. Is there some kind of framework which can…
Oliver Hanappi
  • 12,046
  • 7
  • 51
  • 68
63
votes
15 answers

Algorithm to find articles with similar text

I have many articles in a database (with title, text). I'm looking for an algorithm to find the X most similar articles, something like Stack Overflow's "Related Questions" when you ask a question. I tried googling for this but only found pages…
Osama Al-Maadeed
  • 5,654
  • 5
  • 28
  • 48
58
votes
12 answers

String similarity score/hash

Is there a method to calculate something like general "similarity score" of a string? In a way that I am not comparing two strings together but rather I get some number (hash) for each string that can later tell me that two strings are or are not…
Josef Sábl
  • 7,538
  • 9
  • 54
  • 66
56
votes
3 answers

Java library to compare image similarity

I spent quite some time researching for a library that allows me to compare images to one another in Java. I didn't really find anything useful, maybe my GoogleSearch-skill isn't high enough so I thought I'd ask you guys if you could point me into a…
F.P
  • 17,421
  • 34
  • 123
  • 189
53
votes
1 answer

Finding similar strings with PostgreSQL quickly

I need to create a ranking of similar strings in a table. I have the following table: create table names ( name character varying(255) ); Currently, I'm using the pg_trgm module, which offers the similarity function, but I have an efficiency problem. I…
cdarwin
  • 4,141
  • 9
  • 42
  • 66
49
votes
10 answers

Figure out if a business name is very similar to another one - Python

I'm working with a large database of businesses. I'd like to be able to compare two business names for similarity to see if they possibly might be duplicates. Below is a list of business names that should test as having a high probability of being…
Chris Dutrow
  • 48,402
  • 65
  • 188
  • 258
47
votes
3 answers

Python: Semantic similarity score for Strings

Are there any libraries for computing semantic similarity scores for a pair of sentences? I'm aware of WordNet's semantic database, and how I can generate the score for 2 words, but I'm looking for libraries that do all pre-processing tasks like…
user8472
  • 726
  • 1
  • 8
  • 16
43
votes
2 answers

Compare similarity algorithms

I want to use string similarity functions to find corrupted data in my database. I came upon several of them: Jaro, Jaro-Winkler, Levenshtein, Euclidean and Q-gram. I wanted to know what the difference between them is and in what situations…
Ali
  • 808
  • 2
  • 11
  • 20