
I need to compare strings to decide whether they represent the same thing. This relates to case titles entered by humans where abbreviations and other small details may differ. For example, consider the following two titles:

std::string first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP";

As opposed to:

std::string second = "Harper v. The Law Offices of Huey & Luey, LLP";

A human can quickly gauge that these are most likely one and the same. The current approach I have taken is to normalize the strings by lowercasing all letters and removing all punctuation and spaces, giving:

std::string firstNormalized = "henrycharpervthelawofficesofhueylueyllp";

And:

std::string secondNormalized = "harpervthelawofficesofhueylueyllp";
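The normalization step described here can be sketched as follows (in Python for brevity; the same logic ports directly to C++ with `std::tolower` and `std::isalnum`):

```python
def normalize(title):
    """Lowercase the title and drop everything except letters and digits."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

first = "Henry C. Harper v. The Law Offices of Huey & Luey, LLP"
print(normalize(first))  # henrycharpervthelawofficesofhueylueyllp
```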

Comparing in this case, one is a sub-sequence of the other, but you can imagine other more complex variations where that does not necessarily occur, yet they have significant sub-sequences in common. There could also be occasional human entry errors such as transposed letters and spelling errors.

Perhaps some kind of character diff program could help? I've seen good line diff programs for comparing differences in code to be checked in; is there something like that on a character basis, maybe in Boost? If you could count the number of consecutive characters in common and take the ratio to the characters unshared, perhaps that would be a good heuristic?

In the end, I need a Boolean decision as to whether to consider them the same or not. It doesn't have to be perfect, but it should ideally rarely be wrong.

What algorithm can I use that will give me some kind of quantification as to how similar the two strings are to each other which I can then convert into a yes/no answer by way of some heuristic?

Bhargav Rao
WilliamKF
  • I've used the Levenshtein distance before. Easy to implement... http://en.wikipedia.org/wiki/Levenshtein_distance – souldzin Mar 08 '13 at 21:32
  • Is there a Levenshtein distance in Boost? – WilliamKF Mar 08 '13 at 21:36
  • Sorry, not constructive... Here is the [wiki page you were looking for](http://en.wikipedia.org/wiki/String_metric). – djechlin Mar 08 '13 at 21:41
  • @djechlin Why? This is an interesting question. – Stephan Dollberg Mar 08 '13 at 21:50
  • @WhozCraig: Thanks, but that would not be fair, make that your answer and collect the rep. :) – Daniel Frey Mar 08 '13 at 21:57
  • @bamboon There is no c++ or boost in the question. There isn't a problem the OP has, other than not having looked for long enough to find what algorithms are available. Not meaning to be rude, but there are probably better places to ask this. Yes it's interesting to me too. – Peter Wood Mar 08 '13 at 21:59
  • @MarkRansom: That's the point, WhozCraig did that, so I won't take the credit for it. – Daniel Frey Mar 08 '13 at 22:00
  • @DanielFrey, sorry I didn't see that. I wouldn't worry about who gets credit, unless it makes you uncomfortable to have your name associated with something that isn't really yours. From the site's perspective, the most important point is to have answers. – Mark Ransom Mar 08 '13 at 22:05
  • tons of implementations for edit distance: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance – zzk Mar 08 '13 at 22:08
  • Concerning your request for a 'boolean' decision the best algorithms I believe only return a distance between two strings. Given your case, you might want to consider a method that compares the words of one string to the words of another. – souldzin Mar 08 '13 at 22:11
  • @MarkRansom: I don't have a problem with my name being associated, I just think it's unfair to take credit for other people's work. But with all those comments around, it should be clear by now :) – Daniel Frey Mar 08 '13 at 22:16
  • possible duplicate of [Getting the closest string match](http://stackoverflow.com/questions/5859561/getting-the-closest-string-match) – Alain Mar 09 '13 at 03:38
  • @bamboon because OP is asking for a list and there are multiple correct answers because there aren't enough criteria. It's a discussion on string algorithms. Good SO questions have a unique best answer. – djechlin Mar 09 '13 at 04:52
  • Another very useful article for those interested: http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/ – aug Mar 05 '15 at 23:01

6 Answers


What you're looking for are called String Metric algorithms. There are a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance: The minimum number of single-character edits required to change one word into the other; the strings do not have to be the same length.
  • Hamming Distance: The number of characters that differ between two equal-length strings.
  • Smith–Waterman: A family of algorithms for computing variable sub-sequence similarities.
  • Sørensen–Dice Coefficient: A similarity algorithm that computes difference coefficients of adjacent character pairs.

Have a look at these as well as others on the wiki page on the topic.
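For illustration, here is a minimal sketch of the first of these, the Levenshtein distance, using the classic dynamic-programming recurrence (Python for brevity; it ports straightforwardly to C++):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute if needed
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```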

Daniel Frey

The Damerau–Levenshtein distance is another algorithm for comparing two strings, similar to the Levenshtein distance algorithm. The difference between the two is that it also checks for transpositions between adjacent characters, and hence may give a better result for error correction.

For example: the Levenshtein distance between night and nigth is 2, but the Damerau–Levenshtein distance between night and nigth is 1, because it is just a swap of an adjacent pair of characters.
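A sketch of the restricted ("optimal string alignment") variant of Damerau–Levenshtein, again in Python for brevity; it is the usual Levenshtein recurrence plus one extra transposition case:

```python
def osa_distance(a, b):
    """Levenshtein distance extended with adjacent-character transpositions
    (the 'optimal string alignment' variant of Damerau-Levenshtein)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("night", "nigth"))  # 1, versus Levenshtein's 2
```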

Captain Man
Ankit Chaurasia

You could use n-grams for that. For example, transform the two strings into word trigrams (usually lowercased) and compare the percentage of them that are equal to one another.

Your challenge is to define a minimum percentage for similarity.

http://en.wikipedia.org/wiki/N-gram
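A minimal sketch of the n-gram idea (Python for brevity), here using character trigrams and a Jaccard-style overlap ratio rather than the word trigrams suggested above, since case titles are short:

```python
def trigrams(s):
    """Set of lowercase character trigrams of s."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    """Fraction of trigrams the two strings share (Jaccard index, 0..1)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

The threshold on this ratio is the heuristic you would have to tune.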

noderman

Another algorithm that you can consider is the Simon White Similarity:

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    if string is None:
        return []

    s = string.lower()
    return [s[i : i + 2] for i in range(len(s) - 1)]

def simon_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)

    if union == 0:
        return 0.0

    # Remove each matched bigram from the pool so a repeated bigram
    # in pairs1 cannot match the same element of pairs2 twice.
    hit_count = 0
    remaining = list(pairs2)
    for x in pairs1:
        if x in remaining:
            hit_count += 1
            remaining.remove(x)
    return (2.0 * hit_count) / union

You may use the length of the longest common sub-sequence (LCS) to solve the problem. If the LCS of the two input strings is shorter than either string, they are not identical; the closer its length is to the lengths of both inputs, the more similar they are.

You may use dynamic programming to compute it, and the space complexity can be optimized as well if you only need the length and don't wish to recover the sub-sequence itself.
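A space-optimized sketch of the LCS-length computation (Python for brevity), keeping only one DP row at a time instead of the full table:

```python
def lcs_length(a, b):
    """Length of the longest common sub-sequence of a and b,
    using one row of the DP table instead of the full m*n grid."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

# If lcs_length(a, b) == len(a) == len(b), the strings are identical.
```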

nmg_vikas
  • This answer ignores the OP's mention that the comparison relates "to case titles entered by humans where abbreviations and other small details may differ". However, it answers (very specifically) the OP's previous wording, asking how "to compare strings to decide whether they represent the same thing", if "thing" = "string". – bballdave025 Dec 01 '22 at 18:46

The string similarity metrics mentioned here can all work. However, further normalization of your text in very specific ways can lead to much better results.

TL;DR

From my experience in speech recognition and handwriting recognition, I think your problem will be best solved by using Text Normalization (archived) followed by a Word Error Rate (archived). A good, quick overview of this process is given at an Amazon AWS Machine Learning Blog post on Evaluating Speech Recognition (archived). A good (somewhat standard) tool for doing both parts of the process (normalization and scoring) is NIST's SCTK. First use rfilter1 for the text normalization, then use sclite to get the score. Figure out based on the score which strings you consider to match.

Further details

I think that there are three areas of study/application in which the problems faced are very similar to yours. They are: 1) Speech Recognition (archived) (a domain "where abbreviations and other small details may differ"); and the related fields of 2) Optical Character Recognition (archived) and 3) Handwriting Recognition (archived) (both domains "where [data is] entered by humans [and] where abbreviations and other small details may differ").

It's especially useful to look at the scoring of automated or human transcribers/recognizers and at the problem of searching for strings in any such transcription. From my experience in these domains, it seems that the best similarity comparisons come from Levenshtein Distance (archived) using words instead of characters for finding the edit distance; this is called the Word Error Rate. A normalization of text, including case, punctuation, and such things as abbreviations, makes the comparison even better.
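The Word Error Rate is just the edit-distance idea applied at the word level; a minimal sketch (Python for brevity, and without the GLM-style text normalization that the tooling below provides) looks like this:

```python
def word_error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / number of reference words,
    computed with a word-level Levenshtein alignment."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rw != hw)))  # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)
```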

A Quick Example

It seems you're using C or C++. sclite and rfilter1 are mostly written in C. I hope that an example using bash+sclite will suffice.

Contents of law.glm, a VERY minimal GLM file (GLobal Mapping file, i.e. pairs of search and replace rules)

;;
* name "law.glm"
* desc "Showing extra normalization"
* format = "NIST1"     ;; other option is NIST2
* max_nrules = "1000"  ;; allocating space (I can update this if necessary)
* copy_no_hit = "T"    ;; don't ignore the line if there isn't a match
* case_sensitive = "F"

. => / __ [ ] ;; changes only if there's a space after it
, => / __ [ ]
? => / __ [ ]
! => / __ [ ]
versus => v / [ ] __ [ ] ;; changes only if there's a space before & after
vs => v / [ ] __ [ ]
& => and / [ ] __ [ ]
llp => limited liability partnership / [ ] __ [ ]
llc => limited liability company / [ ] __ [ ]
it's => it is / [ ] __ [ ]
shoppe => shop / [ ] __ [ ]
mister => Mr / [ ] __ [ ]

Now, in bash.

$ first="Henry C. Harper v. The Law Offices of Huey & Luey, LLP (spk1_1)" 
$ second="Harper v. The Law Offices of Huey & Luey, LLP (spk1_1)"
#
# sclite requires actual files.
# It also requires something after the string (an ID, which has been
#+ put in as '(spk1_1)', don't worry about the details.)
#
$ echo "${first}" > first.txt
$ echo "${second}" > second.txt
#
# normalization
$ rfilter1 law.glm < first.txt > first_glm_normalized.txt
$ tr 'A-Z' 'a-z' < first_glm_normalized.txt > first_normalized.txt
$ rfilter1 law.glm < second.txt > second_glm_normalized.txt
$ tr 'A-Z' 'a-z' < second_glm_normalized.txt > second_normalized.txt
#
# Run the scorer (similarity finder)
$ sclite -r first_normalized.txt -h second_normalized.txt -i rm -o snt stdout
===============================================================================

                     SENTENCE LEVEL REPORT FOR THE SYSTEM:
    Name: second_normalized.txt

===============================================================================


SPEAKER spk1
id: (spk1_1)
Scores: (#C #S #D #I) 12 0 2 0
REF:  HENRY C harper v the law offices of huey and luey limited liability partnership
HYP:  ***** * harper v the law offices of huey and luey limited liability partnership
Eval: D     D                                                                   

Correct               =   85.7%   12   ( 12)
Substitutions         =    0.0%    0   (  0)
Deletions             =   14.3%    2   (  2)
Insertions            =    0.0%    0   (  0)

Errors                =   14.3%    2   (  2)

Ref. words            =           14   ( 14)
Hyp. words            =           12   ( 12)
Aligned words         =           14   ( 14)

-------------------------------------------------------------------------------


$ # That `-i rm' has to do with that ID we talked about

So, there's a 14.3% Word Error Rate.

Now, let's look at a law case name that shouldn't match.

$ third="Larry Viola versus The Law Office of Mister Scrooge McDuck, Limited Liability Corporation (spk1_1)"
$ echo "${third}" > third.txt
$ rfilter1 law.glm < third.txt > third_glm_normalized.txt
$ tr 'A-Z' 'a-z' < third_glm_normalized.txt > third_normalized.txt
#
$ sclite -r first_normalized.txt -h third_normalized.txt -i rm -o snt stdout
===============================================================================

                     SENTENCE LEVEL REPORT FOR THE SYSTEM:
    Name: third_normalized.txt

===============================================================================


SPEAKER spk1
id: (spk1_1)
Scores: (#C #S #D #I) 6 7 1 0
REF:  HENRY C     HARPER v the law OFFICES of HUEY AND     LUEY   limited liability PARTNERSHIP
HYP:  ***** LARRY VIOLA  v the law OFFICE  of MR   SCROOGE MCDUCK limited liability CORPORATION
Eval: D     S     S                S          S    S       S                        S

Correct               =   42.9%    6   (  6)
Substitutions         =   50.0%    7   (  7)
Deletions             =    7.1%    1   (  1)
Insertions            =    0.0%    0   (  0)

Errors                =   57.1%    8   (  8)

Ref. words            =           14   ( 14)
Hyp. words            =           13   ( 13)
Aligned words         =           14   ( 14)

-------------------------------------------------------------------------------


$ # This one has a 57.1% Word Error Rate

You'll likely need to run some of your strings through the scoring (comparison) process to come up with a heuristic of where to separate True from False.

bballdave025