0

What is the easiest way, with the least amount of code, to compare two strings, while ignoring the following:

"hello  world" == "hello world"                   // spaces
"hello-world"  == "hello world"                   // hyphens
"Hello World"  == "hello worlD"                   // case
"St pierre"    == "saint pierre" == "St. Pierre"  // word replacement

I'm sure this has been done before, and there are some libraries to do this kind of stuff, but I don't know any. This is in C++ preferably, but if there's a very short option in whatever other language, I'll want to hear about it too.

Alternatively, I'd also be interested in any library that could give a percentage of matching. Say, hello-world and hello wolrd are 97% likely to be the same meaning, just a hyphen and a mispelling.

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
shkra19
  • 309
  • 2
  • 4
  • 13
  • 2
    Too many cases here.. you need to figure out what you want and all the rules. –  May 08 '13 at 17:09
  • 3
    I think you're question is a bit too broad here. If you have a specific problem with your implementation then we can help with that, but as it stands you're basically asking the SO community to write your whole program for you. – Chris May 08 '13 at 17:09
  • "Word Replacement" is something one writes books on, not a paragraph. – Mooing Duck May 08 '13 at 17:16

5 Answers5

2
  1. Remove spaces from both strings.
  2. Remove hyphens from both strings.
  3. Convert both strings to lower case.
  4. Convert all occurrences of “saint” and “st.” to “st”.
  5. Compare strings like normal.

For example:

#include <cctype>
#include <string>
#include <algorithm>
#include <iostream>

static void remove_spaces_and_hyphens(std::string &s)
{
    s.erase(std::remove_if(s.begin(), s.end(), [](char c) {
                return c == ' ' || c == '-';
            }), s.end());
}

static void convert_to_lower_case(std::string &s)
{
    for (auto &c : s)
        c = std::tolower(c);
}

static void
replace_word(std::string &s, const std::string &from, const std::string &to)
{
    size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();
    }
}

static void replace_words(std::string &s)
{
    replace_word(s, "saint", "st");
    replace_word(s, "st.", "st");
}

int main()
{
    // Given two strings:
    std::string s1 = "Hello, Saint   Pierre!";
    std::string s2 = "hELlO,St.PiERRe!";

    // Remove spaces and hyphens.
    remove_spaces_and_hyphens(s1);
    remove_spaces_and_hyphens(s2);

    // Convert to lower case.
    convert_to_lower_case(s1);
    convert_to_lower_case(s2);

    // Replace words...
    replace_words(s1);
    replace_words(s2);

    // Compare.
    std::cout << (s1 == s2 ? "Equal" : "Doesn't look like equal") << std::endl;
}

There is a way, of course, to code this more efficiently, but I recommend you start with something working and optimize it only when it proves to be a bottleneck.

It also sounds like you might be interested in string similarity algorithms like “Levenshtein distance”. Similar algorithms are used, for example, by search engine or editors to offer suggestion on spell correction.

Community
  • 1
  • 1
  • 1
    Works in the specific case, but a more accurate algorithm should be "replace whatever ' ',\t,\n sequence with a single ' '", instead of "remove all spaces". If you remove them all, you can match what is required to remain distinct. – Emilio Garavaglia May 08 '13 at 17:49
1

I dont know any library, but for equlity, if speed is not rpoblem, you can do char-by-char compare and ignore "special" characters (respectively move iterator further in text).

As for comparing texts, you can use simple Levenshtein distance.

Martin Perry
  • 9,232
  • 8
  • 46
  • 114
1

For spaces and hyphens, just replace all spaces/hyphens in the string and do a comparison. For case, convert all text to upper or lower case and do the comparison. For word replacement, you would need a dictionary of words with the key being the abbreviation and the value being the replacement word. You may also consider using the Levenshtein Distance algorithm for showing how similar one phrase is to another. If you want statistical probablility of how close a word/phrase is to another word/phrase, you will need sample data to do a comparison.

Cameron Tinker
  • 9,634
  • 10
  • 46
  • 85
0

QRegExp is what you are looking for. It won't print out the percentages, but you can make some pretty slick ways of comparing one string to another, and finding the number of matches of one string to another.

Regular Expressions are available with almost ever language out there. I like GSkinner's RegEx page for learning regular expressions.

http://qt-project.org/doc/qt-4.8/qregexp.html

Hope that helps.

phyatt
  • 18,472
  • 5
  • 61
  • 80
0

for the first 3 requirments,

  1. remove all spaces/hypens of string (or replace it to a char, e.g'') "hello world" --> "helloworld"
  2. compare them ignore case. Case insensitive string comparison in C++

for the last requirment, it is more compliate.
first you need a dictionary, which in KV structure:
'St.': 'saint'
'Mr.': 'mister'

second use boost token to seperate the string, and fetch then in the KV Store
then replace the token to the string, but it may in low performance:

http://www.boost.org/doc/libs/1_53_0/libs/tokenizer/tokenizer.htm

Community
  • 1
  • 1
Freeman Zhang
  • 163
  • 1
  • 3
  • 12
  • 1
    Removing characters will be slower than iterating text char-by-char and compare chars with respect to ignored characters. – Martin Perry May 08 '13 at 17:14