0

The following question was worded such that I thought it would get me what I need:

Case insensitive std::set of strings

Note that they mention case insensitive search as well as insertion. However, when I implement the answer using a custom comparator, I get case insensitive insertion but not search. Maybe I'm misunderstanding what is referred to as "search?" Here's a minimal example of what I'm seeing:

#include <iostream>
#include <string>
#include <unordered_set>

using namespace std;

struct InsensitiveCompare {
  bool
  operator()(std::string_view a, std::string_view b) const
  {
    return std::equal(
            a.begin(),
            a.end(),
            b.begin(),
            b.end(),
            [](char a, char b) {
                return tolower(a) == tolower(b);
            }   
        );  
  }
};

std::unordered_set<std::string, std::hash<std::string>, InsensitiveCompare> s;


int
main(int argc, char* argv[])
{
    s.insert("one");
    s.insert("oNe");
    s.insert("tWo");

    cout << "size: " << s.size() << endl;

    auto found = s.find("TWO");
    cout << "found(TWO): " << (found != s.end()) << endl;
    found = s.find("tWo");
    cout << "found(tWo): " << (found != s.end()) << endl;

    return 0;
}

This produces the following output:

g++ -g -Wall -Werror -std=c++17 test.cc -o test
./test
size: 2
found(TWO): 0
found(tWo): 1

Notice that find is behaving in a case sensitive manner: tWo matches but TWO doesn't. I desire both to match. What are my options for making find case insensitive.

firebush
  • 5,180
  • 4
  • 34
  • 45

2 Answers2

5

The equality operation is only used when the hashes compare equal. If the hashes don't compare equal, then the two strings won't be checked for exact equality. Otherwise, we'd just have to check every hash bucket - who cares if the hashes match or not? That would defeat the constant look up time of the hashed set.

You'll need to make a custom hash function that ignores case.

Given a hash set S, its hash function H(x), its equality operation x == y, and the hash equality H(x) == H(y): For all x and y that are valid elements of S, if x == y, then it must also be true that H(x) == H(y), otherwise S is said to be "totally jank".

JohnFilleau
  • 4,045
  • 1
  • 15
  • 22
1

Using the answer from @JohnFilleau, I created a custom hasher. This seems to do what I want now:

#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <unordered_set>

using namespace std;

struct InsensitiveCompare {
  bool
  operator()(std::string_view a, std::string_view b) const
  {
    return std::equal(
            a.begin(),
            a.end(),
            b.begin(),
            b.end(),
            [](char a, char b) {
                return tolower(a) == tolower(b);
            }   
        );  
  }
};

struct StringHashByLower {
    public:
    size_t operator()(const std::string & str) const {
        std::string lower;
        std::transform(str.begin(), str.end(), lower.begin(),
                [](unsigned char c) -> unsigned char { return std::tolower(c); }); 
        return std::hash<std::string>()(lower);
    }   
};


std::unordered_set<std::string, StringHashByLower, InsensitiveCompare> s;


int
main(int argc, char* argv[])
{
    s.insert("one");
    s.insert("oNe");
    s.insert("tWo");

    cout << "size: " << s.size() << endl;

    auto found = s.find("TWO");
    cout << "found(TWO): " << (found != s.end()) << endl;
    found = s.find("tWo");
    cout << "found(tWo): " << (found != s.end()) << endl;

    return 0;
}

This now does what I want:

g++ -g -Wall -Werror -std=c++17 test.cc -o test
./test
size: 2
found(TWO): 1
found(tWo): 1
firebush
  • 5,180
  • 4
  • 34
  • 45