9

What is the shortest, most cross-platform way to make a std::unordered_set CASE-INSENSITIVE container?

my_set.insert("Apples");  
my_set.insert("apples"); //Insert doesn't occur because of duplicate item

I know STL provides Hash and Pred. What should Hash be? What should Pred be? if they are not-builtins then please provide the code for them along with an example of their use (i.e. how do I declare std::unordered_set?).

Due to the Criticism I will elaborate on what I am trying to do. I need a high performance transparent HTTP proxy server, one of the things it does is looks up HTTP header fields quickly. HTTP header fields are defined as being case-insensitive so I need a case-insensitive container.

unixman83
  • 9,421
  • 10
  • 68
  • 102

3 Answers3

14

The definition of unordered_set is

  template <class Value,
            class Hash = hash<Value>,
            class Pred = std::equal_to<Value>,
            class Alloc = std::allocator<Value> >
  class unordered_set;

If you provide Hash and Pred functors that are case-insensitive, then the set will become so too.

This is a simple example, the string hash function is simplistic but you can change it to your needs

struct MyHash
{
    size_t operator()(const std::string& Keyval) const
    {
        //You might need a better hash function than this
        size_t h = 0;
        std::for_each( Keyval.begin() , Keyval.end() , [&](char c )
        {
            h += tolower(c);
        });
        return h;
    }
};

struct MyEqual
{
    bool operator()(const std::string& Left, const std::string& Right) const
    {
        return Left.size() == Right.size() 
             && std::equal ( Left.begin() , Left.end() , Right.begin() ,
            []( char a , char b )
        {
            return tolower(a) == tolower(b); 
        }
        );
    }
};


int main()
{
    std::unordered_set< std::string , MyHash , MyEqual > m;

    m.insert( "Apple" );
    m.insert( "apple" );

    return 0;
}
Community
  • 1
  • 1
parapura rajkumar
  • 24,045
  • 1
  • 55
  • 85
  • And as an aside, you can do the same thing on the `set` class with the functor it uses to compare values. –  Dec 25 '11 at 01:00
  • 3
    Although you should stick to either `tolower` or `toupper` in both `MyHash` and `MyEqual` because they aren't always transitive. Better yet is to use proper case folding described in the Unicode Specification. – dalle Dec 25 '11 at 09:28
  • 2
    The trouble with that is the values are not case insensitive. If you iterate across the set will it return 'Apple' or 'apple'? – Martin York Dec 25 '11 at 10:15
  • 2
    @LokiAstari Whatever is put in first? the second `insert` would fail. – unixman83 Dec 25 '11 at 10:58
  • @unixman83: The second insert does not fail (fail indicates an error). It just does not change the set (as defined by the Pred). So it is not unfeasible (though I think unlikely) for an implementation to overwrite the original (first value). – Martin York Dec 25 '11 at 11:17
  • @dalle I modified my answer to use `tolower` throughout and it seems like std::equal would also warrant and size comparison first – parapura rajkumar Dec 25 '11 at 14:44
  • 4
    @LokiAstari: http://en.cppreference.com/w/cpp/container/unordered_set/insert "returns a pair consisting of a bool denoting whether the insertion took place and an iterator to the inserted element" – dalle Dec 25 '11 at 15:48
  • 2
    @dalle: Please don't quote that site at me (its horrible and has lots of inaccuracies and the developers opinion (which are not always correct)). What you want is n3242 `23.2.4 Associative containers [associative.reqmts]` Where you will find `Table 103 — Unordered associative container requirements (in addition to container)`. If you don't have a copy of the standard please get one: http://stackoverflow.com/a/4653479/14065 – Martin York Dec 25 '11 at 20:21
  • @LokiAstari: Good, then you can read it yourself. "Returns: The bool component of the returned pair object indicates whether the insertion took place and the iterator component points to the element with key equivalent to the key of value_type(obj)." – dalle Dec 25 '11 at 21:06
  • @LokiAstari: And the table you are referring to states "Effects: Inserts t if and only if there is no element in the container with key equivalent to the key of t. The bool component of the returned pair indicates whether the insertion takes place, and the iterator component points to the element with key equivalent to the key of t." which answers your own question. – dalle Dec 25 '11 at 21:18
  • @dalle: Yes. I was indicating you were correct. Just using a crappy source. – Martin York Dec 25 '11 at 22:46
2

Personally i would define a value type that was case insensitive and that devolves into a string at the merest hint. Thus allowing me to use the standard hash and predicate models.

#include <string>
#include <unordered_set>
#include <iostream>
#include <algorithm>
#include <iterator>

class LCString
{
    std::string  data;

    public:
        operator std::string&()             {return data;}
        operator std::string const&() const {return data;}

        LCString(char const* init)
        {
            std::transform(init, init + strlen(init),
                           std::back_inserter(data), &::tolower);
        }
};

int main()
{
    typedef std::unordered_set<LCString, 
                               std::hash<std::string>, 
                               std::equal_to<std::string> >  MySet;
    MySet  data;
    data.insert("Apples");
    data.insert("apples");

    std::copy(data.begin(), data.end(),
              std::ostream_iterator<std::string>(std::cout, " - "));
    std::cout << "\n";
}

Thus we only put lowercase values into the set:

> g++ pl.cpp
> ./a.out
apples -
>

Edit Case Preserving:

class LCStringOriginalPreserved
{
    std::string  original;
    std::string  data;

    public:
        operator std::string&()             {return data;}
        operator std::string const&() const {return data;}

        std::string& getOriginal()          {return original;}

        LCString(char const* init)
          : original(init)
        {
            std::transform(original.begin(), original.end(),
                           std::back_inserter(data), &::tolower);
        }
};
Martin York
  • 257,169
  • 86
  • 333
  • 562
  • This is a good idea, but what if I need Case-Preserving / Case-Insensitive behavior. It doesn't preserve the case. – unixman83 Dec 25 '11 at 10:54
  • @unixman83: If you go: `data.insert("Apple");data.insert("apple");` Which version do you expect to be in the set? You now can't have case preserving as a random one will be lost. So the concept of case preserving on a case insensitive set does not make sense. Now if you had said unordered_multiset that would have made sense. But the above is easily extended. Just add another field to LCString to store the original value so you can retrieve it explicitly. – Martin York Dec 25 '11 at 10:57
  • 1
    I expect the **first one I inserted** to be in the set, the one with the desired case. – unixman83 Dec 25 '11 at 11:50
  • @unixman83: It is what you will probably get in most situations, but there is nothing in the standard that will guarantee that with anything presented here. To provide that guarantee you will have jump through a couple more hoops. But as I said previously that requirement does not make sense. – Martin York Dec 25 '11 at 12:03
  • 1
    @LokiAstari Huh? If he just uses a case-insensitive comparator he is guartanteed to have the first insert in the set, because the second one does not insert anything, because well, it is already in the set. Perfectly standard-defined behaviour. – Christian Rau Dec 25 '11 at 14:48
  • @ChristianRau: OK I looked it up, yes that is the behavior (though I don't think it's obvious that it should be defined like that and I could see implementations define insert in terms of operator[]). But I am glad it is defined that way.`n3242 23.2.4 Associative containers [associative.reqmts]` – Martin York Dec 25 '11 at 20:28
-1

I like this better.

Works on Linux.

#include <strings.h>
#include <ctype.h>

#include <string>
#include <functional>
#include <tr1/functional_hash.h>

struct iequal_to : public std::binary_function <std::string,std::string,bool>
{
  bool operator() (const std::string& x, const std::string& y) const
  {
    return (!strcasecmp(x.c_str(), y.c_str()));
  }
};

const std::string LC(const std::string& x)
{
  std::string ret(x);
  std::string::size_type i;
  for(i = 0; i < x.size(); ++i)
    ret[i] = tolower(x[i]);
  return ret;
}

struct ihash : public std::unary_function <std::string,size_t>
{
  size_t ihash::operator() (const std::string& x) const
  {
    return std::tr1::hash<std::string>()(LC(x));
  }
};
unixman83
  • 9,421
  • 10
  • 68
  • 102
  • Better than what? This is not an answer! Isn't this just a rephrased version of parapura's answer? Completely obsolete! What is `strcasecmp`? And do you really create a new string everytime the hash function is called? – Christian Rau Dec 25 '11 at 09:14
  • `strcasecmp` is standard in Linux's glibc C library (see `stricmp`). – unixman83 Dec 25 '11 at 11:49