1

While working on a project, I ran into the following issue I could not explain to myself.

I have the following is_in_set(..) function, which simply checks if a cstring is in an unordered_set of cstrings:

bool is_in_set(const char * str, std::unordered_set<const char *> the_set)
{
    if ( the_set.find( str ) != the_set.end() )
        return true;
    else
        return false;
}

And then I created the following sample main method to demonstrate my problem:

int main()
{
    std::unordered_set<const char *> the_set({"one",
        "two", "three", "four", "five"});

    std::string str = "three";
    const char * cstr = "three";

    std::cout << "str in set? "
        << is_in_set( str.c_str() , the_set ) << std::endl
        << "cstr in set? " 
        << is_in_set( cstr, the_set ) << std::endl;

    const char * str_conv = str.c_str();

    std::cout << "str_conv in set? "
        << is_in_set( str_conv , the_set ) << std::endl
        << "strcmp(str_conv, cstr) = " << strcmp( str_conv , cstr )
        << std::endl;

    return 0;
}

I expected the above code to find the std::string casted to const char*, as well as the cstring in the set. Instead of that, it generates the following output (Visual Studio Community 2017):

str in set? 0
cstr in set? 1
str_conv in set? 0
strcmp(str_conv, cstr) = 0

I also ran a for-loop over both variables, outputting byte by byte (in hexadecimal representation) for each, which results in the following:

74 68 72 65 65 00 = c_str
74 68 72 65 65 00 = str_conv

Why is the std::string casted to const char * not being found in the set? Shouldn't strcmp return a value different from 0 in this case?

nacho3
  • 23
  • 2
  • 2
    If you have two copies of the same string in different places, the addresses obviously will be different. You can't compare C strings by comparing addresses. Why don't you make a set of `std::strings`? If you really want it to work, you have to provide a custom comparator. Read `unordered_set` docs to learn how. – HolyBlackCat Mar 14 '18 at 16:07
  • 2
    Note that you were able to find `cstr` in your set because your compiler realized that two string literals have the same value and merged them into one. This behaviour is not portable. – HolyBlackCat Mar 14 '18 at 16:09
  • 1
    `std::unordered_set::find` searches for matching `const char*` values without looking at what values they are pointing to. If you want custom comparison to be done, consider passing custom predicates for key equality, and hashing, to `std::unordered_set`, when constructing it. – Algirdas Preidžius Mar 14 '18 at 16:11

3 Answers3

2

For const char *, there is no overload of the == operator that compares strings by value, so I believe the unordered_set container will always compare pointers, not the values of the pointed-to strings.

The compiler can, as an optimization, make multiple string literals with the same characters use the same memory location (and thus have identical pointers), which is why you're able to find the string when you use another string literal. But any string you construct by some other mechanism, even if it contains the same characters, won't be at the same memory location, and thus the pointers will not be equal.

Daniel Pryden
  • 59,486
  • 16
  • 97
  • 135
  • Great, thank you! For some stupid reason I thought the unordered_set would come with up a smart solution for comparing the two on its own.. :) – nacho3 Mar 14 '18 at 16:18
1

Use std::unordered_set<std::string> or provide custom hasher if you sure that your strings will not left the stack while you are using the hashtable e.g. static variables or allocated with new/malloc etc.

Something like:

struct str_eq {
  bool opeator()(const char* lsh, const char rhs) const noexcept
  {
    return lsh == rhs || 0 == std::strcmp(lsh, rhs);
  }  
};


struct str_hash {
   std::size_t opeator()(const char* str) const noexcept
   {
     // some mur-mur2, google cityhash hash_bytes etc instead of this
      return std::hash<std::string>( std::string(str) ) ();
   }
};

typedef std::unordered_set<const char*, str_hash, str_eq, std::allocator<const char*> > my_string_hashset;
Victor Gubin
  • 2,782
  • 10
  • 24
1

As @Daniel Pryden pointed out, you are doing address comparisons. To fix this you will need to either have the unordered_set store std::string objects, or create a custom comparison for the unordered_set to use.

Based on the answer to a related question, something like this:

struct StringEqual
{
    bool operator()(const char* a, const char* b) { return 0 == strcmp(a,b); }
};

std::unordered_set<const char *, std::Hash<const char*>, StringEqual> the_set(
    {"one", "two", "three", "four", "five"});

should do the trick. This gives the unordered_set a better operator to use for testing the strings.

For more information about the Pred template parameter, see the documentation.

Kyle A
  • 928
  • 7
  • 17