Providing a good hash function

Question

In my initial question (detailed experimental investigation): Appropriate container for the fast insertion and lookup of n-dimensional real vectors (initial benchmarking provided) I got really strange behaviour using the unsorted set for management of random N-dimensional float arrays with my initial (likely poor designed Hash function):

#include <iostream>
#include <chrono>
#include <random>
#include <array>
#include <unordered_set>

const int N = 3;  // Dimensionality of the arrays

std::array<double, N> getRandomArray() {
  // Engines and distributions retain state, thus defined as static
  static std::default_random_engine e;                    // engine
  static std::uniform_real_distribution<double> d(0, 1);  // distribution
  std::array<double, N> ret;
  for (size_t i = 0; i < N; ++i) {
    ret[i] = d(e);
  }
  return ret;
}

// Return Squared Euclidean Distance
template <typename InputIt1, typename InputIt2>
double EuclideanDistance2(InputIt1 beg1, InputIt1 end1, InputIt2 beg2) {
  double val = 0.0;
  while (beg1 != end1) {
    double dist = (*beg1++) - (*beg2++);
    val += dist*dist;
  }
  return val;
}

struct ArrayHash {  // Hash Function
  std::size_t operator() (const std::array<double, N>& arr) const {
    std::size_t ret = 0;
    for (const double elem : arr) {
      ret += std::hash<double>()(elem);
    }
    return ret;
  }
};

struct ArrayEqual {  // Equivalence Criterion
  bool operator() (const std::array<double, N>& arr1,
                          const std::array<double, N>& arr2) const {
    return EuclideanDistance2(arr1.begin(), arr1.end(), arr2.begin()) < tol*tol;
  }
 private:
  static constexpr double tol = 1e-6;  // Comparison tolerance
};


int main() {
  // create a unordered set of double arrays (usda)
  std::unordered_set<std::array<double, N>, ArrayHash, ArrayEqual> usda;
  // record start time
  auto start = std::chrono::steady_clock::now();
  // Generate and insert one hundred thousands new double arrays
  for (size_t i = 0; i < 100000; ++i) {
    // Get a new random double array (da)
    std::array<double, N> da = getRandomArray();
    usda.insert(da);
  }
  // record finish time
  auto end = std::chrono::steady_clock::now();
  std::chrono::duration<double> diff = end - start;
  std::cout << "Time to generate and insert unique elements into UNORD. SET: "
            << diff.count() << " s\n";
  std::cout << "unord. set size() = " << usda.size() << std::endl;
  return 0;
}

Two the most strange things are:

Running experiments without any optimization flags even with loose tolerance (1e-1) almost all random vectors (implemented as N-dimensional arrays) were identified as unique. I haven't observed this using vectors and sets.
While turning on -O3 optimization flag, the number of unique elements significantly differs from the numbers without optimization and this is for sure states that there's something wrong with my approach.

Edit: 2-nd problem solved taking into account @WhozCraig remark.

So, my question is: is this strange behaviour because my hash function is badly designed? If so, can you please suggest how to make a better hash function for my case?

Have you checked what the actual hash value of a given array is? Does it change when built with -O3? Have you looked at the distribution of hash values for some manageable number of random arrays? Does -O3 select a different default random engine? — Useless, Aug 02 '16 at 09:27
`std::size_t ret;` you realize that thing is *indeterminate* ? Thus the follow-up looped `ret += std::hash()(elem);` is pretty much worthless. So, yeah, I'd say it's badly designed. — WhozCraig, Aug 02 '16 at 09:29
@WhozCraig Thanks, this was really the case (shame on me)! After initializing `std::size_t ret = 0` I got rid of 2-nd strange thing. — Remis, Aug 02 '16 at 09:42

score 0 · Answer 1 · answered Aug 02 '16 at 19:57

0

Your program exhibits undefined behavior (uninitialized std::size_t ret variable aside).

First, ArrayEqual does not induce an equivalence relationship. It's not transitive - there exist three arrays a, b and c such that a is "close enough" to b and b is "close enough" to c, but a is not "close enough" to c.

Second, ArrayHash may not return the same hash value for two arrays that ArrayEqual declares equal.

Both of these are prerequisites on the template arguments of std::unordered_set.

answered Aug 02 '16 at 19:57

Igor Tandetnik

50,461
4
56
85

Thanks, I fully agree with both your remarks. The practical problem now is how to fix them. – Remis Aug 02 '16 at 20:13
You need to define your equivalence classes, and stick to them. E.g. you could divide the space into an N-dimensional grid, `1e-6` units on a side. All points that fall into the same cell would be considered equivalent by `ArrayEqual`. Pick a representative for the class (e.g. the center of the cell, or the corner with the smallest coordinates), and have `ArrayHash` return a hash of that representative for each point in the cell. – Igor Tandetnik Aug 02 '16 at 20:54

Providing a good hash function

1 Answers1