9

I just realized that the following algorithm for computing the hash code of a stream cannot be implemented with Stream.reduce(...). The problem is that the initial seed for the hash code is 1, which is not an identity value for the accumulator.

The algorithm for List.hashCode():

int hashCode = 1;
for (E e : list)
  hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());

You might be tempted to think that the following is correct, but it isn't; it only happens to produce the right result as long as the stream processing is not split up.

List<Object> list = Arrays.asList(1, null, new Object(), 4, 5, 6);
int hashCode = list.stream().map(Objects::hashCode).reduce(1, (a, b) -> 31 * a + b);
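
For illustration, a minimal sketch of how the same reduction can diverge from List.hashCode() once the stream gets split (the parallel result is not deterministic; the point is merely that it is not guaranteed to match):

List<Object> list = Arrays.asList(1, null, new Object(), 4, 5, 6);
int expected = list.hashCode();

// every chunk of the parallel reduction starts from the seed 1 again and the
// accumulator (a, b) -> 31 * a + b is not associative, so the combined result
// will usually differ from expected
int parallel = list.parallelStream()
    .map(Objects::hashCode)
    .reduce(1, (a, b) -> 31 * a + b);

System.out.println(expected + " vs. " + parallel);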

It seems that the only sensible ways of doing it are to get the Iterator of the Stream and do normal sequential processing, or to collect it to a List first.

Roland
  • The question is why you'd want to calculate the `hash code for a stream`. What are you going to do with it? – Eran Sep 08 '16 at 08:18
  • Sounds like an [XY problem](http://xyproblem.info/). Doesn't the *complex datastructure* implement `hashCode()`? If not, aren't you allowed to implement it? Do you really need to compute this hash in parallel on the stream, so that the `reduce` isn't relevant? Maybe as a last resort, have you considered `stream.collect(Collectors.toList()).hashCode()`? – Spotted Sep 08 '16 at 09:15
  • So if I understand correctly, you want to know whether 2 objects are equal based only on a part of their state that has been extracted/transformed by some `Function`s? – Spotted Sep 08 '16 at 09:45
  • In light of this new problem, can you also update your question so that more people can help you more easily? – Spotted Sep 08 '16 at 11:28
  • @Eran: because we can. Well, at least I enjoyed solving the problem. ;^) And, who knows, perhaps in the future, someone really has an extremely large list to hash in parallel… – Holger Sep 08 '16 at 17:09

5 Answers

14

While, at first glance, the hash code algorithm seems to be non-parallelizable due to its non-associativity, it becomes possible if we transform the function:

((a * 31 + b) * 31 + c ) * 31 + d

to

a * 31 * 31 * 31 + b * 31 * 31 + c * 31 + d

which basically is

a * 31³ + b * 31² + c * 31¹ + d * 31⁰

or for an arbitrary List of size n:

1 * 31ⁿ + e₀ * 31ⁿ⁻¹ + e₁ * 31ⁿ⁻² + e₂ * 31ⁿ⁻³ +  …  + eₙ₋₃ * 31² + eₙ₋₂ * 31¹ + eₙ₋₁ * 31⁰

with the first 1 being the initial value of the original algorithm and eₓ being the hash code of the list element at index x. While the summands are now independent of evaluation order, there is obviously still a dependency on the element's position. We can solve this either by streaming over the indices in the first place, which works for random-access lists and arrays, or generally with a collector that tracks the number of encountered objects. The collector can stick to the repeated multiplication for accumulation and only has to resort to the power function when combining partial results:

static <T> Collector<T,?,Integer> hashing() {
    // int[0] holds the partial hash code, int[1] the number of elements seen so far
    return Collector.of(() -> new int[2],
        // accumulator: the ordinary sequential hash code update
        (a,o)    -> { a[0]=a[0]*31+Objects.hashCode(o); a[1]++; },
        // combiner: shift the left partial hash by 31^(size of the right chunk), then add
        (a1, a2) -> { a1[0]=a1[0]*iPow(31,a2[1])+a2[0]; a1[1]+=a2[1]; return a1; },
        // finisher: account for the initial 1 of the original algorithm (1 * 31^count)
        a -> iPow(31,a[1])+a[0]);
}
// exponentiation by squaring, derived from http://stackoverflow.com/questions/101439
private static int iPow(int base, int exp) {
    int result = 1;
    for(; exp>0; exp >>= 1, base *= base)
        if((exp & 1)!=0) result *= base;
    return result;
}

 

List<Object> list = Arrays.asList(1, null, new Object(), 4, 5, 6);
int expected = list.hashCode();

int hashCode = list.stream().collect(hashing());
if(hashCode != expected)
    throw new AssertionError();

// works in parallel
hashCode = list.parallelStream().collect(hashing());
if(hashCode != expected)
    throw new AssertionError();

// a method avoiding auto-boxing is more complicated:
int[] result=list.parallelStream().mapToInt(Objects::hashCode)
    .collect(() -> new int[2],
    (a,h)    -> { a[0]=a[0]*31+h; a[1]++; },
    (a1, a2) -> { a1[0]=a1[0]*iPow(31,a2[1])+a2[0]; a1[1]+=a2[1]; });
hashCode = iPow(31,result[1])+result[0];

if(hashCode != expected)
    throw new AssertionError();

// random access lists allow a better solution:
hashCode = IntStream.range(0, list.size()).parallel()
    .map(ix -> Objects.hashCode(list.get(ix))*iPow(31, list.size()-ix-1))
    .sum() + iPow(31, list.size());

if(hashCode != expected)
    throw new AssertionError();
Holger
  • Awesome!!! I like the trick of creating an `Array` to store information in the `Collector` – Roland Sep 09 '16 at 07:20
  • @Roland: the array is actually just a work-around for the absence of a `pair` or `tuple` type. It's the general concept of collectors to have a mutable container type. Even the built-in collectors use single-element arrays when they need a mutable `int` or `long` container. – Holger Sep 09 '16 at 09:20
  • @Holger In the non-autoboxing version, replacing `Objects.hashCode(o)` with just `o` results in a substantial speed increase. This is safe because you've already called `mapToInt(Objects::hashCode)` – Clement Cherlin Jan 11 '23 at 17:09
  • @ClementCherlin yes, you are right. Calling `Objects.hashCode(…)` at this point subverts the entire idea of avoiding auto-boxing with this alternative. I suppose this slipped in when deriving this alternative from the `Collector` solution. Fixed. – Holger Jan 11 '23 at 17:20
  • @Holger Another interesting optimization: In iPow, if you remove the `base` parameter and insert the two lines `int base = 31; exp &= 0x7FFFFFF;` at the beginning, the calculation works for streams of unlimited size. Otherwise, iPow breaks when the exponent exceeds Integer.MAX_VALUE and wraps around. The optimization works because `(31 ** (2 ** 27)) modulo (2 ** 32) = 1` – Clement Cherlin Jan 11 '23 at 18:33
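
A minimal sketch of the overflow-safe variant described in the comment above (the method name iPow31 is just for illustration):

// base fixed to 31, exponent reduced mod 2^27; since 31^(2^27) mod 2^32 == 1,
// the result stays correct even if the element counter wraps around the int range
private static int iPow31(int exp) {
    exp &= 0x7FFFFFF;   // 0x7FFFFFF == 2^27 - 1
    int base = 31, result = 1;
    for(; exp > 0; exp >>= 1, base *= base)
        if((exp & 1) != 0) result *= base;
    return result;
}
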
3

As a first approach, I would use the collect-to-a-list solution as long as you don't have performance concerns. That way you avoid reinventing the wheel, you benefit automatically if the hash algorithm ever changes, and you are also safe if the stream is parallelized (even if I'm not sure that's a real concern).
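
For instance, a minimal sketch of that approach, reusing the example list from the question:

List<Object> list = Arrays.asList(1, null, new Object(), 4, 5, 6);

// delegates to List.hashCode(), so the result matches the usual algorithm
// and stays correct whether the stream is sequential or parallel
int hash = list.stream().collect(Collectors.toList()).hashCode();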

The way I would implement it varies depending on how and when you need to compare your different data structures (let's call the type Foo).

If you do it manually and sparsely, a simple static function may be enough:

public static int computeHash(Foo origin, Collection<Function<Foo, ?>> selectors) {
    return selectors.stream()
            .map(f -> f.apply(origin))
            .collect(Collectors.toList())
            .hashCode();
}

And use it like this:

if(computeHash(foo1, selectors) == computeHash(foo2, selectors)) { ... }

However, if instances of Foo are themselves stored in a Collection and you need both hashCode() and equals() (from Object) to be implemented, I would wrap them inside a FooEqualable:

public final class FooEqualable {
    private final Foo origin;
    private final Collection<Function<Foo, ?>> selectors;

    public FooEqualable(Foo origin, Collection<Function<Foo, ?>> selectors) {
        this.origin = origin;
        this.selectors = selectors;
    }

    @Override
    public int hashCode() {
        return selectors.stream()
                .map(f -> f.apply(origin))
                .collect(Collectors.toList())
                .hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof FooEqualable) {
            FooEqualable that = (FooEqualable) obj;

            Object[] a1 = selectors.stream().map(f -> f.apply(this.origin)).toArray();
            Object[] a2 = selectors.stream().map(f -> f.apply(that.origin)).toArray();

            return Arrays.equals(a1, a2);
        }
        return false;
    }
}

I'm fully aware that this solution isn't optimized (performance-wise) if multiple calls to hashCode() and equals() are made, but I tend not to optimize unless it becomes a concern.

Spotted
2

Holger wrote the right solution; if you want a simple way of doing it, there are two additional possibilities:

1. collect to List and call hashCode()

Stream<? extends Object> stream;
int hashCode = stream.collect(Collectors.toList()).hashCode();

2. use Stream.iterator()

Stream<? extends Object> stream;
Iterator<? extends Object> iter = stream.iterator();
int hashCode = 1;
while(iter.hasNext()) {
  hashCode = 31 * hashCode + Objects.hashCode(iter.next());
}

Just as a reminder, the algorithm that List.hashCode() uses:

int hashCode = 1;
for (E e : list)
  hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
Roland
0

The easiest and shortest way I found was to implement a Collector using Collectors.reducing:

/**
 * Creates a new Collector that collects the hash code of the elements.
 * @param <T> the type of the input elements
 * @return the hash code
 * @see Arrays#hashCode(java.lang.Object[])
 * @see AbstractList#hashCode()
 */
public static <T> Collector<T, ?, Integer> toHashCode() {
    return Collectors.reducing(1, Objects::hashCode, (i, j) -> 31 * i + j);
}

@Test
public void testHashCode() {
    List<?> list = Arrays.asList(Math.PI, 42, "stackoverflow.com");
    int expected = list.hashCode();
    int actual = list.stream().collect(StreamUtils.toHashCode());
    assertEquals(expected, actual);
}
simon04
  • This is incorrect because 1 is not an identity for this reduction. It will break if the processing is parallelized. – Roland Jun 11 '20 at 11:13
-1

If parallel processing is not a hard requirement, the answer is very simple. You don't need to do anything special to ensure correct ordering for a List, or any other type that has an ORDERED Spliterator such as LinkedHashSet.

From the JavaDoc for Spliterator.ORDERED:

API Note:
Encounter order is guaranteed to be ascending index order for any List. But no order is guaranteed for hash-based collections such as HashSet. Clients of a Spliterator that reports ORDERED are expected to preserve ordering constraints in non-commutative parallel computations.

Since the encounter order is already guaranteed, all you need to do is ensure your Stream is sequential. This requires no effort in most cases, since Collection.stream() returns a sequential Stream. Thus, this code is already correct:

List<Object> list = Arrays.asList(1, null, new Object(), 4, 5, 6);
int hashCode = list.stream()
    .mapToInt(Objects::hashCode)
    .reduce(1, (a, b) -> 31 * a + b);
Clement Cherlin
  • Stream processing may be parallelized by the jvm, therefore it is a *hard* requirement and your solution is buggy. – Roland Oct 11 '22 at 11:54
  • @Roland "Stream processing may be parallelized by the jvm" is incorrect. Stream processing is not parallelized for sequential streams. The JavaDoc for `List.stream()` states "Returns: a sequential `Stream` over the elements in this collection". If you want a parallel stream you must call `List.parallelStream()`, which returns "a possibly parallel `Stream` over the elements in this collection". – Clement Cherlin Nov 10 '22 at 21:10
  • Point taken. Still, your solution is not good, since reduce should receive an identity value as its first argument, which is not the case here. If you change the code to use a parallel stream, it will break. This is brittle code. – Roland Nov 22 '22 at 14:34