1

I have stream of files, and a method which takes two files as an argument, and return if they have same content or not.

I want to reduce this stream of files to a set (or map) of sets grouping all the files with identical content.

I know this is possible by refactoring the compare method to take one file, returning a hash and then grouping the stream by the hash returned by the function given to the collector. But what is the cleanest way to achieve this with a compare method, which takes two files and returns a boolean?

For clarity, here is an example of the obvious way with the one argument function solution

file.stream().collect(groupingBy(f -> Utility.getHash(f))

But in my case I have the following method which I want to utilize in the partitioning process

public boolean isFileSame(File f, File f2) {
    return Files.equal(f, f2)
}
Holger
  • 285,553
  • 42
  • 434
  • 765
Tuomas Toivonen
  • 21,690
  • 47
  • 129
  • 225
  • Possible duplicate of [How to partition a list by predicate using java8?](https://stackoverflow.com/questions/36678571/how-to-partition-a-list-by-predicate-using-java8) – Oleg Sep 30 '17 at 08:03
  • Do you even bother to read the post? This is not possible duplicate, I'm clearly asking if the partitioning is possible with function, which takes two elements and returns a boolean. – Tuomas Toivonen Sep 30 '17 at 12:01
  • Why are people downvoting this? – Wim Deblauwe Sep 30 '17 at 12:05
  • 1
    Yes I read your post, did you even bother to read the duplicate? The only difference is that you ask about BiPredicate which only creates more problems. @WimDeblauwe Don't know about the other person I downvoted because it's a duplicate. – Oleg Sep 30 '17 at 14:43
  • Yes, and in my post I described the accepted answer of that post as one possible solution for the problem. But this solution isn't the answer for my question, as I'm exactly asking about *BiPredicate* to do the partitioning. I want to know if the API even supports this, or is this *Function* only way to go. I.e if I have rather complicated method implementing *BiPredicate*, I'm out of luck regarding using it to partition the list in this context – Tuomas Toivonen Sep 30 '17 at 16:41

2 Answers2

2

If all you have is a BiPredicate without an associated hash function that would allow an efficient lookup, you can only use a linear probing. There is no builtin collector doing that, but a custom collector working close to the original groupingBy collector can be implemented like

public static <T> Collector<T,?,Map<T,Set<T>>> groupingBy(BiPredicate<T,T> p) {
    return Collector.of(HashMap::new,
        (map,t) -> {
            for(Map.Entry<T,Set<T>> e: map.entrySet())
                if(p.test(t, e.getKey())) {
                    e.getValue().add(t);
                    return;
                }
            map.computeIfAbsent(t, x->new HashSet<>()).add(t);
        }, (m1,m2) -> {
            if(m1.isEmpty()) return m2;
            m2.forEach((t,set) -> {
                for(Map.Entry<T,Set<T>> e: m1.entrySet())
                    if(p.test(t, e.getKey())) {
                        e.getValue().addAll(set);
                        return;
                    }
                m1.put(t, set);
            });
            return m1;
        }
    );

but, of course, the more resulting groups you have, the worse the performance will be.

For your specific task, it will be much more efficient to use

public static ByteBuffer readUnchecked(Path p) {
    try {
        return ByteBuffer.wrap(Files.readAllBytes(p));
    } catch(IOException ex) {
        throw new UncheckedIOException(ex);
    }
}

and

Set<Set<Path>> groupsByContents = your stream of Path instances
    .collect(Collectors.collectingAndThen(
        Collectors.groupingBy(YourClass::readUnchecked, Collectors.toSet()),
        map -> new HashSet<>(map.values())));

which will group the files by contents and does hashing implicitly. Keep in mind that equal hash does not imply equal contents but this solution does already take care of this. The finishing function map -> new HashSet<>(map.values()) ensures that the resulting collection does not keep the file’s contents in memory after the operation.

Holger
  • 285,553
  • 42
  • 434
  • 765
1

A possible solution by the helper class Wrapper:

files.stream()
    .collect(groupingBy(f -> Wrapper.of(f, Utility::getHash, Files::equals)))
    .keySet().stream().map(Wrapper::value).collect(toList());

If you won't to use the Utility.getHash for some reason, try to use File.length() for the hash function. The Wrapper provides a general solution to customize the hash/equals function for any type (e.g. array). it's useful to keep it into your tool kit. Here is the sample implementation for the Wrapper:

public class Wrapper<T> {
    private final T value;
    private final ToIntFunction<? super T> hashFunction;
    private final BiFunction<? super T, ? super T, Boolean> equalsFunction;
    private int hashCode;

    private Wrapper(T value, ToIntFunction<? super T> hashFunction, BiFunction<? super T, ? super T, Boolean> equalsFunction) {
        this.value = value;
        this.hashFunction = hashFunction;
        this.equalsFunction = equalsFunction;
    }
    public static <T> Wrapper<T> of(T value, ToIntFunction<? super T> hashFunction, BiFunction<? super T, ? super T, Boolean> equalsFunction) {
        return new Wrapper<>(value, hashFunction, equalsFunction);
    }
    public T value() {
        return value;
    }
    @Override
    public int hashCode() {
        if (hashCode == 0) {
            hashCode = value == null ? 0 : hashFunction.applyAsInt(value);
        }

        return hashCode;
    }
    @Override
    public boolean equals(Object obj) {
        return (obj == this) || (obj instanceof Wrapper && equalsFunction.apply(((Wrapper<T>) obj).value, value));
    }
    // TODO ...
}
123-xyz
  • 619
  • 4
  • 5