0

I have several collections of strings (~100k elements). They are initialized from the same network source:

List<String> source = Arrays.asList("cat Tom", "dog Woof", "bird Carry", "fish Silent", "mice Dr Black");
List<String> list1 = new ArrayList<>(source);
List<String> list2 = new ArrayList<>(source); 
List<String> list3 = new ArrayList<>(source);

During their life they could be modified independently:

list1.add("raccoon George");
list2.set(2, "bird Frank"); // Different bird
list3.remove(2); // No birds!

These collections live in different JVMs, but have the same (shared) seed:

String seed = UUID.randomUUID().toString();

I need some sort/shuffle method (or Comparator) to randomize these collections:

magicSort(list1, seed); // JVM #1
magicSort(list2, seed); // JVM #2
magicSort(list3, seed); // JVM #3

To receive consistent/repeatable output like this:

[mice Dr Black, bird Carry, raccoon George, cat Tom, dog Woof, fish Silent]
[mice Dr Black, bird Frank, cat Tom, dog Woof, fish Silent]
[mice Dr Black, cat Tom, dog Woof, fish Silent]

This doesn't work:

  • Collections.shuffle() from this answer - because it does not provide affinity for items with the same prefix;
  • keep randomized order in Map<String, UUID> - because of (a) collection's dynamic nature and (b) distributed setup.

Any help will be appreciated

Edit #1

This doesn't work:

  • sha1, md5 and similar hash algorithms - because they do not provide affinity for items with the same prefix - bird Carry and bird Frank could be placed far from each other.

Edit #2

Additional condition: items with the same prefix should be grouped together (affinity), e.g. all birds are placed between mices and raccoons.

ursa
  • 4,404
  • 1
  • 24
  • 38
  • You can't randomize two things and expect the same result, right? It wouldn't be truly random then. It might happen, but should be just "causality". – x80486 Nov 18 '19 at 18:16
  • calculating sha1(listValue + seed) and sorting naturally (with simple string comparison) based on this value would work? – David Szalai Nov 18 '19 at 18:23
  • or substitute sha1 with any hash that fits your needs (e.g. consistent hashing algo, etc...) – David Szalai Nov 18 '19 at 18:24
  • thanks for a hashing idea. it looks promising. I'll try this direction. – ursa Nov 18 '19 at 20:00
  • Why does `Collections.shuffle()` from [this answer](https://stackoverflow.com/a/4229001/2078908) not work? It does precisely what you want. “because of its nature” is not a meaningful reason. – Holger Nov 20 '19 at 09:44
  • It does not provide affinity for elements with identical prefixes and work exactly the same as sorting based on hash algorithms – ursa Nov 20 '19 at 10:08
  • So you don’t want actual randomness, contrary to what the question’s title and 99% of your question suggest, but rather a different thing, which you only mention briefly in an addendum, buried in the explanation why a particular solution doesn’t work. Maybe it’s just me, but I think your should *start* your question with what you actually want. Then, there’s less need to describe why certain approaches do not fulfill the requirement. – Holger Nov 22 '19 at 10:12
  • you are right, indeed. while I wrote the question the problem becomes more clear. updated description. btw, the term "actual randomness" depends on exact problem. so it would be incorrect to say I don't need it. – ursa Nov 22 '19 at 11:06

1 Answers1

1

Many thanks for hashing hint in comments!

I've ended up with re-mapping characters order through pseudo-random hashing (similar to Fisher–Yates shuffle). The rest code was copy-pasted from String.compareTo:

private void magicSort(List<String> list, String seed) {
    list.sort(new ShuffleComparator(seed));
}

public class ShuffleComparator implements Comparator<String> {
    private static final long MAGIC = 0x5DEECE66DL;
    private final String seed;

    public ShuffleComparator(String seed) {
        this.seed = seed;
    }

    @Override
    public int compare(String s1, String s2) {
        int len1 = s1.length();
        int len2 = s2.length();
        int lim = Math.min(len1, len2);

        int seedLen = seed.length();
        long random = seed.hashCode();

        for (int k = 0; k < lim; k++) {
            random = random * MAGIC + seed.charAt(k % seedLen);
            char c1 = s1.charAt(k);
            char c2 = s2.charAt(k);
            if (c1 != c2) {
                return ((random % (c1 + 0xFFFF)) & 0xFFFF) -
                    ((random % (c2 + 0xFFFF)) & 0xFFFF) > 0 ? 1 : -1;
            }
        }
        return (random > 0 ? 1 : -1) * (len1 - len2);
    }
}

Result:

  • You can keep collections shuffle-sorted at all times
  • Items with identical prefix are grouped together (affinity)
  • The only resource (string seed) is shared between JVMs
  • Negligible impact on performance
  • Zero GC

Examples:

seed: b85a9119-3a98-4491-b503-fc2cfdc344f3
[raccoon George, bird Carry, mice Dr Black, fish Silent, cat Tom, dog Woof]
[bird Frank, mice Dr Black, fish Silent, cat Tom, dog Woof]
[mice Dr Black, fish Silent, cat Tom, dog Woof]

seed: d5378421-d0ea-4b17-80e9-1de6a163bf38
[bird Carry, dog Woof, fish Silent, raccoon George, mice Dr Black, cat Tom]
[bird Frank, dog Woof, fish Silent, mice Dr Black, cat Tom]
[dog Woof, fish Silent, mice Dr Black, cat Tom]

seed: a5f94f60-1207-4f83-9723-bbab5e3b8075
[cat Tom, raccoon George, mice Dr Black, bird Carry, dog Woof, fish Silent]
[cat Tom, mice Dr Black, bird Frank, dog Woof, fish Silent]
[cat Tom, mice Dr Black, dog Woof, fish Silent]

seed: 4b77a61f-c639-42fc-96b2-e1add9f359e9
[bird Carry, raccoon George, cat Tom, mice Dr Black, dog Woof, fish Silent]
[bird Frank, cat Tom, mice Dr Black, dog Woof, fish Silent]
[cat Tom, mice Dr Black, dog Woof, fish Silent]

seed: 2dd78e21-d4ad-4e8a-b139-5d35fe2b0481
[dog Woof, fish Silent, raccoon George, mice Dr Black, bird Carry, cat Tom]
[dog Woof, fish Silent, mice Dr Black, bird Frank, cat Tom]
[dog Woof, fish Silent, mice Dr Black, cat Tom]

Edit #1

For tests I've also implemented hash-based comparator, so would publish it here. Just in case if you don't need affinity (like in my case), but need pure shuffle:

private void magicSort(List<String> list, String seed) {
    list.sort(new HashComparator(seed, "SHA-1"));
}

public class HashComparator implements Comparator<String> {
    private static final long MAGIC = 0x5DEECE66DL;
    private final ThreadLocal<MessageDigest> messageDigest;
    private final byte[] seed;
    private final int seedHash;

    public HashComparator(String seed, String algorithm) {
        this.seed = seed.getBytes(ISO_8859_1);
        this.seedHash = seed.hashCode();
        this.messageDigest = ThreadLocal.withInitial(() -> {
            try {
                return MessageDigest.getInstance(algorithm);
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e.getMessage(), e);
            }
        });
    }

    @Override
    public int compare(String s1, String s2) {
        if (s1.equals(s2)) {
            return 0;
        }
        int diff = getSignature(s1).compareTo(getSignature(s2));
        if (diff != 0) {
            return diff;
        }
        long random = seedHash;
        random = random * MAGIC + s1.hashCode();
        random = random * MAGIC + s2.hashCode();
        return ((random & 1) == 0 ? 1 : -1) * s1.compareTo(s2);

    }

    private ByteBuffer getSignature(String s) {
        MessageDigest digest = messageDigest.get();
        digest.reset();
        digest.update(seed);
        for (int i = 0, size = s.length(); i < size; i++) {
            char c = s.charAt(i);
            digest.update((byte) (c >> 8));
            digest.update((byte) c);
        }
        return ByteBuffer.wrap(digest.digest());
    }
}

Result:

  • You can keep collections shuffle-sorted at all times
  • Pure shuffle (no affinity for items with identical prefix)
  • The only resource (string seed) is shared between JVMs

Examples:

8d465fde-9310-4b6c-9709-5cc395deb45e
[raccoon George, mice Dr Black, dog Woof, fish Silent, cat Tom, bird Carry]
[bird Frank, mice Dr Black, dog Woof, fish Silent, cat Tom]
[mice Dr Black, dog Woof, fish Silent, cat Tom]

6daa90dd-f7e7-4615-a45c-fefb0de10c20
[bird Carry, mice Dr Black, fish Silent, cat Tom, dog Woof, raccoon George]
[mice Dr Black, bird Frank, fish Silent, cat Tom, dog Woof]
[mice Dr Black, fish Silent, cat Tom, dog Woof]

9f30b259-c8d1-41f5-8221-40b17cb04260
[cat Tom, mice Dr Black, raccoon George, dog Woof, fish Silent, bird Carry]
[bird Frank, cat Tom, mice Dr Black, dog Woof, fish Silent]
[cat Tom, mice Dr Black, dog Woof, fish Silent]

94b6710f-3232-453d-8009-6d81658b2cca
[dog Woof, fish Silent, bird Carry, cat Tom, mice Dr Black, raccoon George]
[dog Woof, fish Silent, cat Tom, mice Dr Black, bird Frank]
[dog Woof, fish Silent, cat Tom, mice Dr Black]
ursa
  • 4,404
  • 1
  • 24
  • 38