
Motivation

I've just rewritten some 30 mostly trivial parsers and I need the new versions to behave exactly like the old ones. Therefore, I stored their example input files and a signature of the outputs produced by the old parsers, for comparison with the new ones. This signature contains the counts of successfully parsed items, sums of some hash codes, and up to 10 pseudo-randomly chosen items.

I thought this was a good idea, as the equality of the hash code sums sort of guarantees that the outputs are exactly the same, and the samples let me see what's wrong. I'm only storing samples, as otherwise the signature would get really big.
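
For illustration, the stored signature might look roughly like this (a hypothetical sketch; the class and field names are mine, not taken from the actual code):

import java.util.List;

// Hypothetical shape of the per-parser signature used for comparison.
class ParseSignature {
    int itemCount;        // count of successfully parsed items
    long hashSum;         // sum of some hash codes of the items (overflow is fine)
    List<String> samples; // up to 10 consistently chosen items, see selectSome below
}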

The problem

Basically, given an unordered collection of strings, I want to get a list of up to 10 of them, so that when the collection changes a bit, I still get mostly the same samples in the same positions (the input is unordered, but the output is a list). This should also work when something is missing, so ideas like taking the 100th smallest element don't work.

ImmutableList<String> selectSome(Collection<String> list) {
    if (list.isEmpty()) return ImmutableList.of();
    return IntStream.range(1, 20)
        .mapToObj(seed -> selectOne(list, seed))
        .distinct()
        .limit(10)
        .collect(ImmutableList.toImmutableList());
}

So I start with the numbers 1 to 19 (so that after distinct I still most probably have my 10 samples), call a stateless deterministic function selectOne (defined below) that returns one string which is maximal according to some funny criterion, remove duplicates, limit the result to 10 and collect it using Guava. All steps should IMHO be deterministic and "ordered", but I may be overlooking something. The other possibility would be that all 30 of my new parsers are wrong, but this is improbable given that the hashes are correct. Moreover, the parsed results look correct.

String selectOne(Collection<String> list, int seed) {
    // some boring mixing, definitely deterministic
    for (int i=0; i<10; ++i) {
        seed *= 123456789;
        seed = Integer.rotateLeft(seed, 16);
    }
    // ensure seed is odd
    seed = 2*seed + 1;

    // first element is the candidate result
    String result = list.iterator().next();
    // the value is the hash code multiplied by the seed
    // overflow is fine
    int value = seed * result.hashCode();

    // looking for s maximizing seed * s.hashCode()
    for (final String s : list) {
        final int v = seed * s.hashCode();
        if (v < value) continue;
        // tiebreaking by taking the bigger or smaller s
        // this is needed for determinism
        if (s.compareTo(result) * seed < 0) continue;
        result = s;
        value = v;
    }
    return result;
}

This sampling doesn't seem to work. I get a sequence like

"9224000", "9225000", "4165000", "9200000", "7923000", "8806000", ...

with one old parser and

"9224000", "9225000", "4165000", "3030000", "1731000", "8806000", ...

with a new one. Both results are perfectly repeatable. For other parsers, it looks very similar.

Is my usage of streams wrong? Do I have to add .sequential() or the like?

Update

Sorting the input collection has solved the problem:

ImmutableList<String> selectSome(Collection<String> collection) {
    final List<String> list = Lists.newArrayList(collection);
    Collections.sort(list);
    .... as before
}
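
Spelled out, the workaround is just a sorted copy fed into the same pipeline as above:

ImmutableList<String> selectSome(Collection<String> collection) {
    final List<String> list = Lists.newArrayList(collection);
    Collections.sort(list);
    if (list.isEmpty()) return ImmutableList.of();
    return IntStream.range(1, 20)
        .mapToObj(seed -> selectOne(list, seed))
        .distinct()
        .limit(10)
        .collect(ImmutableList.toImmutableList());
}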

What's still missing is an explanation why.

The explanation

As stated in the answers, my tiebreaker was an all-breaker, because I forgot to check whether there actually was a tie. Something like

if (v==value && s.compareTo(result) < 0) continue;

works fine.
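
For reference, here is the whole loop with the corrected tiebreaker; apart from that one line it is identical to the code above:

String selectOne(Collection<String> list, int seed) {
    // same mixing as above
    for (int i = 0; i < 10; ++i) {
        seed *= 123456789;
        seed = Integer.rotateLeft(seed, 16);
    }
    // ensure seed is odd
    seed = 2 * seed + 1;

    String result = list.iterator().next();
    int value = seed * result.hashCode();

    for (final String s : list) {
        final int v = seed * s.hashCode();
        if (v < value) continue;
        // break ties only: when v > value, s always wins
        if (v == value && s.compareTo(result) < 0) continue;
        result = s;
        value = v;
    }
    return result;
}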

I hope that my confused question may at least be useful for someone looking for "consistent sampling". It wasn't really Java 8 related.

I should've used Guava's ComparisonChain or, better, a Java 8 arg max to avoid my stupid mistake:

String selectOne(Collection<String> list, int seed) {
    .... as before
    final int multiplier = 2*seed + 1;
    return list.stream()
          .max(Comparator.<String>comparingInt(s -> multiplier * s.hashCode())
              .thenComparing(s -> s)) // <--- FOOL-PROOF TIEBREAKER
          .get();
}
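
A quick sanity check, mirroring the test case from the answer below: with the fool-proof tiebreaker, both encounter orders must yield the same string.

// With the fixed selectOne, both calls print the same value,
// whatever that value turns out to be.
System.out.println(selectOne(Arrays.asList("1", "2"), 4));
System.out.println(selectOne(Arrays.asList("2", "1"), 4));
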
maaartinus
    "Basically, given **an unordered collection** of strings, I want to get a list of up to 10 of them, so that when the collection changes a bit, I still get about the same samples in **about the same positions**. " What do "positions" mean in an unordered collection? – Andy Turner Aug 25 '17 at 07:01
    Also, you posted the new code, said it's supposed to work the same way as the old one, but didn't post the old one. – JB Nizet Aug 25 '17 at 07:02
  • @AndyTurner Bad formulation, hopefully fixed. The input is unordered, the output is a list. When one output is `a, b, c, d` and the other is `a, x, c, d`, that's much better for (visual) comparison than `a, c, d, x`. – maaartinus Aug 25 '17 at 07:07
    I don't understand. If the input collection (`list`) is unordered, then how is `list.iterator().next()` supposed to be deterministic? – shmosel Aug 25 '17 at 07:11
  • @JBNizet I see, I was very unclear... What I posted is the part common to the old and new code. What has changed are the parsers producing some items, and their item numbers are the inputs of `selectSome`. My problem is either caused by the code I posted here, or by some inexplicable error somewhere else... I wouldn't have asked this terribly long question if I hadn't triple checked everything. – maaartinus Aug 25 '17 at 07:11
  • @shmosel It is not. But I'm searching for a single string which maximizes a function, so the iteration order does not matter, nor does the starting element (provided by `list.iterator().next()`). It would only matter if there were ties, but my tiebreaker should take care of them. – maaartinus Aug 25 '17 at 07:13
  • @maaartinus I've read your question four times and I still do not understand how these parsers should work or what they are used for. – Krzysztof Cichocki Aug 25 '17 at 07:14
    @maaartinus I'll have to agree with the previous comment - it's probably a great question (I've actually read and seriously like some of your answers), so I bet there is something really good here, but the overall question is really unclear IMO, I've read it like 10 times now... – Eugene Aug 25 '17 at 07:16
    When you say that "it is entirely repeatable", what I take from this is that the process you are employing here *is* deterministic, including the streams. It's the process generating the `Collection` which isn't. – Andy Turner Aug 25 '17 at 07:16
  • @KrzysztofCichocki They're parsing something. They have become very ugly in the meantime, so I decided to rewrite them... and what I'm describing in this question is some code used for testing that they work the same as the old versions. I was hoping I could explain the motivation, but maybe I should have left it out. – maaartinus Aug 25 '17 at 07:17
    Why do you need this hashed ordering? Have you tried sorting the input naturally? – shmosel Aug 25 '17 at 07:19
  • @Eugene I recall and like your answers, too. I wanted to simplify the explanation of what I'm doing by giving the motivation, but it probably was not a good idea, as the motivation seems to be the harder part. :D – maaartinus Aug 25 '17 at 07:21
  • @shmosel Sure, I could sort the sequence... I can't do things like `take the 100th smallest element`, but sorting can't do any harm... and maybe it helps. – maaartinus Aug 25 '17 at 07:23
    @maaartinus Can you please post the code for old parsers? It would be easier to spot the possible problem having the "old" code to compare. – Krzysztof Cichocki Aug 25 '17 at 07:23
    Can we assume the input ordering is consistent between runs, but not between the old and new parsers? – shmosel Aug 25 '17 at 07:27
  • @KrzysztofCichocki See my comment above answering JBNizet. This is neither the new nor the old code; this is the common part used for testing that the new code works just like the old one. – maaartinus Aug 25 '17 at 07:28
    @shmosel Yes. And you were right... **there's something wrong with the posted code** as **sorting the input changes the outcome**. No idea how it's possible, but it's a big step forwards. – maaartinus Aug 25 '17 at 07:30
  • @shmosel Yes. More precisely, there are still some differences, but they're rare and obviously caused by minor bugs in some of the parsers (that's why I wrote this code). – maaartinus Aug 25 '17 at 07:43
  • I can't see how the maximizing loop in `selectOne` should give results independent of the order of the input. The overflow is in fact a problem, I think. – Calculator Aug 25 '17 at 07:56
  • @Calculator Given that it indeed doesn't work, you may be right. My reasoning was as follows: 1. With or without overflow, some function `String -> int` gets computed (overflow is nothing but a modulo operation). 2. The returned string is the one maximizing this function. 3. The function only has collisions when `hashCode` does, so there are only a few of them. 4. A collision is only relevant when it happens for the maximizing string, this way producing a tie. Ties are rare and can't lead to the observed behavior. 5. Ties get broken deterministically, so the output should be order-independent. – maaartinus Aug 25 '17 at 08:11
    I don't know if it was intentional, but you dodged a bullet by using odd seeds. Even seeds can break symmetry for `s.compareTo(result) * seed`. – shmosel Aug 25 '17 at 09:24
  • @shmosel Even multipliers (seeds) are terrible in general, e.g., for `.comparingInt(s -> multiplier * s.hashCode())`, but also for the expression you mentioned. I removed the multiplier from the tiebreaker as ties are rare and therefore the decision doesn't matter. I could've used `s.compareTo(result) ^ seed` in order to avoid problems, namely overflow (`compareTo` needn't return just -1, 0, 1). +++ Even multipliers in anything hashing-like are bad; the more trailing zeros they have, the worse, zero being the worst. Odd multipliers are all fine. – maaartinus Aug 25 '17 at 09:33
  • Why didn’t you use `java.util.Random`, which does already provide a working deterministic pseudo random number generator? Your code does not look better than the algorithm of `java.util.Random`… – Holger Oct 16 '17 at 12:12
  • @Holger You mean instead of my multiply-rotate algorithm? I probably wanted to avoid repetitions, i.e., get different multipliers from a range of `seed`s. But this doesn't really work because of making the multiplier odd. Anyway, repetitions in such a short sequence are improbable both with my algorithm and with `j.u.R.`. +++ Concerning better-looking, my algorithm is different, as it computes a bijective function of the seed. I'd bet that (unlike `j.u.R.`) it has no easily detectable patterns, but this doesn't matter either. I guess I'll switch to `j.u.R.` some day. – maaartinus Oct 16 '17 at 18:57
  • Well, you are feeding the numbers `1…19` as a seed to your function and require your `selectOne` to do something useful with that. I’d rather instantiate a *single* `j.u.R` instance with a fixed seed (`0xcafebabe` or `0xdeadc0de` or the time millis when writing the code or a true random number picked then) and use that instance to pick ten distinct elements from the source list. That eliminates the need for the `selectOne` method completely. It’s reproducible when using the same seed, but very unlikely to have a detectable pattern among the ten elements. – Holger Oct 17 '17 at 06:29
  • @Holger I can generate a pseudorandom sequence, use it instead of `1..19` and drop the multiply-rotate loop. I *can not* drop `selectOne`, as I want more than just reproducibility: when some element was chosen in a previous run and the computation has changed, resulting in some elements being changed, removed and/or inserted, but the chosen element stayed the same, I want it to be selected again with high probability. By storing a couple of such elements, the chances are good that some elements are present in both collections, so I can compare them. – maaartinus Oct 18 '17 at 00:12
  • @Holger My explanation is surely confusing. Imagine you want to see how a group of people (a school class) changes over time without storing the information of all of them. Let's assume each of them has a unique SSN. When I store the info of the ones with the smallest and the biggest SSNs, then the chances are good that at least one of the now chosen persons will be present in the class the next year and can be found using the same criterion. What I did is just a generalization to a bigger selection from a bigger set (instead of smallest and biggest I use arg max `seed * s.hashCode()`). – maaartinus Oct 18 '17 at 00:24
    If I understood correctly, you could just use the `j.u.R` to generate a random string and pick the set’s string with the smallest distance to it. Unlike the hash code, for which collisions are unavoidable, there can be at most two different strings being closest to the reference. You can use `nextBoolean()` to decide whether to use the smaller or bigger one in that case. Or keep on always using the bigger one like in your tie breaking comparator. – Holger Oct 18 '17 at 06:38

2 Answers


The mistake is that your tiebreaker is not in fact breaking a tie. We should be selecting `s` when `v > value`, but instead we're falling back to `compareTo()`. This breaks comparison symmetry, making your algorithm dependent on encounter order.

As a bonus, here's a simple test case to reproduce the bug:

System.out.println(selectOne(Arrays.asList("1", "2"), 4));  // 1
System.out.println(selectOne(Arrays.asList("2", "1"), 4));  // 2
shmosel

In `selectOne` you just want to select the String `s` with the maximal `value = seed * s.hashCode()` for the given seed.

The problem is with the "tiebreaking" line: `if (s.compareTo(result) * seed < 0) continue;`

It is not order-independent: for a different order of the elements it skips different elements from being checked, and thus a change in the order of the elements changes the result.

Remove the tiebreaking `if` and the result will be insensitive to the order of elements in the input list.
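
A minimal sketch of the loop with the tiebreaking line simply removed, as suggested; note that on a genuine tie (equal `seed * s.hashCode()`) the element encountered later then wins:

for (final String s : list) {
    final int v = seed * s.hashCode();
    if (v < value) continue; // the tiebreaking line is gone
    result = s;
    value = v;
}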

Krzysztof Cichocki