Project Euler 35: HashSet gives incorrect results

Question

I wrote a Java program for Project Euler #35: Circular Primes:

The number, 197, is called a circular prime because all rotations of the digits: 197, 971, and 719, are themselves prime.

There are thirteen such primes below 100: 2, 3, 5, 7, 11, 13, 17, 31, 37, 71, 73, 79, and 97.

How many circular primes are there below one million?

My code compiles and runs fine, however, it gives different results depending on the data structure I use.

The algorithm works like this:

Get pre-calculated primes. This is the call to MathUtils.getPrimes(1000000), which gets all primes equal to or less than a million. I store this in another Set because it is implemented by returning a subset, and performance is ridiculously terrible unless I copy the primes into their own data structure.
While the set of primes is not empty, get the next prime.
Get all rotations of that prime. E.g. 197, 971, 719. These rotations need not be prime themselves, because I need to verify them anyway.
If the prime set contains all of the rotations, add the count of the rotations to a running total.
Remove all of the rotations from the set of primes if they exist.

I noticed two weird things about this code. If I use a TreeSet to store the primes the performance is very fast and it produces correct results:

Answer: 55
Time: 76ms

If I switch to a HashSet the performance is a lot worse and the results are incorrect.

Answer: 50
Time: 2527ms

I put code at the top to double check that the sets both contain the same values before the code runs, and they always do.

Why does using the HashSet produce incorrect results as compared to the TreeSet? There are no nulls or other weird values, only positive, distinct Integer instances. The sets start out containing the exact same data. The algorithm is the same, because it is the exact same code. Due to ordering differences between the implementations and the size of the data, it is nigh impossible to compare the state of the algorithms as they are running. If I reduce the input size, the two produce the same results up to 100,000.
Why does the TreeSet perform so much faster than the HashSet, when it has to perform all of those removals and tree rotations that are not applicable to a HashSet? Looking at the code for HashMap which backs HashSet, there is no resizing or shuffling of contents going on other than that localized to a specific bin. Furthermore, the primes are fairly evenly distributed. While there is no easy way to validate, I would expect that there would not be the worst-case performance issue of many items occupying a small number of bins in the table.

Code follows. You can switch the Set implementation by swapping the variable names at the top.

import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.NavigableSet;
import java.util.TreeSet;

public class Problem_0035 {

  public static void main(String[] args) {
    // Swap these two variable names to compare.
    Collection<Integer> primes = new TreeSet<>(sieve(1000000));
    Collection<Integer> primes2 = new HashSet<>(sieve(1000000));
    if (!primes.containsAll(primes2) || !primes2.containsAll(primes)
        || (primes.size() != primes2.size())) {
      System.out.println("Primes are not the same!");
    }
    final long start = System.currentTimeMillis();
    int result = 0;
    // Keep getting a prime and checking for its rotations. Remove the primes checked.
    while (!primes.isEmpty()) {
      Integer next = primes.iterator().next();
      Collection<Integer> rotations = getRotations(next);
      if (primes.containsAll(rotations)) {
        result += rotations.size();
      }
      primes.removeAll(rotations);
    }
    System.out.println("Answer: " + result);
    // 55
    System.out.println("Time: " + (System.currentTimeMillis() - start) + "ms");
  }

  /** Enumerate all rotations of the given integer. */
  private static Collection<Integer> getRotations(Integer argValue) {
    Collection<Integer> results = new LinkedList<>();
    final int start = argValue.intValue();

    // Count the digits
    int magnitude = 1;
    for (int i = start; i > 9; i /= 10) {
      magnitude *= 10;
    }

    int current = start;
    do {
      results.add(Integer.valueOf(current));
      current = ((current % 10) * magnitude) + (current / 10);
    } while (current != start);

    return results;
  }

  /** Sieve of Eratosthenes. */
  private static Collection<Integer> sieve(int argCeiling) {
    NavigableSet<Integer> primes = new TreeSet<>();
    for (int i = 2; i <= argCeiling; ++i) {
      primes.add(Integer.valueOf(i));
    }
    for (Integer number = primes.first(); number != null; number = primes.higher(number)) {
      int n = number.intValue();
      for (int i = n * 2; i <= argCeiling; i += n) {
        primes.remove(Integer.valueOf(i));
      }
    }
    return primes;
  }

 //
 // Filter the set through this method to remove the problematic primes.
 // See answers for an explanation.
 //

 /**
   * Any prime number with a zero or five anywhere in its number cannot have prime
   * rotations, since no prime can end in five or zero. Filter those primes out.
   */
  private static Collection<Integer> filterImpossiblePrimes(Collection<Integer> in) {
    Collection<Integer> out = new TreeSet<>();
    for (Integer prime : in) {
      if (!willBeRotatedComposite(prime)) {
        out.add(prime);
      }
    }
    return out;
  }

  /** If the prime is guaranteed to be rotated to a composite, return true. */
  private static boolean willBeRotatedComposite(Integer prime) {
    int p = prime.intValue();
    boolean result = false;
    if (p > 10) {
      while (p > 0) {
        // Primes must end in 1, 3, 7, or 9. Filter out all evens and 5s.
        if ((p % 5 == 0) || (p % 2 == 0)) {
          result = true;
          break;
        }
        p /= 10;
      }
    }
    return result;
  }

}

Have a look here http://stackoverflow.com/questions/1463284/hashset-vs-treeset — Diversity, Jun 26 '15 at 22:35
@Diversity that backs up what I am saying: HashSet is faster in pretty much every case but does not guarantee order (order does not matter to this algorithm). Yet in this case, it is two orders of magnitude _slower_. Not to mention that _something_ is causing the algorithm to return different results. — , Jun 26 '15 at 22:38
Did you try to create a JUnit test case which at least ensures when implemented correctly that both implementations use the same preconditions. Additionally you ensure that you test both cases within one virtual machine. — Diversity, Jun 26 '15 at 22:42
What MathUtils are you using? It's not present in the apache commons binaries. — G. Bach, Jun 26 '15 at 22:55
@G.Bach That is an internal class I am using. That method call works by loading primes from a text file. There is no way I can fit the code and that file in here, so I am going to switch to a sieve implementation and update my code. Stand by. — , Jun 26 '15 at 23:00
(I slightly misphrased my last comment; there is a MathUtils in apache commons, it just doesn't have a getPrimes(..) method.) — G. Bach, Jun 26 '15 at 23:04
I you read the [javadoc of `TreeSet`](https://docs.oracle.com/javase/7/docs/api/java/util/TreeSet.html), you'll find `The Set interface is defined in terms of the equals operation, but a TreeSet instance performs all element comparisons using its compareTo method, so two elements that are deemed equal by this method are, from the standpoint of the set, equal`. I think it may have something to do with it. Since you have less elements with `HashSet`, it would mean that some elements were considered equal with `Integer#compareTo` but not with `Integer#equals`, which sounds weird... TBC :) — sp00m, Jun 26 '15 at 23:31
Also, I don't get 50 when using a HashMap, but 46... On JDK 7u71. — sp00m, Jun 26 '15 at 23:34
wow... why do you call sieve(1000000) twice ??? can't you just create the HashSet from the TreeSet or vice versa ? — gen-y-s, Jun 27 '15 at 06:28
One thing that would only be fair in using `HashMap/Set`: you _know_ how many elements you are going to add: do yourself a favour and specify that in the constructor invocation (this does not help behaviour after `remove`als). As to the problem, even _if_ you are generating all primes upfront, you don't need to construct a set of the all of them and test rotations. Think about the digits that define the meaning of _rotation_. — greybeard, Jun 27 '15 at 07:16
The inefficient construction does not affect the running time of the algorithm: I don't start the clock until _after_ the primes are initialized. The final version only has one set and only gets primes once. None of that matters for the specific questions I asked, anyway. — , Jun 27 '15 at 11:10
Why dont you use "LinkedHashMap" which should give you the same order as TreeMap in this case ? — gen-y-s, Jun 27 '15 at 16:24
Right off the bat you can discard any number (except single digit numbers), which contains any digit which is not 1,3,7,9. This will hugely improve your algorithm running time, and take care of the "leading zero" problem, as any number which contains a zero should be dicarded immediately. — gen-y-s, Jun 27 '15 at 16:32
Note that algorithms based on removal tend to get more complicated and slower than those not modifying anything. Instead of `result += rotations.size();`, just `result++`, leave the primes alone, and iterate simply using `for (int next : primes)`. Getting rid of up to six primes at once sounds smart, but it isn't as **your operations are much more costly**. — maaartinus, Jun 27 '15 at 16:33
Also, you can discard any number which is a repetition of the same digit (e.g. 7777) when that digit is not 1. This is because and such number dddd = 1111 * d (where d=2....9). — gen-y-s, Jun 27 '15 at 16:37
@Snowman, sure I understand it doesnt matter, but it really hurts to see something like that..... — gen-y-s, Jun 27 '15 at 16:39
@gen-y-s also, please be aware that several parts of the code I posted here I changed specifically for testing and to illustrate the errors I was receiving. The final (working) code I now use does not instantiate multiple data structures. — , Jun 27 '15 at 16:47

DMulligan · Answer 1 · 2015-06-27T00:09:15.900

2

There's 2 bugs in your code:

1) The order does matter. Example: 2 is a prime number which passes the rotations test. 20 is not. A rotation of 20 is 2. Your code will therefore remove 2 and not count it, if it randomly iterates over 20 first. Here's a change to the getRotations function which would cause Tree/Hash Set to have equivalent results:

int current = start;
do {
   int currMagnitude = 1;
   for (int i = current; i > 9; i /= 10) {
      currMagnitude *= 10;
   }
   if (currMagnitude == magnitude)
       results.add(current);
   current = ((current % 10) * magnitude) + (current / 10);
} while (current != start);

2) You're deleting elements from a collection while you iterate over it. You're not supposed to do this in Java. I suspect if you modified your code like this both TreeSet and HashSet would have roughly equivalent speed:

Collection<Integer> primesCopy = new HashSet<>(primes);
for(Integer i in primesCopy) {
     if(!primes.contains(i)) continue;
     // rest of code as it was

edited Jun 27 '15 at 00:09

answered Jun 26 '15 at 23:43

DMulligan

8,993
6
33
34

The number 20 never comes up (a print statement verifies this). Since only primes are fed into the rotation function, 20 can never be entered. Furthermore, single digits rotate to themselves and break out of the loop (I verified this in the debugger when I break on a single-digit value). – Jun 26 '15 at 23:48
I gave 20 just as an example. The actual suspect is probably a large prime number with a 0 in the middle. – DMulligan Jun 26 '15 at 23:49
He's not deleting from the set while iterating over it, he does so after iterating over it, and he's not using the iterator after deletion. The business with primes containing zeroes might be a possible issue. Not sure why the treeset would handle that correctly, though. – G. Bach Jun 26 '15 at 23:49
I didn't say he deleted the Set, I said he is deleting elements from the set while iterating over it. – DMulligan Jun 26 '15 at 23:51
I just trapped for 107 and stepped through the loop. It correctly went from 107 to 710 to 71 and back to 107 where it broke out of the loop. – Jun 26 '15 at 23:51
@AFinkelstein I am not deleting from the set while iterating the set: I am iterating on "set not empty" and acquiring a new iterator each loop iteration for the sole purpose of getting an element (any single element) from the set. List and TreeSet provide a way to get the first element, HashSet does not. Creating an iterator and calling `next()` is a safe way to get an element from any `Collection`. – Jun 26 '15 at 23:53
That was a typo, thanks for pointing it out; he is still not deleting from it during iteration, but after, and he doesn't use the iterator after he deletes from the set. – G. Bach Jun 26 '15 at 23:53
I tested my suggestions. His code is failing on 930719 which removes 71993 and its rotations from the list when they should be there. If you copy the Set and iterate over the copy, the HashSet and TreeSet have the same speeds with HashSet being slightly faster. – DMulligan Jun 26 '15 at 23:55
@AFinkelstein I am not sure what you mean. I wrote a JUnit test for the `getRotations()` method and 930719 appears to be rotated correctly. Furthermore, _all_ rotations are being removed from the set on purpose. If a value does not exist, nothing happens. Regardless, I do not want to count it again. – Jun 27 '15 at 00:17
1

I'm saying that 71993 is a prime number and all its rotations are prime numbers. 930719 is a prime number rotated to 71993. HashSet is iterating over 930719 first (at least with my version of Java). So 71993 gets removed and never added. Then since its no longer included in your prime set, the other 4 rotations of 71993 can't be added because 71933 isn't included in your prime set. TreeSet iterates over 19937 first, so all 5 rotations get added. Try adding my fix, you'll see you get the same results then. – DMulligan Jun 27 '15 at 00:25
To the iterator speed part: it's not that it won't work the way you're iterating HashSet, it's just that it's not the way you're expected to iterate sets. Creating thousands of iterators on a changing hashtable will be slower than doing it on a tree with the left most node continously being popped off. – DMulligan Jun 27 '15 at 00:27
1

That sounds more like the actual issue, which isn't that the way he deletes items from the set is messing anything up, it's the order in which the elements are processed. – G. Bach Jun 27 '15 at 00:34
Makes sense. Given a prime number _n_ containing a zero, one of its rotations will have 0 in the unit position, meaning not every rotation is prime because such a number will be divisible by 10. If the rotation of the number with the zero in the leading position is prime (and that number without the leading zero has prime rotations) then that value will not be counted. By removing that zero, the other prime _must_ be smaller than _n_, making iteration order crucial in the general case. – Jun 27 '15 at 00:37
@AFinkelstein you nailed it, but the answer as written is not correct. If you update it and give a detailed example of what I explained in my previous comment, I will accept your answer. – Jun 27 '15 at 00:38
@G.Bach Deleting items isn't the issue with correctness but it is the issue with speed. Calling set.iterator().next() repeatedly on a static HashSet would be as fast as doing it on a TreeSet. Still think my original answer was correct as-is, but feel free to modify it if you'd like. – DMulligan Jun 27 '15 at 01:01
See my latest code update. I didn't edit it to use the code I added, but reducing the number of viable primes from 78,498 to 1,113 makes the choice of data structure matter a whole lot less while also removing the constraint that iteration must occur in ascending order. TreeSet still outperforms HashSet on my system, but the difference is a lot less now. – Jun 27 '15 at 01:06
@AFinkelstein The way the next item is selected is the issue with speed, I agree. I'm not entirely convinced that using a separate hashset to iterate over while checking the hashset version is a fix, it seems to me the order in which we get the primes to check depends on the hash, and a different (equally suitable) hashing function might give us a different order that still brings up the problems with zeroes you mentioned. It seems to me deleting is not the way to go, storing circular primes in a result set and taking its cardinality in the end should be safer. – G. Bach Jun 27 '15 at 11:31
@Snowman, 71 is not a rotation of 107 . A rotation must have the same length as the original. Duh !!! – gen-y-s Jun 27 '15 at 21:26
@gen-y-s correct. That is what I fixed in the code in a roundabout way by filtering out primes containing zeros (actually all even digits and five). – Jun 28 '15 at 13:48
If you use LinkedHashMap, the slowness of iteration should be gone, since this class uses a linked list to iterate over the items, so it doesnt need to search buckets, and the order will be the order of insertion, so it would be the same as in TreeMap – gen-y-s Jun 28 '15 at 21:59
@gen-y-s the algorithm now runs in under 10ms using TreeSet. Further optimization is squeezing blood from a rock. – Jun 29 '15 at 15:27

score 1 · Answer 2 · answered Jun 26 '15 at 23:30

Some fiddling around shows that the most expensive bit with the hashset is finding the next prime to check via Integer next = primes.iterator().next(); - on my machine, the version using the hashset takes almost exactly 4 seconds, of which it spends roughly 3.9 seconds with iterator-related business.

HashSet is based on HashMap, and its iterator has to step through all buckets until it finds a non-empty one; as far as I can tell from a skim of the source code of HashMap, it does not resize itself after a delete, i.e. once you bring it to a certain capacity, you will have to resize manually if you don't insert into it. This might have the effect that once you deleted a sizable portion of the elements of the HashSet, most of its buckets are empty, so finding the first non-empty bucket becomes expensive. My best guess regarding why removing from a HashSet doesn't trigger a resize is that it wasn't built with saving space and quick iteration in mind.

This doesn't happen with the treeset; it remains pretty shallow (log₂ 128000 is roughly 17, so that's about its maximum depth since there are between 75k and 80k primes below 10^6), and all it needs to do is step through to its leftmost element to find the next one.

That doesn't explain the whole affair for my machine though, since even ignoring that, the hashset is about 30% more expensive than the treeset. My best guess why that happens is that hashing Integers is extra load that is more expensive than finding integer keys in a treeset, but that's really barely a guess, certainly not a solid argument.

Project Euler 35: HashSet gives incorrect results

2 Answers2