
So I was looking for a faster way to calculate MD5 checksums and ran across the Fast MD5 library - but when I benchmark it with Java 7 on my machine it comes out slower than the Java version.

Either I am doing something stupid (very likely) or Java 7 has implemented a better algorithm (also likely). Here's my super simple "benchmark" - maybe I just didn't have enough coffee today...

    MD5 digest = new MD5();
    System.out.println(MD5.initNativeLibrary(true));
    byte[] buff = IOUtils.readFully(new FileInputStream(new File("blahblah.bin")), 64000000, true);
    ByteBuffer buffer = ByteBuffer.wrap(buff);
    long start, end, total = 0;
    for (int j = 0; j < 100; j++) {
        start = System.currentTimeMillis();
        String md5Base64 = Utilities.getDigestBase64(buffer);
        end = System.currentTimeMillis();
        total = total + (end-start);
    }
    System.out.println("Took " + ((total)/100.00) + " ms. for " + buff.length+" bytes");
    total = 0;
    for (int i = 0; i < 100; i++) {
        start = System.currentTimeMillis();
        digest.Init();
        digest.Update(buff);
        digest.Final();
        end = System.currentTimeMillis();
        total = total + (end-start);
    }
    System.out.println("Took " + ((total)/100.00) + " ms. for " + buff.length+" bytes");

And I get:

Took 247.99 ms. for 64000000 bytes
Took 295.16 ms. for 64000000 bytes

Per a comment, I ran the benchmark over and over and got the strangest results: the FastMD5 time stays the same, but the Java 7 version gets slower. Why?

Took 246.54 ms. for 64000000 bytes
Took 294.69 ms. for 64000000 bytes
************************************
Took 540.55 ms. for 64000000 bytes
Took 292.69 ms. for 64000000 bytes
************************************
Took 537.07 ms. for 64000000 bytes
Took 292.12 ms. for 64000000 bytes

Gandalf
  • I can't see a question here. – Ash Burlaczenko Jan 31 '13 at 23:31
  • Is Fast MD5 no longer faster than the Java 7 implementation, or is my benchmark simply wrong? – Gandalf Jan 31 '13 at 23:32
  • The first problem is that you've put both methods in the same program and call one after the other. You'll get all sorts of nonsense with that. Best to split them. For warm up, it's a good idea to show the different values for a sequence of a few runs (mean and standard deviations are also good). – Tom Hawtin - tackline Jan 31 '13 at 23:36
  • @SecurityMatt - cause it's all I need. – Gandalf Jan 31 '13 at 23:57
  • @Gandalf It takes the VM/JIT time to warm up (depends upon impl. and settings) - I would loop *both* tests as a *pair* a number of times (say, 100, so each single MD5 hash is executed 100x100 or more) and see if anything changes. Also, total is *in* ms, so `total/100.00` is no longer in ms. –  Feb 01 '13 at 00:02
  • @Gandalf: So long as you're not storing passwords, or using it to verify data integrity that's fine. – SecurityMatt Feb 01 '13 at 00:26
  • @SecurityMatt - yup, that's all I'm doing. – Gandalf Feb 07 '13 at 01:01
  • Can you post an [sscce](http://sscce.org/)? – Miserable Variable Feb 07 '13 at 01:10
  • The above isn't simple enough? – Gandalf Feb 07 '13 at 02:01
  • Your micro-benchmark is at best inaccurate, at worst completely misleading. Either spend time understanding [how to write a correct micro benchmark](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) or use [a library](http://code.google.com/p/caliper/) that can benchmark your code properly. – assylias Feb 07 '13 at 22:09
  • @assylias Barring an obvious bug in his code (see my answer), I do believe that Gandalf's results make perfect sense. I think the best way to judge a benchmark is by (a) trying to predict its results and seeing if the numbers match that prediction reasonably well, and (b) varying the parameters of the benchmark (here: data size) and checking for the correct changes in the results (here: linear with offset). Simply using a library that claims to do things right doesn't prevent optimizations, for example, from messing up your results either. – Markus A. Feb 10 '13 at 20:06

3 Answers


Let's first answer the easy part of your question:

I think your Java 7 execution time roughly doubles when you run the code again because, if you simply put the code you posted into a for loop, you forget to reset total back to 0 before the 2nd, 3rd, 4th, ... run of the Java 7 test (on the first run it is probably 0 from the variable initialization).
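A minimal sketch of the fix, assuming the posted code was wrapped in an outer loop (the outer run count here is made up for illustration, and the actual hashing is elided):

```java
public class ResetTotalDemo {
    public static void main(String[] args) {
        long total = 0;
        for (int run = 0; run < 3; run++) {
            total = 0; // without this reset, every run's reported average also
                       // carries the accumulated total of all earlier runs
            for (int j = 0; j < 100; j++) {
                long start = System.currentTimeMillis();
                // ... compute the MD5 digest here ...
                total += System.currentTimeMillis() - start;
            }
            System.out.println("Took " + (total / 100.0) + " ms.");
        }
    }
}
```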

So, fixing your table by simply subtracting the offset that you didn't set back to 0 gives:

Took 246.54 ms. for 64000000 bytes
Took 294.69 ms. for 64000000 bytes              <---.
************************************                |
Took 245.86 ms. for 64000000 bytes   (subtracting 294.69)
Took 292.69 ms. for 64000000 bytes              <---.
************************************                |
Took 244.38 ms. for 64000000 bytes   (subtracting 292.69)
Took 292.12 ms. for 64000000 bytes

Now things seem very consistent; they even show the "JVM warm-up" mentioned in one of the other replies, and that it only makes a difference of about 1% here.

Now, why is Java 7 performing better than FastMD5?

They probably used an even better algorithm that is more tuned to the optimizations performed by the Java compiler afterwards.

For example, the nio ByteBuffers are specifically designed to give faster access to memory by using native mechanisms like DMA. So, the fact that the Java 7 implementation of MD5 takes a ByteBuffer as input rather than a byte[] makes me think that they are actually making use of these capabilities. (Otherwise they would probably just have taken the byte[].)
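As an aside, since we can't see what Utilities.getDigestBase64 does internally, here is a plain-JDK sketch of that ByteBuffer path: MessageDigest.update(ByteBuffer) (in the JDK since Java 5) lets the implementation consume the buffer in bulk:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;

public class Md5ByteBufferDemo {
    public static void main(String[] args) throws Exception {
        byte[] data = "abc".getBytes("US-ASCII");
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Feed the digest a ByteBuffer instead of a byte[]; the JDK can
        // then process the input in bulk rather than byte-by-byte.
        md.update(ByteBuffer.wrap(data));
        byte[] hash = md.digest();
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) hex.append(String.format("%02x", b));
        System.out.println(hex); // 900150983cd24fb0d6963f7d28e17f72
    }
}
```

The expected output is the RFC 1321 test vector for "abc".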

To say anything more, we would probably need to know what your Utilities object does exactly, and then compare the source code of FastMD5 and the Java implementation.

But I would say: your results (given the total=0 fix) make perfect sense to me, and you can probably just enjoy the fact that you can make do with one less external library dependency! ;)

BTW: The performance difference you are seeing corresponds to only around 2-3 CPU clock cycles per processed byte (out of a total of around 15 clock cycles per byte) on a 3.5GHz CPU. So, given that the difference is quite tiny, which of the two ends up being faster may well depend on the exact platform and JVM used.

Addition

Your benchmarking numbers suggest that you can process about 220-260MB/s with the two MD5 implementations, which sounds reasonable if you look at other claimed specs that a Google search reveals (e.g. http://www.zorinaq.com/papers/md5-amd64.html under "Resulting Implementation"). So, contrary to all the other replies you received, I do feel like I would trust your numbers.
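The arithmetic behind that estimate, using the timings from the question (64,000,000 bytes per pass):

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        double megabytes = 64000000 / 1e6; // 64 MB hashed per pass
        // MB divided by seconds-per-pass gives MB/s
        System.out.printf(java.util.Locale.ROOT,
                "Java 7:  %.1f MB/s%n", megabytes / (246.54 / 1000.0));
        System.out.printf(java.util.Locale.ROOT,
                "FastMD5: %.1f MB/s%n", megabytes / (294.69 / 1000.0));
    }
}
```

This prints roughly 259.6 MB/s for the Java 7 pass and 217.2 MB/s for FastMD5, consistent with the 220-260MB/s range above.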

If you want to be extra sure, vary the size of the byte[] and look at the resulting change in processing time. If things work as they should, you will see a linear relationship that you can fit with this function:

total/100.0 = m * buff.length + b           (your usual y = mx + b)

Here, m is the processing time per byte and should be around 1 / (250 MB/s) = 4 ns/byte, and b is the setup time the function uses to initialize local variables, etc., as well as the time that System.currentTimeMillis() takes. This number should be fairly small (probably less than 1 ms).

Then, to determine which of the two algorithms is better suited for you, you need to compare m AND b. If you usually process small data arrays, b might become more important than m in determining which algorithm is better, while for large data sets, the algorithm with the smaller m is better.
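A hedged sketch of that measurement against the built-in MessageDigest (the buffer sizes and repeat counts are arbitrary choices, and the fitted m and b will vary with hardware and JVM):

```java
import java.security.MessageDigest;

public class Md5LinearFit {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        int[] sizes = {1 << 16, 1 << 18, 1 << 20, 1 << 22};
        double[] x = new double[sizes.length];
        double[] y = new double[sizes.length];
        for (int i = 0; i < sizes.length; i++) {
            byte[] buf = new byte[sizes[i]];
            for (int w = 0; w < 20; w++) md.digest(buf);  // warm up
            long t0 = System.nanoTime();
            for (int r = 0; r < 50; r++) md.digest(buf);  // timed runs
            x[i] = sizes[i];
            y[i] = (System.nanoTime() - t0) / 50.0;       // avg ns per pass
        }
        // Ordinary least squares for y = m*x + b
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double m = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double b = (sy - m * sx) / n;
        System.out.printf(java.util.Locale.ROOT,
                "m = %.3f ns/byte, b = %.0f ns%n", m, b);
    }
}
```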

Markus A.

I wrote my own benchmark. My answer:

It Depends!

Here are my results (running on 3.4-trunk-amd64 linux and Java 1.7.0_05):

1.) For small amounts of data, Java wins.

TINY DATA new byte[12]      SMALL DATA new byte[123]

Java builtin MD5...         Java builtin MD5...
encode 55 MB/s              encode 217 MB/s
encode 55 MB/s              encode 215 MB/s

Java Fast-MD5...            Java Fast-MD5...
encode 31 MB/s              encode 150 MB/s
encode 32 MB/s              encode 159 MB/s

Native Fast-MD5...          Native Fast-MD5...
encode 22 MB/s              encode 133 MB/s
encode 22 MB/s              encode 133 MB/s

2.) From 1KB of data onwards, Native Fast-MD5 always wins:

MEDIUM DATA new byte[1234]  LARGE DATA new byte[12345]

Java builtin MD5...         Java builtin MD5...
encode 351 MB/s             encode 366 MB/s
encode 351 MB/s             encode 369 MB/s

Java Fast-MD5...            Java Fast-MD5...
encode 300 MB/s             encode 325 MB/s
encode 298 MB/s             encode 322 MB/s

Native Fast-MD5...          Native Fast-MD5...
encode 434 MB/s             encode 582 MB/s
encode 450 MB/s             encode 574 MB/s

3.) Speeds appear to stabilize after 12KB. No real change for 123KB:

X-LARGE DATA new byte[123456]

Java builtin MD5...
encode 367 MB/s
encode 370 MB/s

Java Fast-MD5...
encode 325 MB/s
encode 324 MB/s

Native Fast-MD5...
encode 571 MB/s
encode 599 MB/s

Conclusions:

  • Java's builtin MD5 always beats Fast-MD5's fallback (non-native) implementation in my setup.

  • All the implementations speed up as the data pieces get larger.

  • Fast-MD5's native implementation is the winner with larger pieces of data (1KB or bigger).

Question for Gandalf:

  • Are you sure you are setting up your Fast-MD5 installation to properly use the native code (e.g., Fast-MD5 is able to find MD5.so or MD5.dll)?

It's not really possible for me to put together a benchmark as an "sscce" --- it's 150 lines! You can download it here instead:

http://juliusdavies.ca/base64bench/

Run it like so (after building it with ant):

java ca.juliusdavies.base64bench.MD5BenchByte2Byte MD5.so

Here's a direct link to the benchmark source code:

http://juliusdavies.ca/base64bench/exploded/base64bench/src/java/ca/juliusdavies/base64bench/MD5BenchByte2Byte.java.html

Julius Musseau

When profiling, the following rules are important:

  1. You care about the amortized case, not the first run. So run the tests repeatedly in a loop and wait for it to settle.

  2. You need to be careful about the profiling itself. In your case, the first few runs of System.currentTimeMillis taking much longer than later runs could completely skew your performance metrics.

  3. Always take measurements of large numbers of iterations, never measure a single iteration in isolation.

  4. The number of iterations needs to be large to have any significance, and you need to run your test lots of times to get an unbiased assessment.

Try running something a bit more like the following:

MD5 digest = new MD5();
System.out.println(MD5.initNativeLibrary(true));
byte[] buff = IOUtils.readFully(new FileInputStream(new File("blahblah.bin")), 64000000, true);
ByteBuffer buffer = ByteBuffer.wrap(buff);

int iterations = 10000;
long start, end;

while(true)
{
   //
   // 1. Run the first test:
   //
   start = System.currentTimeMillis();
   for (int j = 0; j < iterations; j++) {
       String md5Base64 = Utilities.getDigestBase64(buffer);
   }
   end = System.currentTimeMillis();
   System.out.println("(1) " + (end - start) );

   //
   // 2. Run the second test:
   //
   start = System.currentTimeMillis();
   for (int i = 0; i < iterations; i++) {
      digest.Init();
      digest.Update(buff);
      digest.Final();
   }
   end = System.currentTimeMillis();

   System.out.println("(2) " + (end - start) );
}
SecurityMatt
  • I ran each 100 times, 1000 times, no difference. – Gandalf Feb 07 '13 at 19:23
  • Did you try fixing any of the other problems that I pointed out in my answer (for example, did you try something that looks like the code I posted)? Also 1000 times is not very many. Try running it 100,000 times. – SecurityMatt Feb 07 '13 at 19:45
  • Running the 2 tests within the same method makes no sense. Only one of them will get optimised. And 10,000 iterations is too little, it won't even let the JIT compile anything... – assylias Feb 07 '13 at 22:11
  • @assylias: Both of them will be optimised. And I'm not running the test 10,000 times. I'm running the test 10,000 times each time I go round the while(true) loop. – SecurityMatt Feb 07 '13 at 22:18
  • @SecurityMatt I missed the outer loop indeed - however, if you check the compilation messages, it is likely that you will see the compilation take place at the first while loop, between the first and the second tests. – assylias Feb 07 '13 at 22:20
  • @assylias: Good benchmarking requires that you discard the first several "rounds". After that, both (1) and (2) are as optimised as they're ever going to get, and you start printing out a stream of tests corresponding to the relative speed of (1) and (2). Some tests will be skewed by GC operations or re-JITs, but over time a pattern will emerge showing one as faster than the other. – SecurityMatt Feb 07 '13 at 22:22
  • @SecurityMatt It is not certain at all that both tests will be optimised properly with your setup - for example if you split it in 2 methods, the JIT might do more inlining than if you keep them in the same main loop - also in your first loop, the String is a local variable - it is very possible that the JIT will simply consider it as a no-op and will optimise the whole loop away if it can prove that the called method has no side-effect (I don't know if it is the case). What I'm trying to say is that your approach might or might not produce good results but it is not very robust. – assylias Feb 07 '13 at 22:28
  • @assylias: I didn't change the innards of the loops from the op's original. Note that the compiler might be able to optimise the second loop away as a no-op too, since it can inline all of the code and remove assignments that aren't checked. Benchmarks are inherently biased because they measure code in an unnatural location. My code snippet wasn't so much a "here's the best benchmark to run", but more a collection of critiques of the op's own benchmarks. – SecurityMatt Feb 08 '13 at 01:21