
Following a question about how the JVM creates Strings from a char[], I mentioned that no per-element iteration takes place when the char[] is copied into the new String, since System.arraycopy is eventually called, which copies the desired memory with a function such as memcpy at a native, implementation-dependent level (see the original question).
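For context, this is the kind of copy the claim is about, sketched from the Java side (in OpenJDK 7 the String(char[]) constructor performs a defensive copy internally via Arrays.copyOfRange, which delegates to System.arraycopy; the class and variable names below are illustrative):

```java
// Illustrative sketch: copying a char[] into a fresh array the way the
// String(char[]) constructor does, using System.arraycopy directly.
public class ArrayCopyDemo {
    public static void main(String[] args) {
        char[] source = {'h', 'e', 'l', 'l', 'o'};

        // Defensive copy: the new String must not share the caller's array,
        // so the contents are copied rather than aliased.
        char[] copy = new char[source.length];
        System.arraycopy(source, 0, copy, 0, source.length);

        System.out.println(new String(copy)); // prints "hello"
    }
}
```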

I wanted to check that for myself, so I downloaded the OpenJDK 7 source code and started browsing it. I found the implementation of System.arraycopy in the OpenJDK C++ source code, in openjdk/hotspot/src/share/vm/oops/objArrayKlass.cpp:

if (stype == bound || Klass::cast(stype)->is_subtype_of(bound)) {
  // elements are guaranteed to be subtypes, so no check necessary
  bs->write_ref_array_pre(dst, length);
  Copy::conjoint_oops_atomic(src, dst, length);
} else {
  // slow case: need individual subtype checks
  // ...
}
If the elements need no type checks (which is the case with, for instance, primitive data type arrays), Copy::conjoint_oops_atomic gets called.

The Copy::conjoint_oops_atomic function resides in 'copy.hpp':

// overloaded for UseCompressedOops
static void conjoint_oops_atomic(narrowOop* from, narrowOop* to, size_t count) {
  assert(sizeof(narrowOop) == sizeof(jint), "this cast is wrong");
  assert_params_ok(from, to, LogBytesPerInt);
  pd_conjoint_jints_atomic((jint*)from, (jint*)to, count);
}

From here on we're platform-dependent, as the copy operation has a different implementation depending on OS and architecture. I'll use Windows as an example: openjdk\hotspot\src\os_cpu\windows_x86\vm\copy_windows_x86.inline.hpp:

static void pd_conjoint_oops_atomic(oop* from, oop* to, size_t count) {
  // Do better than this: inline memmove body  NEEDS CLEANUP
  if (from > to) {
    while (count-- > 0) {
      // Copy forwards
      *to++ = *from++;
    }
  } else {
    from += count - 1;
    to   += count - 1;
    while (count-- > 0) {
      // Copy backwards
      *to-- = *from--;
    }
  }
}

And... to my surprise, it iterates through the elements (the oop values), copying them one by one (seemingly). Can someone explain why the copy is done, even at the native level, by iterating through the elements in the array?
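(As an aside, the direction check in that function serves the same purpose as memmove's handling of overlapping regions: copying in the wrong direction clobbers source elements before they are read. A minimal sketch of that effect, written in Java with array indices standing in for the pointers; the class and method names are my own:)

```java
// Sketch of why pd_conjoint_oops_atomic chooses a copy direction: for
// overlapping source/destination regions, the direction determines whether
// each element is read before or after it gets overwritten.
public class OverlapCopyDemo {
    // Forward element-by-element copy, like the "from > to" branch
    static void copyForward(int[] a, int src, int dst, int count) {
        for (int i = 0; i < count; i++) {
            a[dst + i] = a[src + i];
        }
    }

    // Backward copy, like the else branch
    static void copyBackward(int[] a, int src, int dst, int count) {
        for (int i = count - 1; i >= 0; i--) {
            a[dst + i] = a[src + i];
        }
    }

    public static void main(String[] args) {
        // Shift {1,2,3,4} one slot to the right within the same array.
        int[] bad = {1, 2, 3, 4, 0};
        copyForward(bad, 0, 1, 4);   // overwrites elements before reading them
        // bad is now {1, 1, 1, 1, 1} -- corrupted

        int[] good = {1, 2, 3, 4, 0};
        copyBackward(good, 0, 1, 4); // reads each element before overwriting it
        // good is now {1, 1, 2, 3, 4} -- correct shift
    }
}
```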

Andrei Bârsan

1 Answer


Because jint maps most closely to C's int, which in turn maps to the classic hardware-architecture WORD, basically the same size as the width of the data bus.

Today's memory architectures and CPUs are designed to keep making progress even in the event of a cache miss, and memory subsystems tend to pre-fetch blocks of nearby data. The code you are looking at isn't as "bad" in performance as you might think. The hardware is smarter than it seems, and if you don't actually profile, your "smart" fetching routines might add nothing (or even slow down processing).

When you are introduced to hardware architectures, you start with simple ones. Modern ones do a lot more, so you can't assume that code that looks inefficient actually is. For example, when a memory lookup is needed to evaluate the condition of an if statement, the CPU often speculatively executes past the branch while the lookup is in flight, and the work from the wrongly guessed path is discarded once the data becomes available to evaluate the condition. If you want to be efficient, you must profile and then act on the profiled data.
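In that spirit, here is a rough sketch of the kind of measurement the answer recommends: compare the manual loop against System.arraycopy instead of assuming either is faster. (A real benchmark should use a harness like JMH; this naive timing is only illustrative, and the class name is my own.)

```java
import java.util.Arrays;

// Naive comparison of a hand-written copy loop vs. System.arraycopy.
// The timings are illustrative only; JIT warm-up and measurement noise
// mean a proper benchmark harness is needed for real conclusions.
public class CopyProfileSketch {
    public static void main(String[] args) {
        int[] src = new int[10_000];
        for (int i = 0; i < src.length; i++) src[i] = i;

        int[] byLoop = new int[src.length];
        long t0 = System.nanoTime();
        for (int i = 0; i < src.length; i++) byLoop[i] = src[i];
        long loopNanos = System.nanoTime() - t0;

        int[] byArraycopy = new int[src.length];
        long t1 = System.nanoTime();
        System.arraycopy(src, 0, byArraycopy, 0, src.length);
        long arraycopyNanos = System.nanoTime() - t1;

        // Both approaches produce identical results; only measurement on a
        // given JVM and CPU tells you which is faster.
        System.out.println(Arrays.equals(byLoop, byArraycopy));
        System.out.println("loop: " + loopNanos + " ns, arraycopy: " + arraycopyNanos + " ns");
    }
}
```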

Look at the branch-on-JVM-opcode section. You'll see it is (or perhaps just was) an #ifdef oddity supporting (at one time) three different ways of jumping to the code that handles an opcode, because those three ways actually made a meaningful performance difference on Windows, Linux, and Solaris.

Perhaps they could have included MMX routines, but the fact that they didn't tells me that Sun didn't think it was enough of a performance gain on modern hardware to be worth it.

Edwin Buck
  • Wow, thanks! It was a bit confusing looking through the OpenJDK implementation for the first time, so I was expecting to have gotten something wrong. :P So how do you think this optimization takes place? I did some tests, and System.arraycopy is twice as fast at copying 10000 ints as a regular Java loop. In C++ a similar task is still noticeably faster, although the results might be affected by various compiler optimisations. – Andrei Bârsan Jun 26 '12 at 15:56
  • A C++ copy doesn't have a garbage collector running on a separate thread. Even if you don't generate garbage, the collector has to steal a few cycles to verify that it has no work to do. I am not sure if the compiler is unrolling the arraycopy loop or if the hardware is prefetching the entire block of the array into cache. In fact, with microcode optimization, it is beyond my depth of knowledge. That's why profiling is so important; it is the test that proves the optimization was worthwhile. – Edwin Buck Jun 26 '12 at 18:34