
Suppose I have an array of 64 bits in a 64-bit POSIX system.

Let's assume the processor cache contains my array and the registers are 64+ bits long.

Can I expect O(1) time from System.arraycopy(…)? In the sense that the time of copying doesn't depend on the length of my array.

The question is related to Time complexity of System.arraycopy(...)?.

I feel I can expect it, but does it work like that under the bonnet? Do I become system- and JVM-dependent in this case?

Evgeny Mamaev
    It can't be O(1) unless there's some weird table-mapping going on, which is almost certainly not going to happen. It *can* be much faster than a naive loop copying the data, but that difference is not captured by big-O notation (which really is more of a theoretical tool and not useful for comparing two different ways to implement the same basic algorithm). – Joachim Sauer Sep 13 '19 at 13:19
    If you set n to a specific number, you can replace all `n` in Big-O, resulting in constant time. That's not useful at all. – Johannes Kuhn Sep 13 '19 at 14:55

2 Answers


Let's assume the processor cache contains my array

That isn't really relevant.

What is relevant: first of all, the type of the underlying array data.

As in: only when you have say, byte data[] = new byte[8], then you have 64 bits that really, sequentially sit one after the other in memory.

Because, when you have Byte data[] = new Byte[8] then you are already broken, as the slots in the array are just pointers, pointing somewhere on the heap!
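The byte[] vs Byte[] distinction can be sketched in a few lines (the class name is made up for illustration). With Byte[], System.arraycopy copies only the references in the slots, not the boxed payload bytes:

```java
public class ArrayLayoutDemo {
    public static void main(String[] args) {
        // byte[]: 8 primitive bytes stored contiguously -- 64 bits of payload.
        byte[] primitives = new byte[] {1, 2, 3, 4, 5, 6, 7, 8};

        // Byte[]: 8 slots, each holding a reference to a boxed Byte on the heap.
        Byte[] boxed = new Byte[] {1, 2, 3, 4, 5, 6, 7, 8};

        Byte[] copy = new Byte[8];
        System.arraycopy(boxed, 0, copy, 0, 8);

        // arraycopy copied only the references: both arrays now point
        // at the very same Byte objects.
        System.out.println(copy[0] == boxed[0]); // true -- same reference
    }
}
```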

So, let's rephrase: assuming you have a 64 bit architecture, then sure: the CPU should be able to treat 64 bits with a single instruction. Be that for reading the whole array into the CPU cache, or a CPU register, be it for copying that data to a different position in memory.

But I completely agree with the comment by user Joachim: Big-O isn't really applicable here. We use Big-O to determine/estimate the "cost" of an algorithm. It is not a tool to assess the differences between implementations. Especially given the fact that the OP mixes in so many different levels/layers here. Details such as the CPU cache or "register width" aren't elements that matter to Big-O!

So, as said, what we can state: 64 bits can be moved around in "one shot" on a system that supports 64 bit top to bottom.
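The "one shot" idea can be sketched without any JVM internals: a 64-bit payload can be read as a single long and written back with a single store, e.g. via ByteBuffer (a minimal illustration, not how arraycopy is implemented):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SingleWordCopy {
    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4, 5, 6, 7, 8}; // exactly 64 bits of data
        byte[] dst = new byte[8];

        // Read all 64 bits as one long, then write them back in one store.
        long word = ByteBuffer.wrap(src).getLong();
        ByteBuffer.wrap(dst).putLong(word);

        System.out.println(Arrays.equals(src, dst)); // true
    }
}
```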

GhostCat
    It doesn’t matter how many layers are involved. The actual point is, the Big O notation tells, how an algorithm scales with large input. So the behavior for small or even fixed size input is entirely irrelevant for the time complexity. Regardless of the architecture, you can assume that copying two million elements will roughly take twice the time of copying one million elements. That’s what `O(n)` means. – Holger Sep 13 '19 at 16:17
    @Holger It does on a certain level: because it defines the context you are within. Big-O for me is a mostly theoretical concept that helps us classifying *algorithms*. An algorithm isn't a program. Implementations depend on all aspects of the layers involved. But I get your point and will think about improving the wording. – GhostCat Sep 13 '19 at 17:21
    Implementations can exhibit a behavior that doesn't match the theoretical prediction of the Big O complexity. I suppose that's where you are heading. But that's moot when Big O doesn't apply to the question in the first place. – Holger Sep 13 '19 at 17:27

64-bit POSIX system.

POSIX has nothing to do with CPU features related to chunked data copying.

Let's assume the processor cache contains my array

It would save a trip to main memory while executing the copy instructions, but it does not affect the order in Big O notation.

the registers are 64+ bits long.

Even if you have AVX-512 support on your architecture, with 512-bit-wide zmm registers and JDK 9+ (AFAIK the runtime is aware of AVX-512 starting with JDK 9), it would allow you to copy 8 packed 64-bit integers per instruction, but that does not affect the order of complexity.

So to copy, say, 1024 64-bit integers, you would need to execute at least 128 vector instructions, again yielding O(n) complexity, but with a lower constant.
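The arithmetic above can be sketched directly (the class and method names are made up for illustration): a 512-bit register holds 8 packed 64-bit integers, so the number of vector moves still grows linearly in n.

```java
public class VectorChunkCount {
    // 8 packed 64-bit integers fit in one 512-bit zmm register.
    static final int LONGS_PER_ZMM = 512 / 64;

    // Minimum number of 512-bit vector moves needed to copy n longs
    // (ceiling division).
    static long vectorMoves(long n) {
        return (n + LONGS_PER_ZMM - 1) / LONGS_PER_ZMM;
    }

    public static void main(String[] args) {
        System.out.println(vectorMoves(1024)); // 128, as in the text
        System.out.println(vectorMoves(2048)); // 256 -- double n, double the work: O(n)
    }
}
```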


HotSpot implementation note:

The architecture-dependent code for the arraycopy implementation is generated during the JVM's global bootstrapping "phase", in StubRoutines::initialize2.

In particular, the chunked-copy routine code generation is done in the platform-dependent section of the HotSpot code, by the copy_bytes_forward function (using HotSpot's own Macro Assembler implementation).

The crucial part of it is the CPU feature checks, like

if (UseAVX > 2) {
  __ evmovdqul(xmm0, Address(end_from, qword_count, Address::times_8, -56), Assembler::AVX_512bit);
  __ evmovdqul(Address(end_to, qword_count, Address::times_8, -56), xmm0, Assembler::AVX_512bit);
} else if (UseAVX == 2) {
  __ vmovdqu(xmm0, Address(end_from, qword_count, Address::times_8, -56));
  __ vmovdqu(Address(end_to, qword_count, Address::times_8, -56), xmm0);
  __ vmovdqu(xmm1, Address(end_from, qword_count, Address::times_8, -24));
  __ vmovdqu(Address(end_to, qword_count, Address::times_8, -24), xmm1);
} else {
  // ...
}

which produce code based on the available CPU features. The feature detector is generated and called earlier, in the architecture-dependent generator generate_get_cpu_info, based on the cpuid instruction.

St.Antario