
This post is divided into two major sections. The first introduces the original test cases and results, along with my thoughts about them. The second details the modified test cases and their results.

The original title of this topic was "Full iteration over an array significantly faster than with a linked list". The title was changed due to the newer test results (presented in Section Two).


SECTION ONE: The Original Test Case

For a full one-directional sequential traversal, the linked list and the array are known to have similar performance, but thanks to the cache-friendliness (locality of reference) of the contiguous array, the array may perform (slightly) better. To see how this works in practice (Android, Java), I examined the above claims and made some measurements.

First of all, my naive assumptions. Let's take a look at the following class:

private static class Message {
    public int value1;   // payload checked during iteration
    public Message next; // link, used only in the linked list tests

    public Message(java.util.Random r, Message nextmsg) {
        value1 = r.nextInt();
        next = nextmsg;
    }
}

In the first measurement scenario, its next field is not used at all. The code below creates an array of 1,000,000 Message instances, then iterates over the array in a loop, measuring how long the iteration takes.

Log.i("TEST", "Preparing...");          

final int cnt = 1000000;
int val = 0;
java.util.Random r = new java.util.Random();
Message[] messages = new Message[cnt];
for (int i = 0; i < cnt; i++) {
    messages[i] = new Message(r, null);
}           

Log.i("TEST", "Starting iterating...");
long start = SystemClock.uptimeMillis();

for (int i = 0; i < cnt; i++) {
    Message msg = messages[i];
    if (msg.value1 > 564645) {
        val++;
    }
}       
Log.w("TEST", "Time: " + (SystemClock.uptimeMillis() - start) + ", result: " + val);

The second measurement builds and measures a linked list of Message objects instead:

Log.i("TEST", "Preparing...");          

final int cnt = 1000000;
int val = 0;
java.util.Random r = new java.util.Random();
Message current = null;
Message previous = null;
for (int i = 0; i < cnt; i++) {
    current = new Message(r, previous);
    previous = current;
}
previous = null;

Log.i("TEST", "Starting iterating...");
long start = SystemClock.uptimeMillis();
while (current != null) {
    if (current.value1 > 564645) {
        val++;
    }
    current = current.next;
}           

Log.w("TEST","Time: " + (SystemClock.uptimeMillis() - start) + ", result: " + val);

The first test consistently produces 41-44 ms, while the second gives 80-85 ms. The linked list iteration is about 100% slower, i.e. it takes twice as long.

My (possibly flawed) train of thought is as follows; I welcome (in fact, encourage) any corrections.

OK, we can often read that an array is a contiguous memory block, and thus accessing its elements sequentially is more cache-friendly than with a linked list. But in our case, the elements of the array are only object references, not the Message objects themselves (in Java, we have no value types, i.e. nothing like C# structs, that could be stored embedded in an array). Therefore, the "locality of reference" only applies to the array elements themselves, and these only specify the addresses of the objects. Consequently, the Message instances could (in general) still be "anywhere" in memory, so the "locality of reference" doesn't apply to the instances themselves. From this point on, it looks like we're in the same situation as with a linked list: the instances may reside "anywhere" in memory; the array only guarantees that their references are stored in a contiguous block...
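
To illustrate the layout (a schematic only, not actual addresses):

messages (contiguous block):  [ref0][ref1][ref2] ... [ref999999]
                                 |     |     |
                                 v     v     v
heap (in principle):      ...obj1...obj0.....obj2...  (instances could be anywhere)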

...and here comes the use case: a complete sequential traversal (iteration). First, let's examine how we obtain the references to the instances in each case. With the array, it's very efficient, because the references are in a contiguous block. With the linked list, we're also fine: once we've accessed a Message instance (which is why we iterate!), we immediately have the reference to the next instance. And since we've already accessed one field of that Message, accessing another field (next) should be served by the cache (the fields of the same object have locality of reference too, AFAIK; they're laid out in a contiguous block as well). To sum up, it seems to break down to this:

  1. The array offers a cache-friendly iteration over references. Message instances themselves may be "anywhere" in memory, and we need to visit these, too.
  2. The linked list gives us the reference to the next element as soon as the current Message instance is accessed. This is "free", because each Message instance must be visited anyway (just as in the array case).

So, based on the above, it looks like the array should be no better than the linked list. The only exception is an array of a primitive type (but in that case, it wouldn't make sense to compare it with a linked list). So I'd expect them to perform similarly, but they didn't: there was a huge difference. In fact, if we assume that array indexing requires a range check on each access, the linked list could, in theory, even be faster. (The range check is probably optimized away by the JIT, so I accept that this is not a valid point.)
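
For comparison, here is what a primitive-array baseline would look like -- only a sketch to illustrate the point; this variant was not part of my measurements. Here the int values themselves sit in the contiguous block, so locality of reference applies to the data, not just to the references:

final int cnt = 1000000;
int val = 0;
java.util.Random r = new java.util.Random();
int[] values = new int[cnt];
for (int i = 0; i < cnt; i++) {
    values[i] = r.nextInt();
}

long start = SystemClock.uptimeMillis();
for (int i = 0; i < cnt; i++) {
    // direct value access: no object reference to chase
    if (values[i] > 564645) {
        val++;
    }
}
Log.w("TEST", "Time: " + (SystemClock.uptimeMillis() - start) + ", result: " + val);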

My guesses are the following:

  1. It's probably not the array's cache-friendliness that is responsible for the 100% difference. Instead, the JIT performs optimizations that can't be done for the linked list traversal. If the range check and the (VM-level) null check are eliminated, then I guess the "array-get" bytecode instruction might be faster than the "field-get" (or whatever it's called) instruction in the linked list case (?).

  2. Even though the Message instances could be "anywhere" in memory, they are probably very close to each other, because they were allocated "at the same time". But 1,000,000 instances can't be cached, only a part of them. In such a case, sequential access would be cache-friendly both in the array and in the linked list case, so this doesn't explain the difference.

  3. Some intelligent "prediction" (prefetching) of the Message instance I'll access next? I.e. somehow there is still cache-friendliness for the Message instances themselves, but only in the case of array access.

UPDATE: Since several comments were received, I'd like to react to them below.

@irreputable:

the linked list is visited from high address to low address. what if it's the other way around, i.e. next points to a newer object, not a previous object

Very good spot! I didn't think of this little detail, namely that the layout may influence the test. I'll test it today and will return with the results. (Edit: the results are here; I've updated this post with Section Two.)

@Torben comments:

Also I would say that this whole exercise seems pretty useless. You are talking about 4ms improvement over 100000 iterations. Seems like premature optimization. If you have a situation where this is a bottleneck, then please describe it and we can look into it (because it definitely would be a more interesting problem than this).

If it's not interesting to you, then you can disregard this topic (instead of posting 4 times). About your baseless assumption of "premature optimization" -- I'm afraid you read too much SO and perform too little industrial-level development. The concrete situation is in a simulation-related software which might have to traverse these lists several times per second. Indeed, 120+ ms latency may affect the responsiveness of an app.

I appreciate the thought you put into this, but I really can't find a question from your post. :) Edit: And array iteration is 50% faster. 100% faster would mean zero time consumed.

I'm sure it was pretty obvious from my post that I'm wondering why the very significant difference is present, when the arguments above would imply otherwise. Thanks for the correction: indeed, I meant to write that the linked list case is 100% slower.


SECTION TWO: The Modified Test Cases

irreputable had a very interesting observation for me:

the linked list is visited from high address to low address. what if it's the other way around, i.e. next points to a newer object, not a previous object

I changed the linked list structure so that the direction of its next pointers matches the instantiation order of its nodes:

Message current = null;
Message previous = new Message(r, null);
Message first = previous;
for (int i = 1; i < cnt; i++) { // start at 1: the first node is already created
    current = new Message(r, null);
    previous.next = current;
    previous = current;
}
previous = current = null;

(Note that the creation algorithm may not be the most compact one; a slightly tidier variant is sketched after the iteration code.) The code that iterates through this linked list:

while (first != null) {
    if (first.value1 > 564645) {
        val++;
    }
    first = first.next;
}
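
For reference, a slightly more compact construction of the same forward-ordered list (a sketch of the "nicer way" mentioned above, not the code that produced the measurements below):

Message first = new Message(r, null);
Message tail = first;
for (int i = 1; i < cnt; i++) {
    tail.next = new Message(r, null);
    tail = tail.next;
}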

And now the result I get is consistently 37-39 ms. (OK, we could say it's exactly the array's performance, but actually it's slightly faster, in every test run.) Compared with the 80 ms of the "reversed-direction" linked list, it's twice as fast!

Then I made a similar change to the original array test case: I reversed the array traversal direction (to a count-down loop):

for (int i = cnt - 1; i >= 0; i--) {
    Message msg = messages[i];
    if (msg.value1 > 564645) {
        val++;
    }
}

And the result is consistently 85-90 ms! The original test case produced 40-41 ms.

It seems there are two new conclusions (and one question) now:

  1. The original claim seems to be true: the array's "locality of reference" (due to the contiguous memory block) doesn't provide an advantage for "reference-type" (i.e. object) arrays when they're compared with linked lists. This is because object arrays only hold references to the object instances, not the instances themselves (which can, in theory, be "anywhere" in memory, just as with a linked list).

  2. In my test cases, the result seems to depend on the direction of traversal, even in the array scenario (!). How is this possible?

To sum up my test results:

  1. In "forward" direction traversal, the linked list slightly outperforms the array traversal (exactly as expected: we have the next reference immediately when a Message instance is obtained, i.e. even there is no need to access an array element for obtaining its address).

  2. In "backward" direction traversal, both have about 100% weaker performance (and the linked list also slightly outperforms the array).

Any ideas?

UPDATE 1: dthorpe made very valuable comments. I'll copy them here, as they might help in finding an answer to this "riddle".

Is there any indication that the hardware implements a look-ahead page prefetch in the memory cache controller? Instead of loading only the memory page needed for a memory reference, also load the next higher page in anticipation of a forward progressive read? This would eliminate page load waits for forward progressions through memory but would not eliminate page load waits for reverse progressions through memory.

[..]

I'd suggest testing on radically different hardware. Most mobile devices are running some form of ARM SoC. See if the test cases show similar skew on Intel hardware, like a PC or a Mac. If you can dig up an old PowerPC Mac, even better. If these don't show similar results, then that would point to something unique on the ARM platform or its Java implementation.

[..]

Right, your access patterns are mostly sequential, but in different directions. If something underneath you is doing prefetch but only in one direction (prefetch next higher address block), then that will skew the results in favor of the tests that run in that direction.

UPDATE 2: I ran the tests on a PC (Core i7, Nehalem architecture from Feb. 2009, 8 GB RAM, Windows 7). I used C#.NET in a .NET 2.0 source project (though .NET 4 is installed on the machine). My results, with 25 million Message instances:

  • Linked list: 57-60 ms
  • Array: 60-63 ms

The direction of reading didn't seem to influence the results.

Thomas Calc
  • the linked list is visited from high address to low address. what if it's the other way around, i.e. `next` points to a newer object, not a previous object – irreputable Dec 14 '12 at 05:30
  • I appreciate the thought you put into this, but I really can't find a question from your post. :) Edit: And array iteration is 50% faster. 100% faster would mean zero time consumed. – Torben Dec 14 '12 at 08:27
  • Also I would say that this whole exercise seems pretty useless. You are talking about 4ms improvement over 100000 iterations. Seems like premature optimization. If you have a situation where this is a bottleneck, then please describe it and we can look into it (because it definitely would be a more interesting problem than this). – Torben Dec 14 '12 at 08:32
  • Also, your test is so completely flawed that I am starting to wonder if you are actually trolling us. You initialize the tests with different random number generators and thus execute a different number of integer addition operations in each test case and test run. None of your measurements can be reproduced or compared to anything. – Torben Dec 14 '12 at 10:26
  • @Torben: your last statement about random number generators is unclear to me -- if the measurement result is influenced by this, how is it possible that the linked list **always** performs worse than the array? If the different number of integer addition operations introduced a deviation in the results, that deviation should affect **both** scenarios -- the integer addition doesn't "know" whether it runs in an array or in an LL iteration... (Other comments are at the end of my post and at your answer post.) – Thomas Calc Dec 14 '12 at 16:50
  • @irreputable: your suggestion was very helpful! I got completely different results with that -- please see the new, second section of my original post. – Thomas Calc Dec 14 '12 at 17:55
  • @ThomasCalc That's interesting. Maybe you should create a new question about it (nobody reads the old questions) – irreputable Dec 14 '12 at 18:30
  • Is there any indication that the hardware implements a look-ahead page prefetch in the memory cache controller? Instead of loading only the memory page needed for a memory reference, also load the next higher page in anticipation of a forward progressive read? This would eliminate page load waits for forward progressions through memory but would not eliminate page load waits for reverse progressions through memory. – dthorpe Dec 14 '12 at 18:49
  • @dthorpe: What you say sounds very likely to me (I guessed something similar, but I'm not familiar with low-level memory management, so I was unable to think of anything technically accurate). What do you suggest as the next step? The results are the same on my Android 2.2 device from June 2011 as on the brand-new Galaxy S3. Android is based on a Linux kernel, so what you say must happen in that kernel, if I'm correct. – Thomas Calc Dec 14 '12 at 18:53
  • @dthorpe: Ah, my bad, you mentioned the cache controller, i.e. hardware, so disregard my comment about the Linux kernel. If it's hardware, then it probably worked the same on ARM architectures at the beginning of 2011 as now (considering my test on the Samsung Galaxy S3). – Thomas Calc Dec 14 '12 at 19:01
  • I'd suggest testing on radically different hardware. Most mobile devices are running some form of ARM SoC. See if the test cases show similar skew on Intel hardware, like a PC or a Mac. If you can dig up an old PowerPC Mac, even better. If these don't show similar results, then that would point to something unique on the ARM platform or its Java implementation. – dthorpe Dec 14 '12 at 19:15
  • I read the following material, it has interesting points (http://www.ecs.umass.edu/ece/andras/courses/ECE668/Mylectures/Prefetching-lecture.ppt), but it looks like only **sequential** access is required by such prefetching strategies, and that is given in all of my scenarios (just when it's 100% **slower**, the traversal is done along **decreasing** physical memory addresses). – Thomas Calc Dec 14 '12 at 19:17
  • Right, your access patterns are mostly sequential, but in different directions. If something underneath you is doing prefetch but only in one direction (prefetch next higher address block), then that will skew the results in favor of the tests that run in that direction. – dthorpe Dec 14 '12 at 19:19
  • @dthorpe: I don't have a Mac/PowerPC, but I can try on my Core i7 PC (2009) with C#.NET or Java. I must deal with other duties, so for the time being, I might just put a bounty on the question and wait for someone else to "solve" the "riddle". – Thomas Calc Dec 14 '12 at 19:20
  • @dthorpe: I ran the tests with Microsoft .NET on a Core i7 PC with Windows 7, and the difference didn't appear, i.e. the reading direction didn't matter. I also updated the end of my original post with details. – Thomas Calc Dec 14 '12 at 23:40
  • @StephenC: it's sad if length is your argument against it. It details "original research" (exactly as SO asks), instead of the typical one-line SO question "do my homework, please". – Thomas Calc Dec 15 '12 at 01:54
  • @StephenC: honestly, who does this question *hurt*? – Thomas Calc Dec 15 '12 at 01:57
  • If you want to present original research, write a research paper or start a blog. This is not a real question, WAY TOO LONG, not relevant to real programming (ergo off-topic), not amenable to objective answers and ... (in my opinion) tedious and boring. – Stephen C Dec 15 '12 at 02:03
  • There are 2 votes to close this question. Quoting SO: "We expect answers to be supported by **facts, references, or specific expertise**, but this question will likely solicit debate, arguments, polling, or extended discussion." Dear voters, which part of this question (or its answer) is not based on facts, references or expertise?! – Thomas Calc Dec 15 '12 at 02:04
  • And - frankly - campaigning for your question to stay open is pretty "naff". – Stephen C Dec 15 '12 at 02:06
  • @StephenC: it's a pity that "why does array iteration performance depend on traversal direction on Android devices?" is not a real question for you. Not suitable for "objective answers"? It already has a very clear and objective answer. And yes, "boring" is subjective. "Too long" belongs to kindergarten, not to science/development. – Thomas Calc Dec 15 '12 at 02:07
  • @StephenC: "And - frankly - campaigning for your question to stay open..." -- may be, but spending your time for downvoting constructive questions and calling them "boring" also reveals something about yourself. – Thomas Calc Dec 15 '12 at 02:09
  • @ThomasCalc - nope. It is a **guess**. And all answers to this are going to be guesses unless someone has access to a hardware simulator to figure whether this is really due to prefetch prediction or one of the other possible explanations. – Stephen C Dec 15 '12 at 02:11
  • @StephenC: then is the accepted answer of this question also a **damn guess**? http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array I don't see any hardware simulator used there... – Thomas Calc Dec 15 '12 at 02:18
  • @ThomasCalc - Basically, yes. He didn't back his explanation up with concrete evidence ... in the form of simulator results or technical documentation. Without concrete evidence, his explanation is little more than an educated guess. (Of course, he may be right ... but we've no way to know.) The problem is that these behaviours depend on all sorts of things, some of which are hardware dependent and difficult to measure or quantify. And the interactions can be equally complicated. This is illustrated by the fact that he saw different results to yours. – Stephen C Apr 05 '13 at 13:41

2 Answers


Talking about PC hardware, early hardware prefetchers (say circa 2005) were better at detecting and prefetching forward accesses, but more recent hardware should be good at detecting both directions. If you are interested in mobile hardware, it is totally possible that it still implements basic forward-only prefetching.

Outside of proper prefetching implemented in the memory controller, which actually detects access patterns, it is very common for hardware to fetch more than one cache line when a cache miss occurs. Often this takes the form of simply fetching the next cache line, in addition to the required one, when a miss occurs. This implementation would give the forward direction a big advantage by effectively halving the cache miss rate in that case (assuming the pattern-detecting prefetcher is ineffective).
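
A back-of-envelope model makes the halving argument concrete. The numbers below are assumptions for illustration (64-byte cache lines, roughly 16 bytes per Message instance, i.e. four instances per line), not measured values:

// Rough model of the "also fetch the next line on a miss" policy.
int objectsPerLine = 64 / 16;  // assumed: 4 Message instances per cache line
// Forward walk: one miss brings in the required line plus the next one,
// covering 2 * 4 = 8 consecutive instances per miss.
double forwardMissesPerObject = 1.0 / (2 * objectsPerLine);  // 1 miss / 8 objects
// Backward walk: the extra (next-higher) line is never the one needed next,
// so each miss covers only the 4 instances of the required line.
double backwardMissesPerObject = 1.0 / objectsPerLine;       // 1 miss / 4 objects
System.out.println("forward/backward miss ratio = "
        + (forwardMissesPerObject / backwardMissesPerObject)); // 0.5 -> ~2x faster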

Locally, on a Core i7, I get slightly better results for the linked list version, at ~3.3 ms for the whole iteration vs 3.5 ms for the array version - when using the original program (which iterates the linked list in reverse order of creation). So I don't see the same effect you did.

The inner loop of your test, the conditional check on value1 that updates val, has a big impact. The current loop will cause a lot of mispredicts, unless the JIT compiler is smart enough to use CMOV or something similar. It seems that in my test it was, since I got about 1 ns / iteration for small iteration counts that fit in L1; 1 ns (about 3 cycles) isn't consistent with a full branch mispredict. When I changed it to do an unconditional val += msg.value1, the array version got a significant boost, even in the 1,000,000-iteration case (which probably won't even fit in L3).
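
For clarity, the branch-free variant of the array loop would look like this (a reconstruction of the transformation described above, not the exact benchmark code):

// No comparison in the loop body, so there is no branch to mispredict.
for (int i = 0; i < cnt; i++) {
    val += messages[i].value1;
}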

Interestingly enough, the same transformation (val += msg.value1) made the linked list version slightly slower. With the transformation, the array version was considerably faster at small iteration counts (inside L2); outside that, the two approaches were comparable. From Caliper:

  length method         ns linear runtime
     100  ARRAY       63.7 =
     100 LINKED      190.1 =
    1000  ARRAY      725.7 =
    1000 LINKED     1788.5 =
 1000000  ARRAY  2904083.2 ===
 1000000 LINKED  3043820.4 ===
10000000  ARRAY 23160128.5 ==========================
10000000 LINKED 25748352.0 ==============================

The behavior for small iteration counts is easier to explain: the linked list has to use pointer chasing, so there is a data dependency between iterations of the loop. That is, each iteration depends on the previous one, because the address to load comes from the previous element. The array doesn't have this data dependency: only the increment of i carries a dependency, and that is very fast (i is certainly in a register here). So the loop can be pipelined much better in the array case.
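
A comment-annotated sketch of the two loops makes the difference in the dependency chains visible:

// Linked list: the address of the next load is the *result* of the current
// load, so the loads form a serial chain (one full load latency per iteration).
while (current != null) {
    val += current.value1;
    current = current.next; // next address known only after this load completes
}

// Array: the address of messages[i] depends only on i (a cheap register
// increment), so loads from several iterations can be in flight at once.
for (int i = 0; i < cnt; i++) {
    val += messages[i].value1;
}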

BeeOnRope
  • I've just also updated my original post with my Core i7 results, and they're like yours: the linked list is a tad bit faster. And I also don't see the effect that occurs on mobile devices (HTC Desire HD, Samsung Galaxy S3). Perhaps the ARM SoC uses a basic forward-only prefetching then. I'll repeat the test once I get an x86 Android device, and will post the results here (it might be after months). – Thomas Calc Dec 14 '12 at 23:43
  • in fact, when reading your answer, I realized the following: perhaps the mobile hardware even doesn't apply any pattern-based prefetching, and what we see is purely the "more cache lines per cache miss" effect? If mobile hardware had any pattern based data prefetching, why would it work so primitively, only in one direction nowadays? Just guessing, of course. – Thomas Calc Dec 14 '12 at 23:51
  • Just a note regarding your newly added paragraph about mispredicts: on the mobile device (in Android), I made a test earlier where the linked list was *merely* traversed without any other operation (i.e. only "current = current.next"), and it still was slow. However, I didn't try the same with the array. – Thomas Calc Dec 14 '12 at 23:54
  • I'm partial to the "two cache line effect" because it explains the almost exactly 1/2 speed behavior... All this is a good example of how different hardware can have wildly different results - but remember also that the Dalvik JIT is wildly different than the Sun JIT. – BeeOnRope Dec 15 '12 at 00:08

I don't know the answer, but I would start by looking at the size of the generated bytecode. Since in the array case the number of iterations is known (cnt is hardcoded and final), the compiler may have unrolled some iterations, saving on the jump and comparison instructions.

Also, if you know the basics of how a program works at the low-level layers, looking at the disassembled bytecode might give you some hints (for example, javap -c prints JVM bytecode; on Android, dexdump can disassemble the Dalvik bytecode). Even if you are not fluent in assembly languages, it is not too hard to understand a simple program like yours (I was surprised at how much I could figure out the first time I saw some disassembled Java code).

Hope this helps.

Djizeus
  • Thanks for the advice. I checked the bytecode, and there were some slight differences (and yes, the value of *cnt* is actually inlined as a literal), but nothing major that could explain the difference. On the other hand, I updated my original post with a second major section, where I have new results. – Thomas Calc Dec 14 '12 at 17:54