What optimizations can I expect from Dalvik and the Android toolchain?

Question

I'm working on a high-performance Android application (a game), and though I try to code for readability first, I like to keep in the back of my mind a picture of what is happening under the hood. With C++, I've developed a fairly good intuition about what the compiler will and won't do for me. I'm trying to do the same for Java/Android.

Hence this question. I could find very little about this topic on the web. Will the Java compiler, Dalvik converter (dx) and/or JITter (on Android 2.2+) perform optimizations like the following?

Method inlining. Under what conditions? private methods can always safely be inlined; will this be done? How about public final methods? Methods on objects of other classes? static methods? What if the runtime type of the object can easily be deduced by the compiler? Should I declare methods as final or static wherever possible?
Common subexpression elimination. For example, if I access someObject.someField twice, will the lookup be done only once? What if it's a call to a getter? What if I use some arithmetic expression twice; will it be evaluated only once? What if I use the result of some expression, whose value I know not to change, as the upper bound of a for loop?
Bounds checking on array lookups. Will the toolchain eliminate this in certain conditions, like the archetypical for loop?
Value inlining. Will accesses to some public static final int always be inlined? Even if they're in another class? Even if they're in another package?
Branch prediction. How big an issue is this even? Is branching a large performance hit on a typical Android device?
Simple arithmetic. Will someInt * 2 be replaced by someInt << 1?

Etcetera...

This might be useful: http://developer.android.com/guide/practices/design/performance.html — pablochan, Feb 06 '11 at 10:32
These might also be useful: http://www.netmite.com/android/mydroid/dalvik/docs/dexopt.html, http://www.taranfx.com/android-internals-jit-froyo, and the pdf version of the presentation: http://www.android-app-developer.co.uk/android-app-development-docs/android-jit-compiler-androids-dalvik-vm.pdf — Lior, Feb 06 '11 at 11:11
@pablochan: That one's been under my pillow for some time, but thanks :) @Lior: Good references, those are new to me, thanks! — Thomas, Feb 06 '11 at 11:35
"someInt * 2" and "someInt << 2" aren't exactly doing the same thing ;) I take it you mean "someInt << 1" and I edited your question... — SyntaxT3rr0r, Feb 06 '11 at 13:36
D'oh, thanks :) (Even then, they might have subtly different semantics in Java w.r.t. overflow or negative numbers? Dunno, never checked...) — Thomas, Feb 06 '11 at 13:39
It should be possible to write an optimizer for .class (or .dex) files, which can at least implement some of these optimisations (e.g. hoisting with local or final values). I was hoping Proguard would do it but it looks like they don’t; probably they never bothered because the JVM already does it. — Laurens Holst, Jul 28 '13 at 16:55
This is a really good question. Although some of your considerations could be reduced by running code optimiser like ProGuard. — Igor Čordaš, Jun 24 '14 at 20:06

score 106 · Accepted Answer · answered Feb 08 '11 at 07:20

This is Ben, one of the engineers working on the JIT @ Google. When Bill and I started on this project, the goal was to deliver a working JIT as soon as possible with minimal impact to resource contention (eg memory footprint, CPU hijacked by the compiler thread) so that it can run on low-end devices as well. Therefore we used a very primitive trace based model. That is, the compilation entity passed to the JIT compiler is a basic block, sometimes as short as a single instruction. Such traces will be stitched together at runtime through a technique called chaining so that the interpreter and code cache lookup won't be invoked often. To some degree the major source of speedup comes from eliminating the repeated interpreter parsing overhead on frequently executed code paths.

That said, we do have quite a few local optimizations implemented with the Froyo JIT:

Register allocation (8 registers for v5te target since the JIT produces Thumb code / 16 registers for v7)
Scheduling (eg redundant ld/st elimination for Dalvik registers, load hoisting, store sinking)
Redundant null check elimination (if such redundancy can be found in a basic block).
Loop formation and optimization for simple counted loops (ie no side-exit in the loop body). For such loops, array accesses based on extended induction variables are optimized so that null and range checks are only performed in the loop prologue.
One entry inline cache per virtual callsite w/ dynamic patching at runtime.
Peephole optimizations like power-reduction on literal operands for mul/div.

In Gingerbread we added simple inlining for getters/setters. Since the underlying JIT frontend is still simple trace based, if the callee has branches in there it won't be inlined. But the inline cache mechanism is implemented so that virtual getters/setters can be inlined without problems.

We are currently working on enlarging the compilation scope beyond a simple trace so that the compiler has a larger window for code analysis and optimization. Stay tuned.

You signed up specifically to answer this? Thank you! Too bad that operators are so slow in rolling out Gingerbread; can't count on those optimizations for at least another year I think. This is all good stuff, but since I'm not a compiler writer, I'm having difficulty seeing how to apply it in practice. In particular: ... — Thomas, Feb 08 '11 at 07:36
(1) Is loop optimization also done if the loop's upper bound depends on a non-final variable (e.g. a field), or should I store the upper bound in a final local variable? (2) Same question for all fields that are accessed repeatedly. (3) Should I declare my methods `final` wherever possible? Or does that still count as a virtual callsite? (4) Inlining in general is not being done at the JIT level, but maybe you happen to know whether the compiler and/or dx tool do it? — Thomas, Feb 08 '11 at 07:38
(1) as long as the upper/lower bound is loop invariant, declare a local variable and read it outside the loop is the cleanest approach. (2) The JIT doesn't do loop-invariant optimization yet. So again you will want to burn a local variable to read it once outside the loop. (3) Yes please (4) We want to maintain the ability for the developer to debug the code so dx doesn't perform inlining. But soon the JIT will inline aggressively. — Ben, Feb 08 '11 at 07:52
@Ben +1 can you recap at this time which are the optimization done by Ant, considering that this question does still make sense. That would be awesome! — loretoparisi, Feb 29 '16 at 18:05

score 10 · Answer 2 · answered Feb 06 '11 at 12:28

10

I'm sure that my answer will not answer all of your questions but I guess it is a win if it answers even one.

You seem to have a profound knowledge on the subject and know what you want so you might want to do the following. Build an example application containing the aspects you want to investigate on.

Take the APK you get and run it through the APK Tool. Reverse engineering your own code to do just what you are intending is perfectly fine as we know.

The APK Tool will extract and decode your resources and will reverse engineer .dex files to .smali files. You might want to look up the smali project too to get more information about how to read the .smali files and about its limitations.

Again I'm pretty sure that this is not going to answer all of your questions but it might be a good start.

answered Feb 06 '11 at 12:28

Octavian Helm

39,405
19
98
102

1

Good answer, thanks. That line of investigation hadn't occurred to me yet, mainly because it would take a lot of time. This would show at least what the Java compiler and dx are doing, although the effects of the JITter remain uncertain. If I do feel inquisitive and go down this road, I'll definitely post my results back here. – Thomas Feb 06 '11 at 13:02
Yes please do that. I'm quite interested about the outcome myself. – Octavian Helm Feb 06 '11 at 13:05
1

javac does a few optimizations but nothing dramatic. "dx" provides a faithful conversion of its input. As Ben pointed out, if these things weren't true you'd struggle a lot with debuggers. For an example of this in practice, see http://groups.google.com/group/android-platform/browse_thread/thread/e4749164474fb429/93901e2e43a657c8 (especially the part where "dx" produces better code if you don't pass "-g" to javac). You should also look into ProGuard's optimizations. – fadden Feb 08 '11 at 23:24

JesusFreke · Answer 3 · 2011-02-07T03:42:55.943

First, let me preface this by saying that I'm not an expert on dalvik, and some of my responses may be wrong. But I have dug into the JIT code in dalvik, and I'm quite familiar with the bytecode that dalvik runs.

Method inlining - as far as I know, this never happens. I'm almost positive it never happens at the bytecode level, and I don't think it happens at the JIT level currently - although it might in the future.
Common subexpression elimination - I believe this would only be done for subexpressions that don't use any non-final variables/fields. I'm not entirely positive if it would happen even then. If it is done, I would expect it to be done at the bytecode level, probably not the JIT level.
Bounds checking on array lookups - no clue
Value inlining - As far as I know, yes - they will be inlined in all of those scenarios.
Branch prediction - not sure
Simple arithmetic - not as far as I know

Also, I'd like to mention another avenue of approach to you - dx and dalvik are both open source, so you can dig into them all you like. Although, they're obviously not small codebases, so would take a fair bit of effort to dig into them at that level

Well, if this is anything to go by, I've been doing well in hand-inlining my methods and caching subexpression results. Thanks! — Thomas, Feb 07 '11 at 08:48

What optimizations can I expect from Dalvik and the Android toolchain?

3 Answers3

Linked