1

I'm looking for a bytecode optimizing algorithm that computes unnecessary calculations and removes dead local variables. Imagine this code:

iconst_2
iconst_3
iadd
invokestatic someMethod (I)V

It can easily be optimized to:

iconst_5
invokestatic someMethod (I)V

My first idea to implement this was to use a custom ASM interpreter that captures the instructions (like SourceInterpeter) and automatically computes stack values. Then replace the last calculation with known outcome (in this case iadd) with the real value, and iterate backwards to remove all captured instructions. This would work on many cases, but not on every case. Here the optimization would be a bit harder:

iconst_2
iconst_3
dup
getstatic someUnknownInt I
iadd
invokestatic someOtherMethod (I)V
iadd
invokestatic someMethod (I)V

If iadd is replaced by iconst_5, and iconst_2 and iconst_3 get removed, then we would have invalid bytecode. My second thought was to do multiple passes. First, replace all dup instructions with their real value as new push instruction (if it isn't a method / field reference), then remove all pop instructions (but don't remove methods), and finally compute calculations. This also doesn't work every time.

iconst_2
invokestatic someMethod ()I
dup2
invokestatic someOtherMethod (I)V
swap
invokestatic someOtherMethod (I)V
iadd //this can be computed -> iconst_4
invokestatic someOtherMethod (I)V

The last thing i tried was to use a SourceInterpreter and rewrite the instructions. Only obligatory instructions (like method calls, monitorenter, returns, etc.) get added, with the stack they take as rewritten code before them. This leads to invalid bytecode, as pop operations are not kept (after methods) and duplication of method calls (if the next obligatory instruction takes the previous method as argument). What is the best way to do this?

Aura Lee
  • 416
  • 3
  • 11
  • Are you sure this kind of optimization ist not already done by the compiler? – Ackdari Apr 21 '20 at 09:12
  • It is mostly done by the compiler, but my project targets obfuscated files. – Aura Lee Apr 21 '20 at 09:15
  • Also (in your third example) how do you know you can replace the `iadd` with `inconst_4` if you don't know what the called methodes do? Please elaborate on that. – Ackdari Apr 21 '20 at 09:21
  • because before iadd, only two iconst_2 remain on the stack. 2+2=4 – Aura Lee Apr 21 '20 at 10:26
  • 3
    `SourceInterpreter` is too simple. You’d need to implement your own interpreter. E.g., use the `ConstantTracker` of [this answer](https://stackoverflow.com/a/48806265/2711488) as a starting point. Some of the optimizations you have in mind can be implemented within the `merge` method. For more complex operations, you’d have to replace `ConstantValue` with a new type representing compound expressions. Then, perform optimizations at that expression level and implement a way to convert them back to bytecode. – Holger Apr 21 '20 at 10:33
  • I already have done almost the same as you said, an Interpreter with Values that combine the instructions (e.g. BinaryOpValue takes two CodeReferenceValue [abstract] as arguments). (If you are unsure what i mean: [link](https://github.com/GraxCode/threadtear/tree/master/src/me/nov/threadtear/analysis/full)). The problem IS the way of converting them back to bytecode. As i said it's not really possible to use Interpreter here, as methods that pop off the stack are lost. – Aura Lee Apr 21 '20 at 10:43
  • If your goal is just reverse engineering obfuscated code, the Krakatau decompiler might be of some use, since it already performs a number of simplifications like this (though it doesn't try to handle the third example either). – Antimony Apr 21 '20 at 13:01
  • My goal is not to decompile, my goal is to optimize. – Aura Lee Apr 21 '20 at 13:11
  • 1
    Isn’t the analyzer already interpreting the effect of the `dup` instruction? This would imply that generating the bytecode for the resulting expressions should replicate the necessary instructions. Seems to be related to [this Q&A](https://stackoverflow.com/q/60969392/2711488). – Holger Apr 21 '20 at 17:57
  • Imagine, `invokestatic method()I` and then `pop`. That method would never be related to some other method, as it is immediately removed from the stack. It would get lost and not end up in the optimized code. – Aura Lee Apr 21 '20 at 20:05
  • 1
    But there should be a corresponding `naryOperation` call. So if you remember this expression (generally, all expressions with potential side effects) and after finishing the analysis, when it doesn’t appear as a subexpression of another construct which consumes it, you know that it is only executed for its side effect. Then, when it doesn’t have a `void` return type, you may insert a `pop` again. – Holger Apr 22 '20 at 10:19
  • You are right. Rewriting code during interpretation is probably the solution. – Aura Lee Apr 23 '20 at 16:05
  • i took a look at the Interpreter class, and it seems like POPs are not being interpeted. I would have to rewrite the Analyzer for that. – Aura Lee Apr 26 '20 at 08:49

0 Answers0