Code injecting/assembly inlining in Java?

Question

I know Java is a secure language but when matrix calculations are needed, can I try something faster?

I am learning __asm{} in C++ with the Digital Mars compiler, and also FASM. I want to do the same in Java. How can I inline assembly code in functions? Is this even possible?

Something like this (a vectorized loop to clamp all elements of an array to a value without branching, using AVX support of CPU):

JavaAsmBlock(
   # get pointers into registers somehow
   # and tell Java which registers the asm clobbers somehow
     vbroadcastss  twenty_five(%rip), %ymm0
     xor   %edx,%edx
.Lloop:                            # do {
    vmovups   (%rsi, %rdx, 4), %ymm1
    vcmpltps   %ymm1, %ymm0, %ymm2
    vblendvps  %ymm2, %ymm0, %ymm1, %ymm1  # TODO: use vminps instead
    vmovups    %ymm1, (%rdi, %rdx, 4)
    # TODO: unroll the loop a bit, and maybe handle unaligned output specially if that's common
    add         $32, %rdx
    cmp         %rcx, %rdx
    jb     .Lloop                  # } while(idx < count)
    vzeroupper
);

System.out.println(var[0]);

I don't want to use a code-injector. I want to see the Intel or AT&T style x86 instructions.

If you're writing asm like that (16-bit registers, and using `div` by 4 instead of `shr al, 2`), [it's definitely *not* going to be faster than what a C compiler could make for you](https://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466), so you should just use JNI with C or C++. Asm is only useful for performance if you know how to tune for the microarchitecture of current CPUs. This is a useful question, but the example shows why most people *shouldn't* use asm. — Peter Cordes, Oct 07 '17 at 18:24
You are right. Two things at the same time. I would have added something like an AVX dot product with a proper order of instructions if I'd had enough experience at the time. — huseyin tugrul buyukisik, Oct 07 '17 at 18:53
You could edit the question to use something modern. Like maybe BMI2 `pdep`, which has no Java intrinsic. Ideally you could come up with something that you couldn't just as easily get a C compiler to emit for you, though. — Peter Cordes, Oct 07 '17 at 18:57
I had Intel's opencl-c compiler create a branchless "vectorized clamp to 25.0f" procedure and put only a part of it here (https://codeshare.io/29pqeB). Would you mind looking at it? Should I add the full code, or would that divert the main idea of the question? — huseyin tugrul buyukisik, Oct 07 '17 at 20:29
I fixed your asm to include the actual loop, instead of just the loop overhead but no branch. And optimized it to be something you might actually want to use for high performance. You used a signed 32-bit loop counter in a way that forced the compiler to sign-extend it inside the loop every iteration. — Peter Cordes, Oct 07 '17 at 22:09
Thank you very much. Actually, I didn't tell the compiler how many elements are to be processed (the count should be a large multiple of 8). Did it choose that somehow, on the assumption that the CPU is an Intel and there are fewer than 4G elements? I'm using an FX-8150. — huseyin tugrul buyukisik, Oct 07 '17 at 22:18
Well the only source you included was a function for 8 floats from memory. It's up to you to put it in a loop. And what makes you think that it decided to optimize specifically for Intel? Splitting 256b stores is good for Piledriver even if they're aligned, because of a CPU performance bug or something with AVX stores. If tuning specifically for piledriver, maybe using only XMM instructions would have been even better, but that compiler output would be ok. Anyway, the asm in the question is now a good generic example that doesn't distract readers with any uarch tuning. — Peter Cordes, Oct 07 '17 at 22:27
There was a warning in its documentation that it is optimized for Intel only, but the generated code is at least as fast as I need. This is Intel's "code-builder" add-on for Visual Studio. — huseyin tugrul buyukisik, Oct 07 '17 at 22:29
The codeshare link has a `.ident "clang version 3.6.2 "` line. So presumably you're using an old clang version. — Peter Cordes, Oct 07 '17 at 22:36
I didn't know Intel was using clang for its OpenCL compiler :) Maybe it's better than gcc 6.x, which refused to compile the way I needed (on Linux at least, but now I'm on Windows). — huseyin tugrul buyukisik, Oct 07 '17 at 22:39
Or is clang already on Windows (somehow preinstalled), and the add-on uses it just like Ubuntu has gcc by default? — huseyin tugrul buyukisik, Oct 07 '17 at 22:46
Just like you would probably not use C++ for business applications (or not all of them) and not Python to write an operating system, it seems that Java is not the right tool for the task at hand here — IceFire, May 26 '20 at 05:56

Ernest Friedman-Hill · Accepted Answer · 2012-07-24T13:48:12.833

There is a layer of abstraction between your Java code and the underlying hardware that makes this kind of thing impossible in principle; you technically can't know how your code is represented on the underlying machine, since the same bytecode can run on different processors and different architectures.

What you officially can do is use the Java Native Interface (JNI) to call native code from your Java code. The call overhead is substantial, and sharing data with Java is fairly expensive, so this should be used only for decent-sized chunks of native code.
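As a rough illustration of the Java side of such a call, the sketch below declares a native method next to a pure-Java version of the clamp loop from the question. The class name `ClampDemo`, the method names, and the idea of keeping a Java fallback are assumptions made for illustration. The native method is only declared and never called here, so the example runs without any native library.

```java
import java.util.Arrays;

public class ClampDemo {

    // Hypothetical native entry point. A C or assembly implementation would
    // have to export it under the JNI name Java_ClampDemo_clampNative.
    // It is declared only to show the Java side; this sketch never calls it.
    public static native void clampNative(float[] data, float max);

    // Pure-Java version of the same operation: every element larger than
    // max is replaced by max. Useful as a fallback and as a reference when
    // checking a native implementation.
    public static void clamp(float[] data, float max) {
        for (int i = 0; i < data.length; i++) {
            data[i] = Math.min(data[i], max);
        }
    }

    public static void main(String[] args) {
        float[] var = {10.0f, 30.0f, 25.0f, 99.5f};
        clamp(var, 25.0f);
        System.out.println(Arrays.toString(var)); // prints [10.0, 25.0, 25.0, 25.0]
    }
}
```

In a real setup, the library containing the native implementation would be loaded with `System.loadLibrary` before the first call to `clampNative`.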

In theory, such an extension should be possible, though. One can imagine a Java compiler that targeted a specific platform and allowed assembly escapes. The compiler would have to publish its ABI, so you'd know the calling conventions. I'm not aware of any that do, however. But there are several compilers available that compile Java directly to native code; it's possible one of them supports something like this without my knowing, or could be extended to do so.

Finally, on a different level altogether, there are bytecode assemblers for the JVM, like Jasmin. A bytecode assembler lets you write "machine code" that targets the JVM directly, and sometimes you can write better code than the javac compiler can generate. It's fun to play with, in any event.

Of the available ahead-of-time Java to native code compilers, [Excelsior JET](http://www.excelsiorjet.com) only implements JNI, whereas [GCJ](http://gcc.gnu.org/java/) supports both JNI and also its own interface called [CNI](http://gcc.gnu.org/onlinedocs/gcj/About-CNI.html). — Dmitry Leskov, Jul 26 '12 at 10:57
To be clear, the overhead is only "substantial" if you consider a few 10s of cycles substantial (the typical overhead of a JNI call) - for methods such as the above that operate over an array of reasonable size, the JNI overhead should disappear in the noise (as long as data passing is done right, e.g., with the `Get*Critical` functions to operate directly on the underlying array). — BeeOnRope, Jun 14 '18 at 17:47

Pyves · Answer 2 · 2018-06-14T17:04:59.200

You cannot directly inline assembly in your Java code. Nevertheless, contrary to what some other answers claim, it is possible to call assembly conveniently without going through any intermediary C (or C++) layer.

Quick walkthrough

Consider the following Java class:

public class MyJNIClass {

    public native void printVersion();

}

The main idea is to declare a symbol using the JNI naming convention. In this case, the mangled name to use in your assembly code is Java_MyJNIClass_printVersion. This symbol must be visible from other translation units, which can for instance be achieved using the public directive in FASM or the global directive in NASM. If you're on macOS, prepend an extra underscore to the name.
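The naming convention can be sketched as a small Java helper that builds the symbol name from a class name and a method name. The class `JniNames` and its methods are hypothetical names for this illustration; the escaping rules follow the JNI specification's short-name mangling as I understand it, and the longer form used for overloaded native methods (with an appended signature) is deliberately left out.

```java
public class JniNames {

    // Escapes one name component using the JNI short-name rules:
    // '.' or '/' -> "_", '_' -> "_1", ';' -> "_2", '[' -> "_3",
    // any other non-alphanumeric character -> "_0xxxx" (lowercase hex).
    public static String mangle(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')) {
                sb.append(c);
            } else if (c == '.' || c == '/') {
                sb.append('_');
            } else if (c == '_') {
                sb.append("_1");
            } else if (c == ';') {
                sb.append("_2");
            } else if (c == '[') {
                sb.append("_3");
            } else {
                sb.append(String.format("_0%04x", (int) c));
            }
        }
        return sb.toString();
    }

    // Symbol name for a non-overloaded native method.
    public static String symbolFor(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }

    public static void main(String[] args) {
        // prints Java_MyJNIClass_printVersion
        System.out.println(symbolFor("MyJNIClass", "printVersion"));
        // prints Java_com_example_My_1Class_doWork
        System.out.println(symbolFor("com.example.My_Class", "doWork"));
    }
}
```

The second call uses a made-up package and class name to show that an underscore in the Java name becomes `_1` in the symbol, which is easy to miss when typing the label by hand in an assembly file.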

Write your assembly code with the calling conventions of the targeted architecture (arguments may be passed in registers, on the stack, in other memory structures, etc.). The first argument passed to your assembly function is a pointer to JNIEnv, which itself is a pointer to the JNI function table. Use it to make calls to JNI functions. For instance, using NASM and targeting x86_64:

global Java_MyJNIClass_printVersion

section .text

Java_MyJNIClass_printVersion:
    mov rax, [rdi]
    call [rax + 8*4]  ; pointer size in x86_64 * index of GetVersion
    ...

Indexes for JNI functions can be found in the Java documentation. As the JNI function table is basically an array of pointers, don't forget to multiply these indexes by the size of a pointer in the targeted architecture.
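The offset arithmetic behind `8*4` in the NASM snippet can be written out as a tiny Java sketch. The class and method names are hypothetical; the index 4 for GetVersion is taken from the snippet above, and the 4-byte pointer size is only included to show how the offset changes on a 32-bit target.

```java
public class JniTableOffset {

    // Index of GetVersion in the JNI function table, as used in the NASM snippet.
    public static final int GET_VERSION_INDEX = 4;

    // Byte offset of a table entry: index times the pointer size of the target.
    public static long byteOffset(int index, int pointerSizeBytes) {
        return (long) index * pointerSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(byteOffset(GET_VERSION_INDEX, 8)); // prints 32 (x86_64)
        System.out.println(byteOffset(GET_VERSION_INDEX, 4)); // prints 16 (32-bit x86)
    }
}
```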

The second argument passed to your assembly function is a reference to the calling Java class or object. All subsequent arguments are the parameters of your native Java method.

Finally, assemble your code to generate an object file, and then create a shared library from that object file. GCC and Clang can perform this last step with a command similar to gcc/clang -shared -o ....

Additional resources

A more comprehensive walkthrough is available in this DZone article. I have also created a fully runnable example on GitHub (https://github.com/PyvesB/JavAssembly); feel free to take a look and play around with it to get a better understanding.

Well it's using the same JNI implementation as with C or C++, but yes, from a lower-level. ;-) — Pyves, May 12 '17 at 08:08
You could have written `mov rax, [rdi]` / `call [rax + 8*4]`. x86 addressing modes are more efficient than extra instructions. memory-indirect call isn't faster than load + call, but it isn't slower either and saves code size and decode bandwidth. (Hmm, actually according to http://agner.org/optimize/, it might be slower on AMD, since it's more than 2 uops and that means VectorPath (microcoded), not DirectPath. If tuning for AMD, maybe `mov rax, [rdi]` / `mov rax, [rax + 8*4]` / `call rax`. Still no ADD instruction, that's always worse) — Peter Cordes, Oct 07 '17 at 18:16
@PeterCordes Thanks for these insights, I have modified my answer accordingly. I'll also go and modify the code on the repository, unless you're interested in submitting a pull request? — Pyves, Oct 08 '17 at 15:54
I don't have a Java dev env set up to check that I didn't break something, so go ahead and change it yourself. — Peter Cordes, Oct 08 '17 at 19:20
No worries, done [here](https://github.com/PyvesB/JavAssembly/commit/54ec969df9f4817d36dadb5b6b20d7b05a626490). — Pyves, Oct 08 '17 at 21:53

alexbav · Answer 3 · score 2 · answered Dec 31 '14 at 00:31

It is possible to call assembly from Java using the Machine Level Java technology. It transparently packs your assembly code, which is written in Java with a syntax very similar to the most widely used assembly syntax, into a native library. You then just call a native method that you define in the same class where your assembly is written. So you always stay within the Java environment and have no need to switch from your Java IDE to assembly tools and back.

Looks like the API you are suggesting lacks documentation. Can you provide more details? – Nicola Ferraro Dec 31 '14 at 00:58
Lower API/interface latency than JNI? – huseyin tugrul buyukisik Dec 31 '14 at 21:06

andrew cooke · Answer 4 · score 1 · answered Jul 24 '12 at 13:43 · edited Sep 07 '17 at 13:01 by Cody Gray - on strike

You cannot call assembly directly from Java. But you can call C code via JNI, and from there you can call assembly.

This article shows how.

Very nice. I will try that. I am using the Digital Mars compiler. Do you think it is possible with __asm? Never mind, I will try it myself. Thanks. – huseyin tugrul buyukisik Jul 24 '12 at 13:44
As far as I can remember, you can use whatever C compiler you like. Java simply uses the platform ABI. – andrew cooke Jul 24 '12 at 13:45
You can write a function in assembly that follows the C ABI, and thus can be called the same as a C function. Basically, whatever you would do in a C function to make it JNI compatible, you can do in asm. – Peter Cordes Oct 07 '17 at 18:10

belgther · Answer 5 · score 1 · answered Jul 24 '12 at 13:44

You can use JNI or JNA to call your native functions from Java. Alternatively, you can read bytecode from an InputStream and make a Java class out of it.

Dmitry Leskov · Answer 6 · score 1 · answered Jul 26 '12 at 10:51

You may also wish to have a look at Aparapi.

Isn't Aparapi for parallel programming on the GPU? – huseyin tugrul buyukisik Jul 28 '12 at 14:08
Yes. Didn't you ask how to do matrix calculations faster? – Dmitry Leskov Jul 29 '12 at 16:34
