In Unity, I noticed that I have been getting subpar performance in certain code logic compared to similar implementations in Kotlin. After profiling, I suspect that the language/runtime itself may somehow be slower. Therefore, I made a very short benchmark in both Kotlin and C# to measure the performance of basic operations:
The Kotlin part is as follows. Note that Matrix4 and Vector3 are classes from libGDX (a Java/Kotlin game library), and they are nothing more than containers of data. The mul function multiplies the matrix with the vector and stores the result in place back into the vector.
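import com.badlogic.gdx.math.Matrix4
import com.badlogic.gdx.math.Vector3
import kotlin.system.measureNanoTime

fun benchmark(a: Matrix4, b: List<Vector3>) {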
var i = 0;
while (i < 100000) {
b[i].mul(a);
++i;
}
}
var a = Matrix4(floatArrayOf(1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f))
var b = ArrayList<Vector3>(); // List is an interface and cannot be instantiated directly
for (i in 0 until 100000) { // 100000 elements, same as the C# version
b.add(Vector3(3f, 2f, 1f));
}
// warmup JIT
for (i in 0..9) {
benchmark(a, b)
}
var t: Double = 0.0;
for (i in 0..9) {
t += measureNanoTime {
benchmark(a, b)
}.toDouble()
}
println(t / 10.0 / 1000000.0) // milliseconds
The Unity C# part is as follows. Note that M4 and V3 are helper classes created to match what libGDX has.
private void Benchmark(M4 a, List<V3> b)
{
var i = 0;
while (i < 100000)
{
b[i].mul(a);
++i;
}
}
var a = new M4(1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f);
var b = new List<V3>();
for (int i = 0; i < 100000; ++i)
{
b.Add(new V3(3, 2, 1));
}
// warmup JIT
for (int i = 0; i < 10; ++i)
{
Benchmark(a, b);
}
var t = 0.0;
for (int i = 0; i < 10; ++i)
{
var s = (double) nanoTime();
Benchmark(a, b);
var e = (double) nanoTime();
t += e - s;
}
Debug.Log(t / 10.0 / 1000000.0); // milliseconds
The implementation of mul is made to match libGDX's exact implementation (https://github.com/libgdx/libgdx/bl...x/src/com/badlogic/gdx/math/Vector3.java#L353).
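For reference, the operation being benchmarked boils down to roughly the following. This is only a minimal sketch, not libGDX's code or my actual helper classes: Mat4 and Vec3 are hypothetical stand-in names, and the column-major layout of the 16 floats is an assumption. The vector is treated as (x, y, z, 1), multiplied by the matrix, and the result is written back into the same object.

```kotlin
// Hypothetical stand-ins for Matrix4/M4 and Vector3/V3; names and layout are assumptions.
class Mat4(val v: FloatArray) // 16 floats, assumed column-major: v[col * 4 + row]

class Vec3(var x: Float, var y: Float, var z: Float) {
    // Multiplies (x, y, z, 1) by the matrix and stores the result in place.
    fun mul(m: Mat4): Vec3 {
        val a = m.v
        // Compute all three components first so the old x, y, z are not overwritten too early.
        val nx = x * a[0] + y * a[4] + z * a[8] + a[12]
        val ny = x * a[1] + y * a[5] + z * a[9] + a[13]
        val nz = x * a[2] + y * a[6] + z * a[10] + a[14]
        x = nx
        y = ny
        z = nz
        return this
    }
}

fun main() {
    val m = Mat4(floatArrayOf(1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f))
    val p = Vec3(3f, 2f, 1f).mul(m)
    println("${p.x}, ${p.y}, ${p.z}") // prints 13.0, 16.0, 17.0
}
```

So each call is only nine multiplications and nine additions plus three field writes, with no allocation inside the loop.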
The device is a mid-2015 MacBook Pro. Unity version is 2020.3.0f1, building to OSX standalone with Mono backend, not a development build.
The results are as follows:
- Kotlin: 0.3658762ms
- Unity C#: 1.74067ms (almost 5 times slower). If I change M4 and V3 to be structs instead of classes, it becomes even slower: 2.51ms (almost 7 times slower).
What would be the cause of such a significant difference?