
In Unity, I noticed subpar performance in certain code compared to similar implementations in Kotlin. After profiling, I suspect that the language or runtime itself may be slower. To check, I wrote a very short benchmark in both Kotlin and C# to measure the performance of basic operations.

The Kotlin part is as follows. Note that Matrix4 and Vector3 are classes from libGDX (a Java/Kotlin game library), and they are little more than data containers. The mul function multiplies the vector by the matrix and stores the result in place, back into the vector.

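import com.badlogic.gdx.math.Matrix4
import com.badlogic.gdx.math.Vector3
import kotlin.system.measureNanoTime

fun benchmark(a: Matrix4, b: List<Vector3>) {
    var i = 0
    while (i < 100000) {
        b[i].mul(a) // in place: overwrites b[i] with the transformed vector
        ++i
    }
}

fun main() {
    val a = Matrix4(floatArrayOf(1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f))
    // mutableListOf, because a read-only List has neither a constructor nor add()
    val b = mutableListOf<Vector3>()
    // `until` excludes the upper bound, giving exactly 100000 elements like the C# version
    for (i in 0 until 100000) {
        b.add(Vector3(3f, 2f, 1f))
    }
    // warmup JIT
    for (i in 0..9) {
        benchmark(a, b)
    }
    var t = 0.0
    for (i in 0..9) {
        t += measureNanoTime {
            benchmark(a, b)
        }.toDouble()
    }
    println(t / 10.0 / 1000000.0) // milliseconds
}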

The Unity C# part is as follows. Note that M4 and V3 are helper classes written to mirror their libGDX counterparts.

private void Benchmark(M4 a, List<V3> b)
{
    var i = 0;
    while (i < 100000)
    {
        b[i].mul(a);
        ++i;
    }
}
 
var a = new M4(1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f, 3f, 2f, 1f, 2f, 3f, 4f);
var b = new List<V3>();
for (int i = 0; i < 100000; ++i)
{
    b.Add(new V3(3, 2, 1));
}
// warmup JIT
for (int i = 0; i < 10; ++i)
{
    Benchmark(a, b);
}
var t = 0.0;
for (int i = 0; i < 10; ++i)
{
    // nanoTime() is a Stopwatch-based helper (see the comments below)
    var s = (double) nanoTime();
    Benchmark(a, b);
    var e = (double) nanoTime();
    t += e - s;
}
Debug.Log(t / 10.0 / 1000000.0); // milliseconds

The implementation of mul is made to match libGDX's exact implementation (https://github.com/libgdx/libgdx/bl...x/src/com/badlogic/gdx/math/Vector3.java#L353).
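
For readers without the libGDX source at hand, here is a minimal sketch of what such an in-place mul looks like. The class names Mat4 and Vec3 are hypothetical stand-ins, and the sketch assumes a column-major array of 16 floats with the vector treated as (x, y, z, 1). It is an illustration of the operation being timed, not the actual libGDX or Unity helper code.

```kotlin
// Hypothetical stand-in for a 4x4 matrix stored as 16 floats in column-major order (assumption).
class Mat4(val v: FloatArray) {
    init {
        require(v.size == 16) { "expected 16 values" }
    }
}

// Hypothetical stand-in for a mutable 3D vector.
class Vec3(var x: Float, var y: Float, var z: Float) {
    // Treats this vector as (x, y, z, 1), multiplies it by the matrix,
    // and writes the result back into this instance.
    fun mul(m: Mat4): Vec3 {
        val a = m.v
        // Compute all three components first so no input is overwritten early.
        val nx = x * a[0] + y * a[4] + z * a[8] + a[12]
        val ny = x * a[1] + y * a[5] + z * a[9] + a[13]
        val nz = x * a[2] + y * a[6] + z * a[10] + a[14]
        x = nx
        y = ny
        z = nz
        return this
    }
}
```

Each call is only nine multiplications and nine additions, so in a loop of 100000 calls the per-element overhead (list indexing, bounds checks, call dispatch, memory access) can plausibly matter as much as the arithmetic itself.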

The device is a mid-2015 MacBook Pro. Unity version is 2020.3.0f1, building to OSX standalone with Mono backend, not a development build.

The results are as follows:

  • Kotlin: 0.3658762ms
  • Unity C#: 1.74067ms (roughly 4.8 times as long). If I change M4 and V3 to be structs instead of classes, it becomes even slower: 2.51ms (roughly 6.9 times as long).

What would be the cause of such a significant difference?

TommyX
  • Well, there's a reason things like [`Vector3`](https://learn.microsoft.com/dotnet/api/system.numerics.vector3) exist, with associated [magic](https://learn.microsoft.com/dotnet/standard/simd). The runtime is not particularly suited to optimize general number crunching when using general code (to "see through" the primitives). The Java VM also has a big head start on the .NET VM in terms of optimizing, though lots of work is certainly being done in this department. – Jeroen Mostert May 25 '21 at 18:10
  • The two implementations aren't the same. And using doubles to count time is unsafe and subject to both scaling and rounding issues. Use `Stopwatch` instead. – Panagiotis Kanavos May 25 '21 at 18:11
  • The importance of SIMD CPU instructions and types that use them like Vector3 can't be stressed enough. All CPUs since the 2000s can work on multiple floats/ints/doubles at a time. They also have optimized 3D transformations, performed by 4x4 matrix multiplication. Vector3 and Matrix4x4 aren't just for convenience. Many of their operations are actually performed using SIMD instructions – Panagiotis Kanavos May 25 '21 at 18:17
  • @PanagiotisKanavos I used https://stackoverflow.com/a/44136515/3308553 to find the `nanoTime`, which uses `Stopwatch` underneath. Also, from libGDX's source code (https://github.com/libgdx/libgdx/bl...x/src/com/badlogic/gdx/math/Vector3.java#L353), it seems like their classes aren't explicitly using SIMD for the particular function I'm calling in the benchmark. – TommyX May 25 '21 at 18:27
  • If you need to do a lot of vector and other maths rather use Unity's [JobSystem](https://docs.unity3d.com/Manual/JobSystem.html) and the [Burst Compiler](https://docs.unity3d.com/Packages/com.unity.burst@0.2/manual/index.html)! – derHugo May 25 '21 at 18:31
  • The C# class is Vector3, not V3. Using Stopwatch doesn't remove the problems caused by the unnecessary floating point operations. Just create a `Stopwatch` before the loop, call `Stop` at the end and check the `Elapsed`, `ElapsedTicks` or `ElapsedMilliseconds` property. Floating point subtraction suffers from scaling issues. So does addition. And division *definitely* does. – Panagiotis Kanavos May 25 '21 at 18:32
  • Even assuming they're not using SIMD and the JVM doesn't automatically optimize to SIMD under the covers based on the class shape (which it might very well be doing, for all I know), why *wouldn't* you compare the best possible outcomes for every scenario? If you really want to do a VM-by-VM comparison and make it meaningful, you're going to have to drop down to bytecode and then machine code in any case (for both cases). The languages aren't going to be the interesting bit. For .NET in particular [benchmarkdotnet](https://benchmarkdotnet.org/) is a gold standard (for Java, no idea). – Jeroen Mostert May 25 '21 at 18:33
  • The performance of the scripting engine doesn't matter if the expensive stuff (3D transformations, rotations etc.) is fast. That's why Unity3D exists in the first place. It's nowhere near as fast as custom gaming frameworks, but that's fast enough because the expensive stuff is performed by the engine's optimized code, not the script. – Panagiotis Kanavos May 25 '21 at 18:37
  • Besides, if you want to compare Kotlin and C#, you should use Kotlin and C# directly, not on top of Unity. Compare the performance of a .NET Core and a Kotlin application doing the same things, using their own runtimes – Panagiotis Kanavos May 25 '21 at 18:38
  • @PanagiotisKanavos Tried the benchmark with the `Stopwatch` method you suggested and the results are the same. Also, it is well-known that Unity's performance in 2D is very slow *if you use game objects only* (i.e. the standard way), forcing developers to write custom rendering solutions for anything graphically intensive such as Danmaku games. I understand that comparing Kotlin and C# directly is more fair, but in my case, I'm not trying to compare the languages on their own. – TommyX May 25 '21 at 18:52
  • @JeroenMostert Right. I understand that there are certain ways to compare the languages further. I am only worried that C# in Unity is slower than Kotlin by this much *even when the code is equivalent in this case* with no apparent cause. If this is unfortunately true (i.e. the above benchmark on its own is actually correct with no easy remedy), then I have to keep this fact in mind in future optimizations in my game. – TommyX May 25 '21 at 18:56
  • Use more accurate tools for benchmarking. For JVM it's jmh. Also see https://stackoverflow.com/questions/65127906/kotlin-measuretime-differs-from-kotlinx-benchmark-jmh-by-far – Михаил Нафталь May 25 '21 at 22:05
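
The timing concerns raised in the comments (accumulating elapsed time as integers, warming up first, and not relying on a single average) can be approximated with a small hand-rolled harness. The sketch below is an assumption-laden illustration: medianMillis is a hypothetical helper, not part of any library, and a dedicated tool such as JMH remains the more trustworthy choice on the JVM.

```kotlin
// Hypothetical helper: runs `block` a few times to warm up the JIT, then times
// `runs` further executions and returns the median duration in milliseconds.
fun medianMillis(warmup: Int, runs: Int, block: () -> Unit): Double {
    require(warmup >= 0) { "warmup must not be negative" }
    require(runs > 0) { "runs must be positive" }

    repeat(warmup) { block() }

    // Keep raw samples as Long nanoseconds; convert to Double only once at the end.
    val samples = LongArray(runs)
    for (i in 0 until runs) {
        val start = System.nanoTime()
        block()
        samples[i] = System.nanoTime() - start
    }

    samples.sort()
    val mid = runs / 2
    val medianNanos = if (runs % 2 == 1) {
        samples[mid].toDouble()
    } else {
        (samples[mid - 1] + samples[mid]) / 2.0
    }
    return medianNanos / 1_000_000.0
}
```

With the question's code this would be called as `medianMillis(10, 10) { benchmark(a, b) }`. The median is less sensitive than the mean to a single run disturbed by garbage collection or scheduling.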

1 Answer


C# can perform better here if the work is structured differently.

First, most code in MonoBehaviour classes runs on a single thread, so it can take a very long time to complete, especially when a lot of math is involved.

Second, Unity does more than compile and run your code: it also renders, performs calculations to report the game's performance in the editor, and runs many background processes. Some RAM and much of the CPU are spent on the engine and its editor, so your code cannot get the full power of the hardware.

So, to get the most out of Unity and C#:

Try Unity's DOTS (Data-Oriented Technology Stack), which is a game changer. Its Burst compiler works by compiling a subset of the C# language, known as High-Performance C# (HPC#), to make efficient use of a device’s power by applying advanced optimizations built on top of the LLVM compiler framework. Burst is great for exploiting hidden parallelism in your applications.

DOTS's primary feature is running game code on multiple threads and making fuller use of the hardware through the JobSystem, ComponentSystem, and more.

Many familiar types have Burst-compatible counterparts under new names in DOTS: in place of Vector3 there are float3 and float4, which can improve performance drastically.
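
The data-oriented idea behind DOTS can be illustrated independently of Unity's API. The Kotlin sketch below uses a hypothetical helper, transformAll, and assumes the same column-major matrix layout as the question's benchmark. It stores all vectors in one flat FloatArray instead of a list of separate objects, so the loop walks contiguous memory. It is only an illustration of the layout, not Burst or JobSystem code, and makes no claim about the speedup it would give.

```kotlin
// Hypothetical helper, not part of Unity, DOTS, or libGDX.
// `points` holds x0, y0, z0, x1, y1, z1, ... in one contiguous array;
// `m` is a 4x4 matrix as 16 floats in column-major order (assumption).
fun transformAll(m: FloatArray, points: FloatArray) {
    require(m.size == 16) { "matrix must have 16 values" }
    require(points.size % 3 == 0) { "points must hold whole xyz triples" }

    var i = 0
    while (i < points.size) {
        // Read the inputs before overwriting any of them.
        val x = points[i]
        val y = points[i + 1]
        val z = points[i + 2]
        points[i] = x * m[0] + y * m[4] + z * m[8] + m[12]
        points[i + 1] = x * m[1] + y * m[5] + z * m[9] + m[13]
        points[i + 2] = x * m[2] + y * m[6] + z * m[10] + m[14]
        i += 3
    }
}
```

Because each triple is independent of the others, a layout like this also splits naturally into chunks for worker threads, which is the kind of work the JobSystem is meant to schedule.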