
I have a large array of small structs that I would like to serialize to file.

The struct:

public struct Voxel
{
  public byte density;
  public byte material;
}

While quite a few serialization libraries handle general serialization very efficiently, I suspect we can do even better in both on-disk size and serialization/deserialization speed, given that we know and control this struct.

This struct is pretty much final, so we can do without the fancy versioning that many serialization libraries support.

From my search, it seems like Marshal might be a decent way to do such a thing, but I don't want to worry about things like Endianness.

So I wonder: what are some good ways to serialize such data, assuming the array size can be anywhere from 100 to 1 million elements?

(Also assuming we are not afraid to store them in different formats such that RLE can reduce the on-disk size even more.)

Shams Tech
bitinn
  • "endianness" doesn't actually apply here - that only becomes an issue if you have multi-byte values; for two separate `byte` values: you're fine – Marc Gravell Jun 08 '21 at 08:50

2 Answers


Assuming you're using a recent framework: spans are your friend. As a trivial way of writing them:

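Voxel[] arr = ...
var bytes = MemoryMarshal.Cast<Voxel, byte>(arr);
// File.Create truncates an existing file; File.OpenWrite would leave stale
// trailing bytes behind if the previous file was longer.
using (var s = File.Create("some.path"))
{
    s.Write(bytes);
}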

Reading is a little harder, but not much:

Voxel[] arr;
using (var s = File.OpenRead("some.path"))
{
    int len = checked((int)(s.Length / Unsafe.SizeOf<Voxel>())), read;
    arr = new Voxel[len];
    var bytes = MemoryMarshal.Cast<Voxel, byte>(arr);
    while (!bytes.IsEmpty && (read = s.Read(bytes)) > 0)
    {
        bytes = bytes.Slice(read);
    }
}

Note that this presumes you want to start with a vector (Voxel[]); if you're happy to take a further step down the rabbit hole, "memory mapped files" are also an option here, again using Span<T> (or Memory<T>) - then it becomes truly zero copy (your live data is the file, via OS magic).
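To make the memory-mapped option a little more concrete, here is a minimal sketch rather than a definitive implementation. It assumes a runtime where `Span<T>` is available, a project that allows `unsafe` code, and a non-empty file that already holds raw `Voxel` data; `MappedVoxels` and `SetMaterial` are hypothetical names made up for the illustration.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public struct Voxel
{
    public byte density;
    public byte material;
}

public static class MappedVoxels
{
    // Hypothetical helper: overwrites the material of every voxel in place,
    // working directly on the mapped file instead of a separate Voxel[] copy.
    public static unsafe void SetMaterial(string path, byte material)
    {
        // Assumption: the file is non-empty (empty files cannot be mapped).
        long length = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(0, length))
        {
            byte* ptr = null;
            view.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
            try
            {
                int count = checked((int)(length / sizeof(Voxel)));
                // The span is a window onto the mapped pages, not a copy.
                var voxels = new Span<Voxel>(ptr + view.PointerOffset, count);
                for (int i = 0; i < voxels.Length; i++)
                {
                    voxels[i].material = material;
                }
            }
            finally
            {
                view.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
    }
}
```

The span must not be used after `ReleasePointer` or after the view is disposed, which is why all the work happens inside the `try` block.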

Marc Gravell
  • Thx, unfortunately I am still on .net standard 2.0, so the problem is whether we can use these API with backport dlls, and not conflict with existing package dependencies. If there are some alternatives in 2.0 land I would love to hear. – bitinn Jun 08 '21 at 15:44
  • This feels silly but I couldn't find the right DLL to get `FileStream.Write(ReadOnlySpan)` API to work on .Net Standard 2.0 – bitinn Jun 08 '21 at 16:04
  • @bitinn it is in .net core 3+, and netstandard 2.1 IIRC; although if it was me, I'd target .net 5 or above – Marc Gravell Jun 08 '21 at 16:09
  • this is where I reveal my platform is Unity engine, so things are a bit complicated. – bitinn Jun 08 '21 at 16:13
  • @bitinn and this is when I say it would have been great to include that in the question :) a fixed/unsafe pointer coerce should work, but you'll need to copy the data to a byte[] to make it work :/ – Marc Gravell Jun 08 '21 at 22:52
  • I found a solution that makes your suggestion work in Unity, sharing it as an answer :) – bitinn Jun 09 '21 at 03:39
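
For readers stuck on plain .NET Standard 2.0 without the span APIs, the "copy the data to a byte[]" route mentioned in the comments might look roughly like the sketch below. It swaps the `fixed`/unsafe pointer for a pinned `GCHandle` plus `Marshal.Copy`, so it needs no `unsafe` code; `VoxelBytes`, `ToBytes` and `FromBytes` are hypothetical names, and the approach relies on `Voxel` containing only blittable fields.

```csharp
using System;
using System.Runtime.InteropServices;

public struct Voxel
{
    public byte density;
    public byte material;
}

public static class VoxelBytes
{
    // Copies the raw memory of the voxel array into a new byte[].
    public static byte[] ToBytes(Voxel[] voxels)
    {
        var bytes = new byte[voxels.Length * Marshal.SizeOf<Voxel>()];
        // Pinning works because Voxel contains only blittable fields.
        GCHandle handle = GCHandle.Alloc(voxels, GCHandleType.Pinned);
        try
        {
            Marshal.Copy(handle.AddrOfPinnedObject(), bytes, 0, bytes.Length);
        }
        finally
        {
            handle.Free();
        }
        return bytes;
    }

    // Rebuilds a voxel array from bytes produced by ToBytes.
    public static Voxel[] FromBytes(byte[] bytes)
    {
        var voxels = new Voxel[bytes.Length / Marshal.SizeOf<Voxel>()];
        GCHandle handle = GCHandle.Alloc(voxels, GCHandleType.Pinned);
        try
        {
            // Copy only whole voxels; any trailing partial voxel is ignored.
            Marshal.Copy(bytes, 0, handle.AddrOfPinnedObject(),
                voxels.Length * Marshal.SizeOf<Voxel>());
        }
        finally
        {
            handle.Free();
        }
        return voxels;
    }
}
```

The resulting `byte[]` can be passed to `File.WriteAllBytes` or `Stream.Write(byte[], int, int)`, at the cost of one full extra copy of the data.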

For future readers: I have figured out a way to use @Marc Gravell's proposed solution within the Unity engine, which only supports .NET Standard 2.0.

The trick was to get the high performance package from Microsoft:

https://learn.microsoft.com/en-us/windows/communitytoolkit/high-performance/introduction

This means that you can use it from anything from UWP or legacy .NET Framework applications, games written in Unity, cross-platform mobile applications using Xamarin, to .NET Standard libraries and modern .NET Core 2.1 or .NET Core 3.1 applications.

It supports Stream.Write and Stream.Read with Span<T> through extension methods:

https://learn.microsoft.com/en-us/dotnet/api/microsoft.toolkit.highperformance.extensions.streamextensions?view=win-comm-toolkit-dotnet-6.1

I also compared on-disk size of binary serialization:

  • In memory: 8KB (16x16x16 array)
  • MessagePack format: 97KB
  • MemoryMarshal.Cast: 8KB

So it's working as expected!

bitinn
  • I've [looked it up](https://github.com/windows-toolkit/WindowsCommunityToolkit/blob/main/Microsoft.Toolkit.HighPerformance/Extensions/StreamExtensions.cs), and: this library works by leasing an array and performing a mem-copy from the input to the array; two points with this: 1) it will be slower than necessary, due to the extra mem-copy (although I don't think that's avoidable in your case), and 2) it uses buffers based on your input size, but note that the array pool is *capped* - beyond a certain size it simply allocates *every time*, so for large inputs this could cause allocation problems – Marc Gravell Jun 09 '21 at 06:32
  • I'm not saying "don't use it" - however, I would recommend adding some slice/loop code, so that you don't call `Read` or `Write` with the entire slice - instead, only give it, say, 64KiB of your data each call (and call in a loop) – Marc Gravell Jun 09 '21 at 06:34
  • @MarcGravell Thx for the tips, I wonder how you feel about this approach instead? It seems like I cannot avoid some allocation either way in 2.0 land? https://stackoverflow.com/questions/3278827/how-to-convert-a-structure-to-a-byte-array-in-c – bitinn Jun 09 '21 at 06:41
  • Also I am now reading more about ArrayPool limit, does seem a bit low for my use case (1024x1024): https://adamsitnik.com/Array-Pool/ – bitinn Jun 09 '21 at 06:50
  • "about this approach instead?" - happy to opine, but: what aspect are you asking about? there's no need to use the `Marshal` step - span casts do that for free; you *will* need a `byte[]` at some point for the `Read`/`Write` APIs - and the array pool is fine for that *as long as* you keep the size reasonable - but: it doesn't need to fit your data all at once! you can use a modest buffer and loop very easily – Marc Gravell Jun 09 '21 at 08:52
  • @MarcGravell I went with your solution but with a smaller Span length to make sure we don't allocate with Array. – bitinn Jun 09 '21 at 11:03
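
The slice-and-loop idea from the comments above can be sketched as follows. This is one possible reading of that advice, not the exact code either commenter used: it assumes `Span<T>` and `MemoryMarshal` are available (on .NET Standard 2.0, typically via the System.Memory package), uses only the `byte[]`-based `Stream.Read`/`Stream.Write` overloads, and the names `ChunkedVoxelIO` and `ChunkBytes` are hypothetical.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

public struct Voxel
{
    public byte density;
    public byte material;
}

public static class ChunkedVoxelIO
{
    // Hypothetical chunk size, following the 64 KiB suggestion in the comments.
    private const int ChunkBytes = 64 * 1024;

    public static void Write(Stream stream, Voxel[] voxels)
    {
        // Reinterpret the voxel array as bytes; no copy happens here.
        ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(new ReadOnlySpan<Voxel>(voxels));
        byte[] buffer = new byte[Math.Min(ChunkBytes, bytes.Length)];
        while (!bytes.IsEmpty)
        {
            int n = Math.Min(buffer.Length, bytes.Length);
            bytes.Slice(0, n).CopyTo(buffer);   // one modest copy per chunk
            stream.Write(buffer, 0, n);
            bytes = bytes.Slice(n);
        }
    }

    public static Voxel[] Read(Stream stream, int count)
    {
        var voxels = new Voxel[count];
        Span<byte> bytes = MemoryMarshal.AsBytes(new Span<Voxel>(voxels));
        byte[] buffer = new byte[Math.Min(ChunkBytes, bytes.Length)];
        while (!bytes.IsEmpty)
        {
            int n = stream.Read(buffer, 0, Math.Min(buffer.Length, bytes.Length));
            if (n <= 0) throw new EndOfStreamException("Stream ended before all voxels were read.");
            new ReadOnlySpan<byte>(buffer, 0, n).CopyTo(bytes);
            bytes = bytes.Slice(n);
        }
        return voxels;
    }
}
```

Because the buffer never exceeds 64 KiB, it could also be rented from `ArrayPool<byte>.Shared` without running into the pool's size cap discussed above.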