4

Is it possible to create a constant dictionary or array at compile time in C#? If not, is it possible to do this in C or C++ and compile to a dll that can be pulled into a C# project?

I need to create a dictionary with about 130M members, and generating it reasonably fast is consuming a lot of my time. I imagine the fastest way to do it would be to programmatically generate a huge text file and compile it to a constant array one time and then never pay that price again. I'm just not finding this feature in C#. I feel like I could do this in C.

Edit 1: This takes up about 1.5GB in RAM, if I'm not mistaken. Many useful comments have exposed the fact that it would take up at least this much on disk as an exe, which is probably not deployable. However, I am still interested in the answer for problems that are less enormous.

Edit 2: Understanding the application may answer a lot of the questions. This is a monte carlo simulator. A card is one bit, from 2clubs =0x1 to Aspades = (0x1<<52). A hand is a ulong with 7 bits set, indicating the cards in the hand.

This storage scheme allows me to do bitwise operations in lieu of cumbersome logic. For instance, to detect a flush, I can mask all hearts and then do popcount > 4, then repeat for clubs, diamonds, spades.

Then you have to determine the value of a hand. But if you are in the middle of permuting the cards, this is trivial, because the high card is given by the permutation step. If you are not in the middle of a permutation then determining the rank is much more cumbersome. This is why generating values on demand is undesirable. You have to pay the cost somewhere, but the cost is much higher on the fly than it is up front.

The end goal is to get them into a dictionary, so that determining the better of 2 hands is a dictionary lookup.

The fastest way to do this (that I've come up with) is store all hands/ranks in a tuplet array, then do ToDictionary(). The downside is that arrays obviously have no duplicate checking. This means that, in the flush example, I have to check that there is no straight flush before pushing to the array, whereas in a dictionary, I can just put the straight flushes in first, then try to push the flushes in. So you pay the cost of checking for straight flushes or you pay the cost of pushing 1 item at a time into a dictionary.

The dictionary keys are all possible 7 card poker hands. The values are the relative strength of each hand, e.g., straight flush to ace has value zero.

  • 2
    Does this answer your question? [Declare a const array](https://stackoverflow.com/questions/5142349/declare-a-const-array) – Wyck Sep 17 '22 at 05:08
  • Why not make a lazy Dictionary that generates the Key's corresponding value on first access? – CSharpie Sep 17 '22 at 05:45
  • 1
    Are you using the [constructor](https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.dictionary-2.-ctor#system-collections-generic-dictionary-2-ctor(system-int32)) that allows to configure the `capacity` of the dictionary? This could help at speeding up the initialization. – Theodor Zoulias Sep 17 '22 at 06:53
  • What is the type of the keys and values of the dictionary? – Theodor Zoulias Sep 17 '22 at 07:17
  • 1
    Does your machine have enough RAM to store a dictionary with 130 million entries, without resorting to [memory paging](https://en.wikipedia.org/wiki/Memory_paging)? – Theodor Zoulias Sep 17 '22 at 07:21
  • 2
    What is the original source you pick the data from? A database or a text file or? – PMF Sep 17 '22 at 07:54
  • Wyck, No. It is my understanding that static readonly arrays are populated at run time. – Major Major Sep 17 '22 at 15:09
  • Theodore, Yes I am. This i helpful. But it dosn't get me to where I want to be – Major Major Sep 17 '22 at 15:10
  • Theodor, Great question on the RAM. The development machine has 16 GB. production environment will be azure web services. So yes, probably, in production. It seems pretty close in development. The dictionary is (ulong, int) – Major Major Sep 17 '22 at 15:15
  • PMF: I this worked, it would be a programmatically generated source code file. – Major Major Sep 17 '22 at 15:16

2 Answers2

3

I'd rather try the following:

Create the dictionary in memory once then serialize it and save it to a file using this.

Then you can simply load the dictionary from the file when your application is starting. That way you won't end up with a 130 MB exe. If you really want it to be part of your executable, you can still embed it as a binary resource.

StefanFFM
  • 1,526
  • 1
  • 14
  • 25
  • 2
    Judging from the [implementation](https://github.com/dotnet/runtime/blob/58f5c95085b2eb56800b87f466183b1448d33b3d/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs#L869), deserializing a dictionary shouldn't be much faster that constructing it from scratch. The serialized data do not include the hash of each entry, so each of the 130 million keys will have to be hashed at runtime. If the keys are (for example) long strings, this could be quite expensive. – Theodor Zoulias Sep 17 '22 at 07:00
  • 1
    The post also mentions ProtoBuf which appears to be faster. Haven't tried it myself though. – StefanFFM Sep 17 '22 at 09:40
  • I'd have no problem with a 130MB exe. But you are still right, because it's actually a 1.5 GB exe file--and a 20GB or so text file. It did not register with me that the exe would (obviously) be so big that a) it would be hard to deploy and b) the disk read would be a monster bottleneck. As far as deserializing, I get that that makes more sense, architecture-wise. But it's very slow. – Major Major Sep 17 '22 at 15:35
1

I made a few measurements to evaluate the cost of initializing an in-memory database of this size.

Using a Dictionary<ulong, int>:

Dictionary<ulong, int> dict = new(130_000_000);
for (int i = 0; i < 130_000_000; i++) dict.Add((ulong)i, i);

Result:

Duration: 20,030 msec, Allocated: 3,640,002,808 bytes

Using a SortedList<ulong, int>:

SortedList<ulong, int> dict = new(130_000_000);
for (int i = 0; i < 130_000_000; i++) dict.Add((ulong)i, i);

Result:

Duration: 32,979 msec, Allocated: 1,560,003,920 bytes

Using two arrays (ulong[] and int[]), filling them with pre-sorted data, and intending to use them with the Array.BinarySearch<T> method:

ulong[] keys = new ulong[130_000_000];
int[] values = new int[130_000_000];
for (int i = 0; i < 130_000_000; i++) keys[i] = (ulong)i;
for (int i = 0; i < 130_000_000; i++) values[i] = i;

Result:

Duration: 1,707 msec, Allocated: 1,560,000,840 bytes

My PC is quite old and has only 4GB RAM, so these experiments could yield significantly different results in a higher-end machine.

My observations are:

  • The Dictionary has the higher cost regarding the required memory, and the duration of initialization is significant. It should be the most performant option though, for doing searches.

  • The SortedList has low memory cost, but even higher initialization cost. Searching involves binary search, so it should be as performant as the next option.

  • The ulong[]/int[] array pair has low memory cost and low initialization cost. Doing a binary search on a 130,000,000-elements array requires at most 28 ulong comparisons, so it should be slower than the Dictionary, but not terribly slow.

My conclusion is that you might want to invest some time for researching and implementing a custom data structure, or looking for alternative options like in-memory databases. The data structures that are built-in the .NET standard libraries are probably inadequate for such a heavy load, combined with a fast-initialization requirement.

Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
  • Theodor, Thanks a lot for doing this. Adding 1 entry at a time to a dictionary is very very painful. Building up an array and doing todictionary is an order of magnitude faster. It uses about 2x the memory, but thearray should eb GC'd eventually, right? – Major Major Sep 19 '22 at 23:33
  • 1
    @MajorMajor yes, the array will be garbage collected, but based on [the implementation](https://github.com/dotnet/runtime/blob/001c1b186cbb13582556ce541e613464fa5a185f/src/libraries/System.Linq/src/System/Linq/ToCollection.cs#L35) of the `ToDictionary` method it shouldn't be any faster than adding one at a time. Basically it does the same thing internally. – Theodor Zoulias Sep 20 '22 at 06:01
  • Not sure why, but adding more or less random ulongs to a dictionary takes an order of magnitude longer than adding the ordered ulongs as you generated them. Just adding all of the possible hands, which take about 1.5 seconds to enumerate, takes so long that I was never able to let it finish--it was on track for 10-30 minutes. HOWEVER, dict.TryAdd(ulong.GetHashCode(), int) is 10x faster than dict.TryAdd(ulong, int), which I think is probably the solution. – Major Major Oct 04 '22 at 00:39