
First, a little background: I enjoy working on Project Euler problems (https://projecteuler.net/archives), but many of them require a ton of heavy computation, so I try to save known constants in memory so they don't have to be recalculated every time. These include things like n!, nPr, nCr, and lists of primes. For the purposes of this question let's just stick with primes, because any solution for those can be easily ported to the others.

The question: Let's say I want to save the first 1,000,000 primes in memory for repeated access while doing heavy computation. The 1,000,000th prime is 15,485,863, so ints will do just fine here. I need to save these values in a way such that access is O(1), because they will be accessed a lot.

What I've tried so far: Clearly I can't put all 1,000,000 in one .cs file because Visual Studio throws a fit, so I've been trying to break it up into multiple files using a partial class and a 2-D List<List<int>>:

public partial class Primes
{
    // primes below 1,000,000; one such list per file
    public static readonly List<int> _primes_1 = new List<int>
    {
        2, 3, ... 999983
    };
}

So _primes_1 has the primes less than 1,000,000, _primes_2 has the primes between 1,000,000 and 2,000,000, etc., 15 files' worth. Then I put them together:

public partial class Primes
{
    // assembled from the per-file lists above; bucket i holds the primes in [i * 1,000,000, (i + 1) * 1,000,000)
    public static readonly List<List<int>> _primes = new List<List<int>>()
    {
        _primes_1, _primes_2, _primes_3, _primes_4, _primes_5,
        _primes_6, _primes_7, _primes_8, _primes_9, _primes_10,
        _primes_11, _primes_12, _primes_13, _primes_14, _primes_15
    };
}

This methodology does work: it is easy to enumerate through the list, and IsPrime(n) checks are fairly simple as well (binary search within the appropriate bucket). The big downfall of this methodology is that VS starts to freak out because each file has ~75,000 ints in it (~8,000 lines depending on spacing). In fact, much of my editing of these files has to be done in Notepad++ just to keep VS from hanging/crashing.
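For reference, the lookup is roughly as below; this is a simplified sketch of the idea rather than my exact code, assuming the per-million bucketing described above.

public partial class Primes
{
    // Simplified sketch of the binary-search lookup against the bucketed lists above.
    public static bool IsPrime(int n)
    {
        if (n < 2) return false;
        int bucket = n / 1000000;                   // bucket i covers [i * 1,000,000, (i + 1) * 1,000,000)
        if (bucket >= _primes.Count) return false;  // beyond the precomputed range
        // BinarySearch returns a non-negative index only when the value is present
        return _primes[bucket].BinarySearch(n) >= 0;
    }
}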

Other things I've considered: I originally read the numbers in from a text file, and I could do that in the program, but clearly I would want to do that at startup and then just have the values available. I also considered dumping them into SQL, but again, eventually they need to be in memory. For the in-memory storage I considered memcached, but I don't know enough about it to know how efficient its lookups are.

In the end, this comes down to two questions:

  1. How do the numbers get in to memory to begin with?

  2. What mechanism is used to store them?

Spending a little more time in spin up is fine (within reason) as long as the lookup mechanism is fast fast fast.

Quick note: Yes, I know that if I only do 15 files as shown then I won't have all 1,000,000, because 15,485,863 would fall in file 16. That's fine; for our purposes here this is good enough.

nurdyguy
  • Having this as a list like that is a bad idea since you are now storing the data twice: once in the program code, and then once in memory when you load the data. – Why do you need all the data in memory anyway? – poke Mar 11 '18 at 00:48
  • Well List is not O(1). – paparazzo Mar 11 '18 at 00:58
  • You can have O(1) and use less space with [`BitArray`](https://msdn.microsoft.com/en-us/library/system.collections.bitarray) and [`BinaryFormatter`](https://msdn.microsoft.com/en-us/library/system.runtime.serialization.formatters.binary.binaryformatter#Examples) for storing it in a file (a sketch of this idea follows these comments) – Slai Mar 11 '18 at 01:02
  • I'm voting to close this question as off-topic because it is too broad, but also because it is essentially a programming exercise to attract all those who like to demonstrate their cleverness, ignoring [ask] in the process, as the current upvotes clearly demonstrate – Mar 11 '18 at 01:26
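A rough sketch of that `BitArray` idea (the sieve code and variable names here are illustrative, not taken from the comment; the 15,485,863 limit is the 1,000,000th prime from the question): build a sieve of Eratosthenes once, then primality is a single O(1) index into the bit array, and the filled `BitArray` could be written to a file with `BinaryFormatter` so it doesn't have to be rebuilt on every run.

using System;
using System.Collections;

int limit = 15485863;                        // the 1,000,000th prime
var composite = new BitArray(limit + 1);     // composite[n] == true means n is not prime
for (int i = 2; i * i <= limit; i++)
{
    if (composite[i]) continue;              // i is already known to be composite
    for (int j = i * i; j <= limit; j += i)
        composite[j] = true;                 // mark multiples of i
}
Func<int, bool> isPrime = n => n >= 2 && n <= limit && !composite[n];  // O(1) check

At roughly 15.5 million bits (~1.9 MB) the bit array is also far smaller than a list of ints covering the same range.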

1 Answer


Bring them in from a single text file at startup. This data shouldn't be in source files (as you are discovering).

Store them in a HashSet<int>, so for any number n, isPrime = n => primeHashSet.Contains(n). This will give you your desired O(1) complexity.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// filePath points at the text file of primes; every run of digits in it is parsed into the set
HashSet<int> primeHashSet = new HashSet<int>(
    File.ReadLines(filePath)
        .AsParallel() //maybe?
        .SelectMany(line => Regex.Matches(line, @"\d+").Cast<Match>())
        .Select(m => m.Value)
        .Select(int.Parse));
Predicate<int> isPrime = primeHashSet.Contains;
bool someNumIsPrime = isPrime(5000); //for example

On my (admittedly fairly snappy) machine, this loads in about 300ms.

spender
  • Although both approaches will work, using a `Predicate` as opposed to a `Func` is more appropriate in this specific case. – Ousmane D. Mar 11 '18 at 00:37
  • @Aominè I tend to avoid most defined delegates in favour of `Func` and `Action` because of the layer of indirection when it comes to reading and understanding the code. `Func` and `Action` advertise their intent in a very readable way (unless the parameter list is long). Not sure which way I fall with `Predicate`, but I'll give it a go here :) – spender Mar 11 '18 at 00:50
  • Note that initializing `HashSet` from an `IEnumerable` with that many items may be rather expensive. You might do better by doing `.ToArray()` first. – poke Mar 11 '18 at 00:51
  • @spender Right, a matter of preference I guess. ;) – Ousmane D. Mar 11 '18 at 00:57
  • @poke I just measured... less than 3% speed gain. I feel cheated :) – spender Mar 11 '18 at 01:00
  • Would it not be a speed advantage if they were constants? Of course that assumes that you can even do constants on such a scale - possibly even constant arrays. I am not sure that even works. – Christopher Mar 11 '18 at 01:05
  • @spender Hmm, I expected that to have a bigger impact. Good to know! – poke Mar 11 '18 at 01:11
  • As for using constants: A lot of time is spent just doing runtime sanity checks on the indexers for collections. If one could move parts of that to the compiler level, it might be some speedup. But in the end indexers are the one thing one can only really check at runtime. @poke & spender: Doing a meaningful benchmark in a garbage-collected, JIT-compiled runtime is not that easy. It is something I would most likely mess up myself. – Christopher Mar 11 '18 at 02:07
  • @Christopher Do you know [BenchmarkDotNet](http://benchmarkdotnet.org/)? – Anyway, I ended up benching it myself and got to 8% time speedup with a `.ToArray()`. So not really worth it. – poke Mar 11 '18 at 02:16
  • What if we suppressed the bounds checks on the indexes? Back in my old native C++ days, when arrays were realised via naked pointers, we had to pick the bounds of our for loops so we would not accidentally cause a buffer overflow. Like pretty much the entirety of the Windows XP development team. I fully support the bounds checks for average programmers; handling naked pointers is simply not for every programmer, or something I would consider doing in any normal case. But if performance really matters, just turning it off for a spell might be a good idea. – Christopher Mar 11 '18 at 03:29
  • Of course I just read that the JIT/CLR can do bounds check elimination (https://stackoverflow.com/questions/9304726/array-bounds-check-elimination-in-the-clr), so how helpful that is for performance is uncertain. – Christopher Mar 11 '18 at 03:33
  • @spender While the `isPrime` will certainly get utilized, it is actually less of an issue than just regular indexed lookup time. I need something which I can iterate over as well as do simple lookups on, à la `_primes[12]` to get the 13th prime. How efficient would this be for lookups? – nurdyguy Mar 11 '18 at 05:24
  • Bad. If indexed lookup is what you need, just `.ToArray()` the sequence fed to the `HashSet` constructor. – spender Mar 11 '18 at 11:59
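To make that last suggestion concrete, here is a rough sketch (the variable names are mine, and it assumes the same filePath and using directives as the snippet in the answer): materialize the parsed primes into an array once for indexed access and enumeration, then build the HashSet from that array for membership checks.

int[] primesByIndex = File.ReadLines(filePath)
    .SelectMany(line => Regex.Matches(line, @"\d+").Cast<Match>())
    .Select(m => int.Parse(m.Value))
    .ToArray();                                       // O(1) indexed access and easy enumeration

HashSet<int> primeSet = new HashSet<int>(primesByIndex);  // O(1) membership tests

int thirteenthPrime = primesByIndex[12];              // indexed lookup, e.g. the 13th prime
bool someNumIsPrime = primeSet.Contains(5000);        // same O(1) Contains as in the answer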