0

I want to generate a large amount (10 TB) of seemingly random, but predictable numbers. The generation speed should exceed that of fast SSDs, so I want 3000 MB/s to 4000 MB/s.

After the file has been written, the numbers will be read again and generated again, so that they can be compared. The total program is supposed to check disks.

At the moment I'm thinking of hashes. The data to be hashed is just a 8 byte number (ulong) for the predictability. So in the binary file it looks like this

<32 bytes of SHA256(0)> <32 bytes of SHA256(1)> ...

I don't think I can use a Random number generator with a seed, because I can't tell the random number generator to generate the nth number. But I can tell the SHA256 algorithm to calculate SHA256(n).

I made a test with 128 MB of data using the SHA256 algorithm like this:

Parallel.For(0, 128 * 1024 * 1024 / 32,     // 128 MB / length of the hash
    a => {
        var sha = SHA256.Create();
        sha.Initialize();
        var ba = new byte[8];
        ba[0] = (byte)((long)a >> 0 & 0xFF);
        ba[1] = (byte)((long)a >> 8 & 0xFF);
        ba[2] = (byte)((long)a >> 16 & 0xFF);
        ba[3] = (byte)((long)a >> 24 & 0xFF);
        ba[4] = (byte)((long)a >> 32 & 0xFF);
        ba[5] = (byte)((long)a >> 40 & 0xFF);
        ba[6] = (byte)((long)a >> 48 & 0xFF);
        ba[7] = (byte)((long)a >> 56 & 0xFF);
        var hash = sha.ComputeHash(ba);
        // TODO: aggregate the byte[]s, stream to file
    }
);

Like that, the throughput is only 95 MB/s on my Ryzen 7 2700X 8 core processor running at 4,08 GHz.

Any chance of speeding this up to 4000 MB/s?

Thomas Weller
  • 55,411
  • 20
  • 125
  • 222
  • *"seemingly random, but predictable"* -- What do you mean by that? What are the desirable properties of those numbers? How are you going to use them? – Theodor Zoulias Jan 03 '22 at 22:20
  • 1
    Does it have to be hashes, as opposed to any other pseudo-random numbers? – harold Jan 03 '22 at 22:47
  • "Seemingly random, but predictable" - what's the motivation for this? You can use `Random` with a constant seed to generate numbers but you might as well just have a text file with 10TB of pre-created random numbers. – asaf92 Jan 04 '22 at 08:08
  • @TheodorZoulias: I've added the motivation to the question. It shall be a disk test program. – Thomas Weller Jan 04 '22 at 10:16
  • @harold: it could be any other pseudo random number. However, I need to be able to calculate the n-th number without calculating all other numbers before. – Thomas Weller Jan 04 '22 at 10:17
  • @asaf92: I've updated the question. I don't think Random with a seed works, because I need to be able to calculate the n-th number on its own. – Thomas Weller Jan 04 '22 at 10:18
  • How about storing the numbers as they are, unhashed? Would this make the disc-testing procedure less reliable? – Theodor Zoulias Jan 04 '22 at 10:32
  • @TheodorZoulias: yes. Given I store 64 bit numbers (in order to reach 10TB disks and above) and I'm writing to a 4 GB SD card (32 bit), the upper half of the data would always be 0 and the program would not detect broken or shorted wires in that part of the data. – Thomas Weller Jan 04 '22 at 10:41
  • How about creating `long` numbers [from two `int`s](https://stackoverflow.com/questions/827252/c-sharp-making-one-int64-from-two-int32s)? Or simply by using the same `int` twice? `long number = (long)i << 32 | (long)(uint)i;` – Theodor Zoulias Jan 04 '22 at 11:08
  • I'm no expert but as far as I understand, one of the features of a good hashing algorithm is to work relatively slow, because it's usage in cryptography, the slower it runs the harder is using it for brut force attacks. You really should consider a different way to generate your pseudo-random values. – Zohar Peled Jan 04 '22 at 14:20
  • Check out [this SO answer](https://stackoverflow.com/a/5385501/3094533) by Ani. It's simple, easy, and I think can give you almost exactly what you're looking for. (Instead of using just the index, you can use the index + your own seed if you want to) – Zohar Peled Jan 04 '22 at 14:29
  • 1
    @ZoharPeled: thanks for the suggestion. Random numbers with a seed is just generating numbers at ~150 MB/s, – Thomas Weller Jan 04 '22 at 16:08
  • FYI: in the meanwhile I have implemented XXH64 and optimized for small data (8 bytes). That gives me ~1600 MB/s to 2000 MB/s. – Thomas Weller Jan 04 '22 at 16:11
  • XXH64 @ 3200+ MB/s now. – Thomas Weller Jan 04 '22 at 20:01
  • Assuming that the quality of the hash is not a top priority, you might find some even faster hashing algorithms [here](https://stackoverflow.com/questions/664014/what-integer-hash-function-are-good-that-accepts-an-integer-hash-key "What integer hash function are good that accepts an integer hash key?") or [here](https://gist.github.com/badboy/6267743 "Integer Hash Function"). – Theodor Zoulias Jan 04 '22 at 22:28

1 Answers1

6

I don't think is is possible to reach that speed without using a gpu. But here are few things you can do to gain some performance:

  1. You can utilize the localInit of Parallel.For to create the SHA256 object, as well as a byte array of size 8 to hold the data to be hashed, once per task.
  2. There is no need to explicitly call Initialize.
  3. Instead of converting the long to byte array manually, one byte at a time, you can use pointers or the Unsafe class to set the bytes all at once.
  4. Pre-allocate the array of bytes that will hold the hash and use TryComputeHash instead of ComputeHash since it allows passing a span for the output.

Here is a code implementing the mentioned above:

Parallel.For(0, 128 * 1024 * 1024 / 32,     // 128 MB / length of the hash
  () => (SHA256.Create(), new byte[8], new byte[32]),
  (a, state, tuple) =>
  {
    Unsafe.As<byte, long>(ref tuple.Item2[0]) = a;
    tuple.Item1.TryComputeHash(tuple.Item2, tuple.Item3, out _);
    var hash = tuple.Item3;
    // TODO: aggregate the byte[]s, stream to file
    return tuple;
  },
  tuple => tuple.Item1.Dispose()
);
Sohaib Jundi
  • 1,576
  • 2
  • 7
  • 15