
I'm running a database application that writes data synchronously to disk, so I'm looking for the best possible disk throughput. GCP's local SSDs are supposed to provide the best performance (both IOPS and MB/s). However, when I benchmark synchronous database writes, a zonal persistent SSD achieves significantly better throughput than the local SSD. Strangely, a single local SSD also performs better than a RAID 0 configuration of four of them.

To test the performance I ran a benchmark consisting of a single thread that, in a loop, creates a transaction, performs a random 4 KB write, and commits (with an fsync).
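The actual benchmark is a small custom program, but roughly the same workload can be approximated with a stock tool such as fio (shown here only as a sketch; the target directory is an example):

# Approximation of the benchmark: one thread, 4 KB random writes,
# with an fsync after every write (commit-per-transaction pattern).
fio --name=sync-4k-randwrite \
    --directory=/mnt/disks/stable-store \
    --rw=randwrite --bs=4k --size=1G \
    --ioengine=sync --fsync=1 \
    --numjobs=1 --runtime=60 --time_based --group_reporting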

The persistent zonal SSD was 128 GB, while the local SSD setup consists of 4 local SSDs in RAID 0. An N2D machine with 32 vCPUs was used to eliminate any CPU bottleneck. To make sure it wasn't a problem with the OS or filesystem, I've tried various versions, including the ones recommended by Google, but the result is always the same.
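For reference, an instance along these lines can be created with something like the following gcloud command (the name, zone, and boot-disk assumption are placeholders rather than my exact settings):

# Assumes the 128 GB persistent SSD is the boot disk; name and zone are examples.
gcloud compute instances create ssd-bench \
    --zone=us-central1-a \
    --machine-type=n2d-standard-32 \
    --boot-disk-type=pd-ssd --boot-disk-size=128GB \
    --local-ssd=interface=NVME --local-ssd=interface=NVME \
    --local-ssd=interface=NVME --local-ssd=interface=NVME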

On average, the results of my experiments are:

SSD                                Latency    Throughput
Zonal persistent SSD (128 GB)      ~1.5 ms    ~700 writes/second
Local SSD (4 NVMe SSDs, RAID 0)    ~14 ms     ~71 writes/second
Local SSD (1 SSD)                  ~13 ms     ~75 writes/second

I'm at a bit of a loss as to how to proceed, as I'm not sure whether this result is to be expected. If it is, it seems like my best option is to use zonal persistent disks. Do these results seem correct, or might there be some problem with my setup?

Suggestions along the lines of relaxing write caching/barriers will certainly improve performance, but the goal here is to obtain fast performance for genuinely synchronous disk writes. Otherwise, my best option would be zonal persistent SSDs (they offer replicated storage), or just using RAM, which will always be faster than any SSD.

As AdolfoOG suggested, there might be an issue with my RAID configuration, so to shed some light on this, here are the commands I use to create my RAID 0 setup with four devices. Note that /dev/nvme0nX refers to each NVMe device I'm using.

sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4 
sudo mkfs.ext4 -F /dev/md0 
sudo mkdir /mnt/disks/ 
sudo mkdir /mnt/disks/stable-store 
sudo mount /dev/md0 /mnt/disks/stable-store 
sudo chmod a+w /mnt/disks/stable-store 

This should be the same process as Google advises, unless I messed something up, of course!
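In case it helps, the resulting array and mount can be inspected with standard tools; a quick sketch (output omitted, device names as above):

# Inspect the assembled array: member devices, chunk size, and state.
cat /proc/mdstat
sudo mdadm --detail /dev/md0

# Confirm the filesystem layout and mount point.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
df -h /mnt/disks/stable-store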

  • How is the storage configured regarding write caching? https://stackoverflow.com/q/27087912/8016720 – John Hanley Dec 07 '20 at 20:36
  • I suppose a write-through pattern is used. (Each transaction is created, writes 4KB, then commits with an fsync.) – Michael Davis Dec 07 '20 at 22:58
  • There is something wrong with your local SSD configuration. SSDs have very small latencies, around 1 ms. Latencies of 21 ms are more in line with spinning hard disks. – John Hanley Dec 08 '20 at 00:35
  • Do you know what this might be? I've set everything up according to Google's recommendations: NVMe, an Ubuntu image optimised for NVMe, and a RAID 0 setup. Similar issues are reported here: https://medium.com/@rimantasragainis/cloud-nvmes-the-blind-side-of-them-da927d09b378. The author there found an optimised OS image that vastly improved performance, but it was an experimental image given by Google staff, so I can't rely on that as a solution. I've run the same benchmark on my own local computer and get the results you suggest would be correct, but with GCP local NVMe the performance is much worse. – Michael Davis Dec 08 '20 at 12:46

2 Answers


Answer completely edited after the original question was edited:

I tried to replicate your situation using a more "stock" approach: instead of coding anything to test the MB/s, I just used "dd" and "hdparm". I also used an N2-standard-32 instance type with a 100 GB persistent SSD as the boot disk and a RAID 0 of 4 NVMe local SSDs. Below are my results:

Write tests:

Persistent SSD:

root@instance-1:~# dd if=/dev/zero of=./test oflag=direct bs=1M count=16k
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 18.2175 s, 943 MB/s

root@instance-1:~# dd if=/dev/zero of=./test oflag=direct bs=1M count=32k
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 42.1738 s, 815 MB/s

root@instance-1:~# dd if=/dev/zero of=./test oflag=direct bs=1M count=64k
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB, 64 GiB) copied, 83.6243 s, 822 MB/s

Local SSD:

root@instance-1:~# dd if=/dev/zero of=/mnt/disks/raid/test oflag=direct bs=1M count=16k
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 10.6567 s, 1.6 GB/s

root@instance-1:~# dd if=/dev/zero of=/mnt/disks/raid/test oflag=direct bs=1M count=32k
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 21.26 s, 1.6 GB/s

root@instance-1:~# dd if=/dev/zero of=/mnt/disks/raid/test oflag=direct bs=1M count=64k
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB, 64 GiB) copied, 42.4611 s, 1.6 GB/s

Read tests:

Persistent SSD:

root@instance-1:~# hdparm -tv /dev/sda

/dev/sda:
 multcount     =  0 (off)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 13054/255/63, sectors = 209715200, start = 0
 Timing buffered disk reads: 740 MB in 3.00 seconds = 246.60 MB/sec

Local SSD:

root@instance-1:~# hdparm -tv /dev/md0

/dev/md0:
 readonly      =  0 (off)
 readahead     = 8192 (on)
 geometry      = 393083904/2/4, sectors = 3144671232, start = unknown
 Timing buffered disk reads: 6888 MB in 3.00 seconds = 2761.63 MB/sec

So I'm actually seeing better performance on the local SSD RAID, and according to the performance table I got the expected results for both reads and writes:

Throughput (MB/s):  Read: 2,650;  Write: 1,400

So maybe there is something odd with the way you tested the performance, since you mentioned that you wrote a little script to do it. If you try a more "stock" approach, you might get the same results I got.

AdolfoOG
  • Sorry, I made a mistake in my table. I meant to write 1.5ms for a zonal persistent SSD (updated now)! The experiments were also carried out on 32 vCPU N2D machines. I have also checked out and implemented the suggestions in the articles you referenced before but they made no improvement beyond the results reported. My goal is to have a high throughput SSD, and so I would be fine with either option. I suppose I'm just surprised that zonal SSDs seem to be better performing than local SSDs which are advertised to be the performance-focused option. – Michael Davis Dec 07 '20 at 22:47
  • In reference to my experiments, I have just created a little benchmark program that in a loop, creates transactions, writes 4KB, and commits. This is used to obtain the latency and throughput of synchronous disk writes. LMDB is set up to perform an fsync for each commit. – Michael Davis Dec 07 '20 at 22:53
  • Got you. So, just as an experiment, can you try to run the test using only one local SSD (not RAID)? There might be an issue with how the RAID arrays are created in GCP itself. – AdolfoOG Dec 09 '20 at 18:32
  • Thanks for your response Adolfo, I've run the experiment on a single partition and obtained the following result: Average Latency: ~13ms Throughput: ~75 writes/second This is about the same as when running with a RAID setup, so I think you might be right when it comes to there being a problem when running a RAID setup. Weirdly enough, I do get better results with the RAID setup when mounting with the nobarrier option (~6770 writes/second with 4 partitions vs. ~5540 writes/second with 1 partition). – Michael Davis Dec 10 '20 at 13:20
  • To give more light on how I'm setting up the RAID partition, I use the following commands: sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4 sudo mkfs.ext4 -F /dev/md0 sudo mkdir /mnt/disks/ sudo mkdir /mnt/disks/stable-store sudo mount /dev/md0 /mnt/disks/stable-store sudo chmod a+w /mnt/disks/stable-store This should be the same process as what Google advises, unless I messed something up of course! – Michael Davis Dec 10 '20 at 13:21
  • I tried replicating the situation on my side, it took a while as I didn't have quota and stuff but I'm gonna edit my answer with my findings. – AdolfoOG Dec 11 '20 at 18:23
  • Thanks for your follow-up Adolfo, and for performing your tests! If I run the tests you suggested I do get the expected result of 1.6 GB/s for writes on a local SSD. For my use case, a closer dd test with 4 KB writes synchronously written to the database would be: dd if=/dev/zero of=/mnt/disks/stable-store/test oflag=sync bs=4KB count=1k. With that test I get a result of 541 kB/s, which is a bit better than the converted result of my benchmark (75 writes/second * 4 KB writes = 300 KB/s). Do you see a similar result with this command? – Michael Davis Dec 11 '20 at 20:16
  • Actually yes, I do see the same behaviour with a lower bs, and also for direct. However, for the differences between the sync and direct flags in the dd command I'd check this post: https://unix.stackexchange.com/questions/508701/dd-command-oflag-direct-and-sync-flags Honestly, for raw performance "direct" sounds ideal to me, but I could mitigate the slowness of sync by increasing the bs; still, the result didn't go past ~700 MB/s. – AdolfoOG Dec 11 '20 at 22:10
  • Unfortunately I don't think I can change any of my settings with regard to write size and synchronous writes, as I need to write random 4 KB (max) writes synchronously. This is for a database application where I need durable transactions. It seems the local SSD is unsuited for this application, at least compared to persistent SSDs. This seems to go against Google's guidance, however (https://cloud.google.com/compute/docs/disks). – Michael Davis Dec 14 '20 at 12:28

Local SSDs are optimized for temporary storage, and writes are ack'd once they hit the SSD write cache. Then, per the documentation, those writes are committed to stable media within 2 seconds. Given the reliability guarantees of Local SSD (very low: the device can fail and the data is lost), this seems like a reasonable tradeoff.

With Local SSD, if your application does a write, then an fsync, then a write, then an fsync, the high latency of those fsync calls adds up and likely explains the high latency you observed. The solution would be to skip those fsyncs, either in your DB or when you mount the filesystem; see the documentation link mentioned for more. Frankly, whenever you use Local SSD you should be prepared to lose that data, either because you can recreate it (job-processing use case) or because you have redundancy at a higher layer (app/DB use case).
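As one concrete sketch of relaxing this at the filesystem level (reusing the mount point from your question), ext4 can be remounted without write barriers; this trades durability for speed, so only do it if losing the most recent writes on a crash is acceptable:

# Remount the local SSD array without write barriers (ext4 nobarrier option).
# WARNING: flushes no longer guarantee data is on stable media.
sudo mount -o remount,nobarrier /mnt/disks/stable-store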

With a Zonal PD, writes are stable once ack'd, and any fsync is basically a no-op that returns quite quickly. The write latency from your tests seems on the high side. If you are reaching IO or throughput limits you will see those plateau and latency increase. For the lowest latency, I'd try creating the disk and VM in one request (this increases the likelihood that the resources will be nearby within the zone) and see if that helps. If latency is stable, and you need more IOPS and aren't at the disk's IOPS limits, then most likely you need to increase concurrency to get more work "in flight", resulting in higher IOPS.
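As a rough illustration of keeping more work in flight (paths, sizes, and job counts here are examples only, not tuned values), a fio run along these lines issues many concurrent 4 KB writes against a directory on the persistent disk:

# Higher queue depth and more jobs raise aggregate IOPS on a persistent disk,
# up to the disk and VM limits; compare against a single synchronous writer.
fio --name=pd-concurrent-writes \
    --directory=/mnt/pd-test \
    --rw=randwrite --bs=4k --size=1G \
    --ioengine=libaio --direct=1 \
    --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based --group_reporting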

Chris Madden