Microsoft Storage Spaces Direct (S2D)
Anarchy: Well, you’ll figure out who’s sitting on the couch, and who’s holding the camera 😊.
Bio & a bit of History Channel
Microsoft does not understand storage. They never really did! Period. Well, just take a look back at what these clowns did in pre-S2D times! Basically, it was one big clusterfuck. First, Microsoft created Storage Spaces. Software RAID. LVM on Windows. Nice! Remember its miserable performance in general and slow parity writes in particular? Painful rebuilds taking days and weeks to finish? Sparse CLI and anemic GUI management? Lack of any health monitoring except polling the poor little thing with PowerShell scripts? Next, as these guys realized they needed something shared once virtualization had kicked in, Microsoft released Clustered Storage Spaces, optionally paired with Scale-Out File Server (SOFS). Well, that was a really “genius” solution! You needed at least two servers acting as storage controllers, a SAS JBOD acting as a backplane, and a full JBOD of SAS disks to build some shared storage to make your VMs happy. Looks and feels like a poor man’s NetApp / Nimble / Compellent (dual controllers, single shared SAS backplane etc.), doesn’t it? Except the whole goddamn thing wasn’t THAT cheap! …and it had no real GUI behind it, forcing system administrators to learn PowerShell for every single fart. Does Microsoft understand the difference between “sex” and “forced sex”? Well, compare this to a very neat NetApp / Nimble / Compellent web UI and a super-clean SSH CLI! OK, whatever… Cheap SATA disks and super-performing NVMe drives couldn’t be used, just because Microsoft only accepted SAS with its dual ports. Idiots… Plus, this piece of shit couldn’t use RAM as a write-back cache, something NetApp & Nimble could do from the very beginning! At the end of the day it was a good solution… For fools who just wanted to leave their money on the table! For people who didn’t give a fuck about SLAs (Service Level Agreements).
I can bet nobody at Microsoft understands that people don’t buy SANs for capacity or IOPS only; a SAN is your fucking insurance and a 100% frictionless way to focus on your mainstream activities instead of babysitting your storage. Well, with no HCL (Hardware Compatibility List), every single firmware upgrade with Clustered Storage Spaces turned into a game of Russian roulette. Again, Clustered Storage Spaces + SOFS started as one big steaming pile of rubbish. Eventually, it matured a bit, and just when people had figured out how to keep it running for days, and themselves in business without committing suicide, Microsoft changed focus. Idiots never give up, and this is how Storage Spaces Direct appeared.
Now, let’s look at S2D itself. Is it a gleam of light or just one more of Microsoft’s failures? Keep reading and we’ll figure it out. S2D was designed as highly available (HA) and highly scalable software-defined storage (SDS). It lets you create virtual volumes on top of a storage pool which, in turn, can be comprised of many physical disks. This time these can be spinning rust, SATA SSDs, or PCIe-attached NVMe drives in pretty much any possible combination. S2D utilizes flash-only caching (still no RAM cache…), storage tiering, and erasure coding on 4 or more nodes. Together with RDMA (if your switches support it, lol; if you don’t have RDMA gear you can go fuck yourself, it’s a complete “no go” since S2D is a bandwidth pig and a CPU ogre…) and NVMe drives (sure, you have to have some…), that son of a bitch delivers “unrivaled efficiency and performance”. On paper 🙂 A bitter thing about S2D is licensing. As a feature, it’s available only as part of Windows Server Datacenter Edition. In other words, to get S2D you basically have to pay for all the “bells & whistles” Datacenter Edition has… even if you never asked for them! You know, it looks like Microsoft management was too lazy to create one more SKU like VMware did with their vSAN, so they just added their proprietary SDS to the rich man’s license. Fuck! So I have a Nimble and need unlimited licensed VMs? I pay for S2D! I need only S2D for my SQL Server or HA file server, but I don’t run any VMs? I pay for unlimited VMs! Storage Replica is included regardless. Nice! Brain-dead motherfuckers 🙁 A “new wave” of marketing. Whatever… S2D can be implemented in both converged and hyper-converged scenarios. In other words, it works in whatever virtualized environment you’ve got, be it separate servers for compute and storage, or a single box for both layers, “alles zusammen”.
Now, we believe you know enough about Storage Spaces Direct. Let’s move on!
Today, we’re filming a tape with Storage Spaces Direct (S2D) running on top of an 8-NVMe datastore (2 enterprise-grade NVMes per host). As usual, we’ll investigate how the solution performs in a 4-node cluster. For testing purposes, we’ll gradually increase the number of VMs running in the cluster till the overall performance stops growing. At the end, we’ll run the same tests on a single NVMe to get some reference values. Based on these reference values, we’ll judge the overall S2D performance. Let’s see how far it can jump.
1. Check whether the underlying NVMes (Intel SSD DC P3700 2TB) perform as the vendor says. Tests are run in a “bare metal” Windows Server 2016 environment.
2. Build a 4-node cluster and enable Hyper-V role everywhere.
3. Pool 8 NVMes into one S2D pool.
4. Create a bunch of Virtual Disks on top of that storage pool and present them as Cluster Shared Volumes (CSV) to the cluster. Note, each VM gets its own Cluster Shared Volume, which exceeds Microsoft’s own recommendation of having at least one CSV per physical Hyper-V host.
5. Create a Windows Server 2016 VM. Its “system” disk lives on the host itself, while its 256GB “data” Hyper-V Virtual Disk lives on the Cluster Shared Volume managed by S2D. Today, this larger “data” Virtual Disk gonna be our workhorse.
6. Come up with the optimal testing parameters (number of threads and Outstanding I/Os) for DiskSPD and FIO, as we’ll use both tools to cross-check what we’re getting.
7. Test a Hyper-V VM Virtual Disk performance.
8. Clone a VM to another Hyper-V host and benchmark combined VM “data” Virtual Disks performance.
9. Warm-up. With the ridiculous Nutanix CE and VMware vSAN performance in mind, we decided to run 12 VMs in the cluster first to see whether today’s competitor can handle it. Remember, we stopped increasing the number of VMs for Nutanix and VMware because they shit their pants after we reached 10 running I/O-intensive VMs per cluster.
10. Clone’n’test cycle. Repeat step 8 until the overall cluster performance stops growing.
11. Getting a reference. Test single Intel SSD DC P3700 2TB NVMe performance in a “bare metal” Windows Server 2016 environment. The data we get at this point will be used as a reference.
What’s on the stage?
Here’s the setup used for measuring a single Intel SSD DC P3700 2TB “raw” performance:
1x Dell R730 server, CPU: 2x Intel Xeon E5-2683 v3 @ 2.00 GHz (14 physical cores per CPU), RAM: 64GB
Storage: 1x Intel SSD DC P3700 2TB
OS: Microsoft Windows Server Datacenter, Version 10.0.17650 (Build 17650)
Below is the setup configuration itself; this time it’s a cluster:
Host #1, Host #2, Host #3, Host #4
4x Dell R730 servers, CPU: 2x Intel Xeon E5-2683 v3 @ 2.00 GHz per host (14 physical cores per CPU), RAM: 64GB
Storage: 8x Intel SSD DC P3700 2TB (2 per host)
LAN: 2x Mellanox ConnectX-4 100GB CX465A
OS: Microsoft Windows Server Datacenter, Version 10.0.17650 (Build 17650)
To make the stuff we wrote above clear, here’s the interconnection diagram:
Now, let’s check everything, do some preparations, and, finally, go to the action.
Checking the Gate
Before we start, we’d like to ensure that Intel SSD DC P3700 2TB drives perform exactly as the vendor says.
Look at these official numbers Intel gives us regarding this particular NVMe drive performance:
According to the vendor, Intel SSD DC P3700 2TB achieves 460K IOPS under random reads with 4 workers and Queue Depth 32. Good!
Well, let’s get some proof. For that purpose, we checked the claimed performance under 4k random read pattern with DiskSPD v2.17 and Fio v3.5.
The plots below highlight the testing results under varying Queue Depth value.
Our testing results look similar to Intel’s claims: under 4k read pattern with 4 workers and Queue Depth = 32, Intel SSD DC P3700 2TB does reach 460K IOPS. Looks good!
Walkie Check: Network bandwidth
After installing Windows Server on all hosts, let’s install the WinOF-2 (1.90.19216.0) driver for the Mellanox ConnectX-4 network interfaces and set up the network. Afterward, check the bandwidth using iperf (for TCP/IP) and nd_send_bw (for RDMA). The latter is included in MFT (Mellanox Firmware Tools).
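We ran iperf for the TCP check; with iperf3 syntax the pair of commands would look roughly like this (IPs are our lab addresses, and the parallel-stream count is an assumption, not what we literally typed):

```shell
:: On host #1, start the server side:
iperf3 -s

:: On host #2, push eight parallel TCP streams toward host #1 for 30 seconds:
iperf3 -c 172.16.0.31 -P 8 -t 30
```

A single TCP stream usually can’t saturate a 100 GbE link, hence the parallel streams.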
First, let’s look at RDMA network bandwidth as S2D uses RDMA for internode East-West traffic:
Now, let’s check the network bandwidth between two hosts (let’s say, host #1 and host #2).
And here’s what we got for TCP/IP bandwidth (just to make sure nothing is broken; S2D *DOES NOT* use TCP unless you’re a brain-dead motherfucker who didn’t enable DCB everywhere, or you live under a rock and have no fucking idea about RDMA and how good it is for your nutrition):
So, TCP network bandwidth reached 99 Gbit/s, while RDMA network bandwidth reached 90 Gbit/s. That’s less than TCP, but RDMA is supposed to leave more CPU cycles to the running VMs, heh…
With those 460K IOPS per drive in mind, let’s think it through and figure out whether such bandwidth is enough to let S2D ace it. There are two concerns:
1) Won’t the Mellanox MSN 2100 internal crossbar switch bandwidth be a bottleneck?
2) Can Mellanox ConnectX-4 provide decent bandwidth to carry out further measurements?
First, the switch we use in our lab is based on the Mellanox Spectrum ASIC, designed to provide a massive 410 GB/s of switching bandwidth. Let’s assume reading occurs simultaneously from all eight SSDs in 4k blocks. So, the approximate workload gonna be something close to 460K IOPS * 8 ≈ 3.68M IOPS. Converted into bandwidth, that’s 14.04 GB/s (112.32 Gbit/s). See, 410 GB/s >> 14.04 GB/s. So, the switch won’t be the bottleneck for sure!
Second, there’s also no reason to worry about the network adapters. Each host has two NVMes as underlying storage. Together, these guys reach 2*460K = 920K IOPS. Converted into bandwidth, that’s 3.51 GB/s (28.08 Gbit/s) per host. At the same time, Mellanox ConnectX-4 can deliver 12.5 GB/s (90-100 Gbit/s) per port. So, the two ports can provide… let’s use some super-second-grade-elementary-school math… right, 25 GB/s (180-200 Gbit/s). Looking a bit deeper, the host network ceiling works out to (25 GB/s * 1024 * 1024 KiB) / 4 KiB ≈ 6.5M IOPS. So, looking back at the expected disk performance, 100 Gbit/s of bandwidth gonna be pretty enough for our tests.
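Written out as formulas, the per-host arithmetic above is:

```latex
\underbrace{2 \times 460\mathrm{K}}_{920\mathrm{K\ IOPS}} \times 4\,\mathrm{KiB} \approx 3.51\ \mathrm{GiB/s} \approx 28\ \mathrm{Gbit/s}
\qquad\qquad
\frac{25\ \mathrm{GiB/s}}{4\,\mathrm{KiB}} \approx 6.5\mathrm{M\ IOPS} \gg 920\mathrm{K\ IOPS}
```

So the NICs have roughly 7x the headroom the disks can actually consume.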
Enabling Hyper-V role
Now, let’s enable the Hyper-V role on each Windows Server host. You can do that with PowerShell (fuck you, Microsoft! I remember Windows NT 3.5: it had no command line I had to learn):
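A minimal sketch of what that looks like; the node names are our placeholders, and fanning out with Invoke-Command is just one convenient way to hit all four boxes at once:

```powershell
# Enable Hyper-V plus Failover Clustering on every node, then reboot each box.
# Node names below stand in for our four Dell R730 hosts.
$nodes = "S2D-Node1", "S2D-Node2", "S2D-Node3", "S2D-Node4"
Invoke-Command -ComputerName $nodes -ScriptBlock {
    Install-WindowsFeature -Name Hyper-V, Failover-Clustering -IncludeManagementTools -Restart
}
```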
Right after the cmdlet is run, the installation process starts. Now, have a coffee and let that Danish guy finish, LOL:
Once the roles are installed, you should see the following output:
Once done with your coffee, go to the Server Manager console and check whether server roles have been installed on each host:
Next, check out whether nodes allow creating an S2D cluster:
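Cluster validation is a one-liner; something along these lines, with the IPs matching our lab (the exact test categories we included are an assumption):

```powershell
# Validate the four nodes, including the Storage Spaces Direct test set.
# This run is what produces the Validation Report .htm file.
Test-Cluster -Node 172.16.0.31,172.16.0.32,172.16.0.33,172.16.0.37 `
    -Include "Storage Spaces Direct","Inventory","Network","System Configuration"
```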
At this point, we’d like to tell you about the Validation Report. The Validation Report <date>.htm file is generated from the node testing results. It lists possible problems you may face while creating an S2D cluster and possible ways to resolve them. Note that S2D itself is not considered a problem 😊.
Now, create the cluster using the following command:
New-Cluster -Name S2D-Test -Node 172.16.0.31,172.16.0.32,172.16.0.33,172.16.0.37 -NoStorage -StaticAddress 172.16.0.40
Go to the Failover Cluster Manager console next to check whether cluster creation has run smoothly:
Then, type Get-PhysicalDisk -CanPool $true to check which disks can be pooled:
Afterward, enable S2D with Enable-ClusterStorageSpacesDirect. This command forms the storage pool out of all available disks, with ReFS as the file system for the cluster volumes. At this point, make sure you did not enable ReFS data checksumming: you don’t want your cluster performing dog slow, do you?
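For reference, a hedged sketch of both steps; the pool name is ours, and Set-FileIntegrity on a test file is just one way to double-check that ReFS integrity streams (the checksumming) stay off where it matters:

```powershell
# Claim all eligible disks into one S2D pool, suppressing the confirmation prompt.
Enable-ClusterStorageSpacesDirect -PoolFriendlyName "S2D Pool" -Confirm:$false

# Later, make sure ReFS integrity streams are off on a test file (path is ours):
Set-FileIntegrity -FileName "C:\ClusterStorage\Volume1\test.vhdx" -Enable $false
```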
After enabling S2D, go to the Server Manager console to check how Pool creation is going:
Then, run the command below to create a new virtual disk on the S2D pool, format it with ReFS too, and turn it into a Cluster Shared Volume:
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName <disk_name> -FileSystem CSVFS_ReFS -Size 256GB -ResiliencySettingName Mirror -PhysicalDiskRedundancy 1
To reach higher performance, we set Mirror as the resiliency type and PhysicalDiskRedundancy to 1. So, it’s a two-way mirror.
Now, let’s create a Virtual Disk and store the “data” Hyper-V disk there:
Take one more look at the Virtual Disk parameters in the Server Manager and Failover Cluster Manager consoles:
Setting VM parameters
Hey, just a sec, what’s the point in putting those fast disks in the datastore? For damn performance, of course! You know, you could use something like this as an MS SQL datastore. So, to make today’s testing look more like something from real life, let’s pick standard VM parameters based on the ones Microsoft recommends for running MS SQL in Azure:
- RAM 7GB
- Disk0 (Type SCSI) – 25GB (“system” Virtual Disk)
- Disk1 (Type SCSI) – 256GB (“data” Hyper-V Virtual Disk 256GB)
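In PowerShell terms, provisioning one such VM could look roughly like this; all names and paths are our placeholders, with the fixed-size 256GB “data” VHDX landing on the S2D-backed CSV:

```powershell
# "System" disk stays local to the host; "data" disk goes onto the S2D CSV.
New-VHD -Path "D:\VMs\sql-vm01-system.vhdx" -SizeBytes 25GB -Dynamic
New-VHD -Path "C:\ClusterStorage\Volume1\sql-vm01-data.vhdx" -SizeBytes 256GB -Fixed

New-VM -Name "sql-vm01" -MemoryStartupBytes 7GB -Generation 2 `
    -VHDPath "D:\VMs\sql-vm01-system.vhdx"
Add-VMHardDiskDrive -VMName "sql-vm01" -Path "C:\ClusterStorage\Volume1\sql-vm01-data.vhdx"
```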
NOTE: Even though the test VM “data” disk is created as a fixed-size Virtual Disk, we pre-filled it with random data first, so the benchmarks read real, previously written blocks instead of empty space.
Below are the dd utility launch parameters:
dd.exe bs=1M if=/dev/random of=\\?\Device\Harddisk1\DR1 --progress
Now, let’s create the test VM and pin it to Hyper-V host #1. This VM gonna be used to find the optimal DiskSPD and FIO parameters (number of threads and Outstanding I/Os) for further measurements. By “optimal parameters” we mean the combination delivering maximum performance under reasonable latency; past that point, nothing grows except the latency. At this step, we benchmarked the 256GB Hyper-V Virtual Disk with 4k random reads under a varying number of threads and Outstanding I/O values.
Here are DiskSPD and FIO launching parameters for threads=1, Outstanding I/O=1,2,4,8,16,32,64,128:
diskspd.exe -t1 -b4k -r -w0 -o1 -d60 -Sh -L #1 > c:\log\t1-o1-4k-rand-read.txt
diskspd.exe -t1 -b4k -r -w0 -o2 -d60 -Sh -L #1 > c:\log\t1-o2-4k-rand-read.txt
diskspd.exe -t1 -b4k -r -w0 -o4 -d60 -Sh -L #1 > c:\log\t1-o4-4k-rand-read.txt
diskspd.exe -t1 -b4k -r -w0 -o8 -d60 -Sh -L #1 > c:\log\t1-o8-4k-rand-read.txt
diskspd.exe -t1 -b4k -r -w0 -o16 -d60 -Sh -L #1 > c:\log\t1-o16-4k-rand-read.txt
diskspd.exe -t1 -b4k -r -w0 -o32 -d60 -Sh -L #1 > c:\log\t1-o32-4k-rand-read.txt
diskspd.exe -t1 -b4k -r -w0 -o64 -d60 -Sh -L #1 > c:\log\t1-o64-4k-rand-read.txt
diskspd.exe -t1 -b4k -r -w0 -o128 -d60 -Sh -L #1 > c:\log\t1-o128-4k-rand-read.txt
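The matching FIO command lines aren’t listed above, so here’s a hedged equivalent of the same sweep; the flags mirror the DiskSPD ones, while the windowsaio engine, target path, and file size are our assumptions:

```shell
:: 4k random read, 1 worker, iodepth swept over 1..128 (shown here for 64)
fio --name=t1-o64-4k-rand-read --ioengine=windowsaio --direct=1 ^
    --rw=randread --bs=4k --numjobs=1 --iodepth=64 ^
    --runtime=60 --time_based --filename=D\:\testfile.dat --size=50G
```

Note that fio on Windows wants the drive-letter colon escaped in --filename, as shown.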
We’ve got some numbers, so let’s take a closer look at them. First, look at the FIO and DiskSPD plots. Hyper-V Virtual Disk performance tops out at 148K-151K IOPS, while the average disk latency is around 1.72 ms. In our setup these results can be observed under any number of threads, so we decided to carry out further measurements with 4 threads, keeping Outstanding I/O = 64.
Configuring testing tools
We derived the optimal parameters for our measurements from the experiments above.
So, once again, we used DiskSPD v2.17 and FIO v3.5 for testing. Measurements were carried out under varying Queue Depth (QD) values in the following patterns:
– 4k random write
– 4k random read
– 64k random write
– 64k random read
– 8k random 70%read/30%write (OLTP production workload)
– 1M sequential read (File server & streaming)
And here come the test utility parameters: threads=4, Outstanding I/O=8, time=60 sec:
diskspd.exe -t4 -b4k -r -w100 -o8 -d60 -Sh -L #1 > c:\log\4k-rand-write.txt
diskspd.exe -t4 -b4k -r -w0 -o8 -d60 -Sh -L #1 > c:\log\4k-rand-read.txt
diskspd.exe -t4 -b64k -r -w100 -o8 -d60 -Sh -L #1 > c:\log\64k-rand-write.txt
diskspd.exe -t4 -b64k -r -w0 -o8 -d60 -Sh -L #1 > c:\log\64k-rand-read.txt
diskspd.exe -t4 -b8k -r -w30 -o8 -d60 -Sh -L #1 > c:\log\8k-rand-70read-30write.txt
diskspd.exe -t4 -b1M -s -w0 -o8 -d60 -Sh -L #1 > c:\log\1M-seq-read.txt
[4k random write]
[4k random read]
[64k random write]
[64k random read]
[8k random 70%read/30%write]
[1M sequential read]
Camera, motion… ACTION!
So, all the preparations are over. We can do some real stuff now. First, let’s study how S2D performance changes with the number of VMs in the cluster! We start with one VM running on a Hyper-V Virtual Disk and benchmark it under the various patterns. Next, we assign one more VM, but pin it to a different host, and so on. In other words, we tried to estimate how many dicks S2D can take 😊. Well, let’s hope it turns out more than Sasha Grey used to 😊. At this point, it may be necessary to come up with a “stop word” or so. BDSM rules 🙂 But who gives a fuck? We’ll stop only when S2D performance stops growing. Here, we used Round Robin placement, so S2D should spread the workload evenly across the cluster.
Wait, remember that Nutanix CE and VMware vSAN did not go further than 12 VMs. Let’s start with a 12-VM warm-up to see whether today’s competitor can do better. Just to make sure the tests are backward compatible and we won’t compare apples to oranges.
Now, let’s talk about CPU utilization. In hyper-converged infrastructures, CPU usage is a crucial parameter because once the CPU gets overwhelmed, the cluster performs at a snail’s pace. In other words, if your CPU gets entirely eaten by I/O, there will be no compute power left for anything else, I mean your production VMs. As you remember, we use 2x Intel Xeon E5-2683 v3 @ 2.00 GHz with 14 cores per “stone”. Each core, in turn, is split into 2 logical ones (kind of… it’s not IBM POWER or Sun SPARC with proper hardware threads, unfortunately…), so we have 56 logical cores in total.
Damn, this son of a bitch does better than Nutanix CE and VMware vSAN! Nutanix and VMware suck ass and choke on balls, dirty motherfuckers 🙂 . Here, S2D exhibits close-to-linear performance growth under 4k and 8k loads, so we’re excited to assign even more VMs. Let’s see how many VMs the S2D cluster can actually fit before its performance stops growing.
Again, CPU usage is important. So, let’s take a look at its utilization with 20 VMs running in the cluster.
After studying all those plots and tables, let’s discuss what they actually mean. Under the 4k random read and 4k random write patterns, S2D performance starts choking once the 20th VM gets provisioned. Things look similar for the 8k random 70%read/30%write load. We stopped the measurements there, because under the other patterns there was no performance growth at all; those saturated with even fewer VMs in the cluster. And just look at what happens to the CPU! It looks like S2D can’t balance the load: NUMA node 0 gets stuck at 100% while NUMA node 1 basically jerks off at the same time… Microsoft could do a better job here!
The Window Shot: Testing Single NVMe Disk Performance
Now, let’s see what we can squeeze out of the 8-NVMe datastore! First, we estimated single Intel SSD DC P3700 2TB performance in a Windows Server 2016 environment under the same set of test patterns. This lets us figure out whether pooling 8 NVMes together boosts the overall S2D performance (or not).
Before we go further, we want to explain how the expected performance values were calculated. Let’s assume that S2D is a good SDS (software-defined storage, not shit-defined storage 😊) that can read from all 8 disks simultaneously, OK? Then the overall storage pool read performance should be:
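The original formula graphic isn’t in the text anymore, but given the 460K IOPS per-drive figure from above, it presumably boiled down to:

```latex
\mathrm{IOPS}_{\text{read}} = N_{\text{disks}} \times \mathrm{IOPS}_{\text{disk}} = 8 \times 460\mathrm{K} \approx 3.68\mathrm{M}
```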
Now, let’s look at writing. Wait, not so fast: do you remember that two-way mirroring? Well, at this point, let’s do some harder math to estimate performance:
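With a two-way mirror every write has to land on two disks, so (reconstructing the missing formula from the explanation that follows) the expected write ceiling is:

```latex
\mathrm{IOPS}_{\text{write}} = \frac{N_{\text{disks}} \times \mathrm{IOPS}_{\text{disk,write}}}{N_{\text{mirrors}}} = \frac{8 \times \mathrm{IOPS}_{\text{disk,write}}}{2} = 4 \times \mathrm{IOPS}_{\text{disk,write}}
```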
Here, 8 stands for the number of disks, and 2 for the number of mirror copies.
In other words, we can read from all the disks in the underlying storage at a time, while write performance, in its turn, won’t go beyond half of the combined disk performance. The diagram below depicts the process.
Now, you’re ready for some DAMN HARDCORE math to estimate the expected performance value for the mixed (8k random 70%read/30%write) load:
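The formula itself is missing from the text; one common way to estimate a 70/30 mix is the weighted harmonic mean of the read and write ceilings (our assumption, not necessarily the exact formula used in the original):

```latex
\mathrm{IOPS}_{70/30} = \left( \frac{0.7}{\mathrm{IOPS}_{\text{read}}} + \frac{0.3}{\mathrm{IOPS}_{\text{write}}} \right)^{-1}
```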
Now, let’s go to the tests!
We believe we’ve seen enough for today and have enough info to draw some conclusions, so let’s jump directly there.
After all, it’s time to sum things up. We observed a remarkable performance increase while testing 4k read and write performance (1.25M and 730K IOPS respectively). Things kept looking good till the 20th VM was assigned to the cluster; that’s almost 5 VMs per host before performance gets close to its maximum. As for the other patterns, performance fluctuates or even drops.
Talking about the overall performance, we feel fucking… surprised! Finally! After so many years in the IT business, Microsoft still doesn’t understand storage, but they seem to have succeeded in making storage deliver! Love it or hate it, but S2D did a really good job (without blowing):
4k random write – 43%(DiskSPD), 53%(Fio) of expected performance
4k random read – 35%(DiskSPD), 28%(Fio) of expected performance
64k random write – 61%(DiskSPD), 46%(Fio) of expected performance
64k random read – 64%(DiskSPD), 84%(Fio) of expected performance
8k random 70%read/30%write – 44%(DiskSPD), 31%(Fio) of expected performance
1M seq read – 69%(DiskSPD), 76%(Fio) of expected performance
Is this the new high score in town? Do we have a new king of the hill? Yep! Congratulations to Microsoft! With Storage Spaces Direct (S2D) you guys managed to create something remarkable that people are not sick of anymore 🙂 .