StarWind Virtual SAN: Can I play with madness… RDMA?!?

Target

StarWind Virtual SAN

https://www.starwindsoftware.com/starwind-virtual-san

Once upon a time there was a… “A king!” my little readers will say. No, children, not a king! Once upon a time there was a… Actually, you little drops o’ ewo have a point. A king… No… A. HWEMKPI. KING. OF. THE. HILL.

We wrote a long and painful read on what StarWind Virtual SAN actually is here. Long story short: several big brown bears (BBBs, heh…) met somewhere in the deep Siberian woods to hang out, sniff some nice stuff, and do some coding. They settled down with an icpidcpi party (after drinking lots of vodka, of course!) and… This is how StarWind VSAN was born. Basically, it’s a beer can (empty vodka bottle?) with a rocket engine running circles around any closest competitor when it comes to performance. A hell of a thing to manage and scale, though (sic! UI/UX is for queers, right? CLI once and forever!). It’s available for free without the GUI; PowerShell is “in da house” even here. Hwem Microsoft for what they did to the Windows NT ancestry! “You… Dcuvctfu!” (in Kyle’s voice).

Oh, and StarWind doesn’t give a ujkv about the hardware or the software either… When it comes to servers and operating systems, StarWind can fly high on whatever you’ve salvaged from your previous abandoned projects and failed learning curves. When it comes to network gear and protocols, oh, boy… Only drunk Russian bears could make such a weird thing… work over TCP/IP! See, we aren’t in Kansas anymore; it’s not 1990, and it’s not 2010 either. We don’t blow into these tiny 10 gig Ethernet tubes anymore: dual-port 100 GbE NICs are below $500, and less than $5K in cash can buy you a decent 100 GbE switch. TCP/IP and 100 gig Ethernet don’t mix well, regardless of what your grandma tells you; you need RDMA to make this iqffcop thing shine! RoCE (v1? v2?!?) or iWARP, doesn’t matter, but no TCP and none of these small fkems 1,500-byte frames, please! Those are for kids and VMware VSAN retards only. VMware lives under a rock with a thumb up its cuu and doesn’t know there’s RDMA, nested virtualization, airplanes, dnqylqds, and gay marriages.

OK, whatever… So, there’s something we didn’t test in the previous part… And this “something” is… StarWind over RDMA! How fast can drunken bears run on steroids?

Status

Unknown. The drunken bear is playing the balalaika and lgtmkpi qhh instead of doing rocket science. One more expedition is needed 😊

Mission

This is effectively the second part of our study. Before, we challenged StarWind VSAN pumping data over iSCSI (TCP). And, you know what? VSAN over ancient iSCSI (TCP) generally performed much better than Microsoft Storage Spaces Direct (S2D) did over the hwemkpi great SMB3 (RDMA). At least StarWind VSAN could handle load balancing the right way, keeping all NUMA nodes busy, while S2D was playing the stretching game by putting all the fkems and their load into a single hole… NUMA node. Shame on Microsoft for trying to steal Sasha Grey’s glory! Plus, StarWind could achieve close-to-maximum performance metrics with just 3 or 4 running VMs, while Microsoft required pretty much a “full house”. And, as a final ewo on S2D’s face, StarWind VSAN scaled much smoother and more predictably, while Storage Spaces Direct behaved like a hwemkpi roller coaster. OK, whatever…

Now, let’s take a closer look at StarWind Virtual SAN performance over iSER (RDMA). Remote Direct Memory Access, remember? No CPU involved… All to make things faster! Like before, we’ll use 8 Intel DC P3700 2TB NVMe enterprise flash drives spread equally among the 4 server nodes as a datastore to achieve the maximum possible combined performance. We’ll keep increasing the number of I/O-generating VMs running in the cluster until its overall performance hits the saturation point.

Considerations and milestones

By design, StarWind VSAN is fully and “natively” (not naïvely, fwodcuu!) integrated into the Windows Server operating system or the Hyper-V hypervisor. The VMware story is quite different, but who the hwem cares about VMware now? So, we run a bloody mix of Windows Server, Hyper-V, and StarWind VSAN on an all-NVMe datastore that should deliver impressive performance. That’s a scientific guess, if you care. The experiment will prove it. Or deny it… As usual, we expect an overall cluster performance gain every time an extra I/O-generating VM gets on board.

Here’s what we’re gonna do:

1. Check whether the underlying storage performs as it should. Measure single Intel SSD DC P3700 “raw” performance in a “bare metal” Windows Server 2016 environment. True, we have a bunch of them, but measuring one NVMe’s performance should be enough. We aren’t sending anybody to the Moon, heh? Same thing we did for our previous TCP test, but it’s here again just to make things look solid.

2. Deploy StarWind VSAN on each cluster host. Enable the Hyper-V role and configure MPIO (see the PowerShell sketch right after this list). Today, we do RDMA and iSER instead of TCP and iSCSI 😊

3. Create a distributed StarWind virtual device replicated between just two hosts (Host #1 and Host #2). This time, StarWind will handle “guest” connections from Hyper-V to the remote virtual storage server over iSCSI (TCP). See, there’s no iSER support within the Windows Server built-in iSCSI initiator, and StarWind didn’t bother to write their own kernel-mode one, which is very unfortunate, TBH… StarWind will synchronize data and metadata, keep in-memory caches coherent, and acknowledge “guest” writes over iSER (RDMA) using StarWind’s built-in user-land iSER initiator, and it will handle loopback iSCSI (TCP) connections over DMA, bypassing the local Windows TCP/IP stack overhead. ATTENTION! It’s not 100% RDMA, as Windows has no iSER! A noticeable part of the network traffic still goes over iSCSI (TCP).

4. Connect the StarWind virtual device on Host #1 to the one on Host #2 (two 127.0.0.1 loopback sessions and two iSCSI sessions with Host #2 – see (3) about the loopback TCP optimization!). The Round Robin MPIO policy is used by default. Next, format the resulting virtual device as NTFS and assign a mount point (drive letter) to it. The sketch after this list shows roughly how this looks from PowerShell, too.

5. Create a Hyper-V VM with its “system” virtual disk kept on the local Windows host and an extra 256 GB “data” test VHDX stored on the StarWind virtual device. The latter is gonna be our workhorse bear today 😊

6. Decide on the optimal test utilities and their launch parameters: the number of I/O threads and outstanding I/Os.

7. Measure the StarWind virtual device performance with one VM running in the cluster.

8. Clone this VM to another Hyper-V host (the second sketch after this list shows one way to script that). Measure the overall cluster performance and clone the VM again. For each new VM, a new StarWind virtual device and “data” VHDX were created.

9. Clone the VM, measure performance, and clone it again… And again, until the total cluster performance hits the saturation point. Keep an eye on CPU utilization, since what we’re building here is expected to be somebody’s HCI setup.

10. Test the single Intel SSD DC P3700 2TB performance in a Windows Server 2016 “bare metal” environment. This will be kind of a reference to judge the performance VSAN can potentially deliver.
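
Since PowerShell is “in da house”, here’s a minimal sketch of the Windows side of steps (2) and (4). The partner IP, drive letter, and volume label are made-up placeholders, and the StarWind device creation itself (done on the StarWind side) is left out:

# Step 2: enable the Hyper-V role and MPIO (a reboot follows the Hyper-V install)
Install-WindowsFeature -Name Hyper-V -IncludeManagementTools
Install-WindowsFeature -Name Multipath-IO
Enable-MSDSMAutomaticClaim -BusType iSCSI          # let MPIO claim iSCSI paths
Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy RR # Round Robin, as in step 4

# Step 4: two loopback sessions plus two sessions to the partner (Host #2)
New-IscsiTargetPortal -TargetPortalAddress 127.0.0.1
New-IscsiTargetPortal -TargetPortalAddress 172.16.20.2   # hypothetical partner IP
Get-IscsiTarget | Connect-IscsiTarget -IsPersistent $true -IsMultipathEnabled $true

# Format the resulting multipath disk as NTFS and mount it (S: is a placeholder)
Get-Disk | Where-Object { $_.BusType -eq 'iSCSI' -and $_.PartitionStyle -eq 'RAW' } |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -DriveLetter S -UseMaximumSize |
    Format-Volume -FileSystem NTFS -NewFileSystemLabel 'StarWind-HA'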
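
And here’s the cloning part from steps (8) and (9), equally hedged: a “golden” VM exported once, then imported with a new ID. Names and paths are hypothetical; run the import on the target host:

# Export the golden VM, then import a copy with a fresh VM ID
Export-VM -Name 'SW-LOAD-01' -Path '\\host2\VMs\export'
$vmcx = Get-ChildItem '\\host2\VMs\export' -Recurse -Filter *.vmcx | Select-Object -First 1
Import-VM -Path $vmcx.FullName -Copy -GenerateNewId -VhdDestinationPath 'C:\VMs\SW-LOAD-02'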

Pre-launching

Hardware we use

First, take a look at the setup configuration we used to check whether an NVMe disk performs just as its vendor says:

1x Dell R730, 2x Intel Xeon E5-2683 v3 @ 2.00 GHz (14 physical cores per CPU), RAM 64 GB
Storage: 1x Intel SSD DC P3700 2TB
OS: Microsoft Windows Server 2016 Datacenter 10.0.14393 N/A Build 14393

Each host used for StarWind VSAN performance testing looked like this:

Host #1, Host #2, Host #3, Host #4
Dell R730, CPU: 2x Intel Xeon E5-2683 v3 @ 2.00 GHz, RAM 64 GB
Storage: 2x Intel SSD DC P3700 2TB
LAN: 2x Mellanox ConnectX-4 100 GbE CX465A
StarWind: StarWind_8.0_R6_20180507_12166_6395_405_R6U2-release
Hypervisor: Hyper-V
OS: Microsoft Windows Server 2016 Datacenter 10.0.14393 N/A Build 14393

The scheme below makes interconnections clear:

StarWind+Hyper-V

Does the real Intel DC P3700 2TB performance match the vendor-claimed one?

Well, we do trust Intel and its datasheet. But we’d like to double-check. “Everybody lies!” (in Hugh Laurie’s voice). Also, we’d be happy to find out that at least one thing in this fcop lab works as it should, LOL.

So, here is the table with some official Intel-provided numbers from this datasheet:

The vendor claims Intel SSD DC P3700 2TB has to deliver up to 460K IOPS for reads with 4 workers under Queue Depth = 32.

Now, let’s take a look at the real performance we can get under the 4k random read pattern. As usual, we measured performance with DiskSPD v2.17 and Fio v3.5. Please find the testing results below:

Performance: Intel SSD DC P3700 2TB (RAW) – DiskSPD (4k random read)

Performance: Intel SSD DC P3700 2TB (RAW) – FIO (4k random read)

Mini-conclusion

DiskSPD and Fio show that Intel SSD DC P3700 2TB reaches the “raw” performance listed in the datasheet without compromising the latency. Well, everything looks exactly as we expected. Let’s move on.

Checking network bandwidth

Once Windows Server 2016 was installed on all hosts, we installed the WinOF-2 (1.90.19216.0) driver for Mellanox ConnectX-4 NICs.

Next, we set up the networks and checked the network bandwidth between the Windows hosts (OK, let it be Host #1 and Host #2) with iperf for TCP and nd_send_bw for RDMA. The latter is available in MFT (Mellanox Firmware Tools).
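
For reference, here’s roughly what those checks look like from the command line. This is a hedged sketch: the IP is a placeholder, we show iperf3 flags, and the nd_send_bw switches are quoted from the Mellanox docs we had at hand, so verify them against your package version:

# TCP: iperf3 server on Host #1, client on Host #2 (8 parallel streams, 30 seconds)
.\iperf3.exe -s                        # on Host #1
.\iperf3.exe -c 172.16.10.1 -P 8 -t 30 # on Host #2

# RDMA: nd_send_bw from the Mellanox ND performance tools
.\nd_send_bw.exe -S 172.16.10.1        # server side, binds the local IP
.\nd_send_bw.exe -C 172.16.10.1        # client side, connects to the server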

Let’s look at RDMA networking bandwidth first. We do RDMA today, so we DO CARE ABOUT IT NOW!

Here’s the network bandwidth between the hosts (Host #1 and Host #2, remember?) over RDMA:

And, in the end, it’s time to check what the TCP networking bandwidth looks like. Remember, some connections are still iSCSI (TCP) today, so we want both RDMA and TCP to fly… We did this before here, but it’s a good idea to ensure everything works as it should:

Mini-conclusion

Now, with those 460K IOPS/disk under 4k random read pattern and 2 NVMe drives/host in mind, let’s do some quick math to ensure that 90 Gbit/s per port won’t let us down today.

In our lab, we use a Mellanox MSN2100. That thing is based on the Mellanox Spectrum ASIC capable of delivering an impressive 3.2 Tbit/s (400 GB/s) of bandwidth. Looks good, right? Under the 4K random read pattern, the overall performance will, in the best case, reach 3.68 M IOPS (460 K IOPS * 8 NVMe = 3.68 M IOPS). Converted into bandwidth, that’s 14.04 GB/s, or 112.32 Gbit/s. So, the switch we have won’t be the bottleneck in today’s study.

Now, let’s inspect the ports! nd_send_bw says that each Mellanox ConnectX-4 port can deliver 90 Gbit/s. Obviously, two ports should ensure up to 180 Gbit/s (22.5 GB/s) of connectivity. This way, a single Windows host won’t go beyond 5.9 M IOPS ((22.5 GB/s * 1024 * 1024) / 4 KB ≈ 5.9 M IOPS). Two drives, in their turn, should exhibit 920 K IOPS (460 K IOPS * 2). 5.9 M >> 920 K. Problems, officer?

Therefore, 90 Gbit/s per port is enough for today’s study as well.
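
If you’d rather let PowerShell do the napkin math, here’s a tiny sanity-check sketch of the numbers above (nothing here is StarWind-specific, it’s just arithmetic):

# Best-case cluster ceiling: 460K IOPS per drive, 8 drives, 4 KiB blocks
$iops  = 460e3 * 8         # 3.68M IOPS
$gBps  = $iops * 4KB / 1GB # ~14.04 GB/s
$gbits = $gBps * 8         # ~112.3 Gbit/s
"Cluster ceiling: {0:N0} IOPS = {1:N2} GB/s = {2:N2} Gbit/s" -f $iops, $gBps, $gbits

# Per-host pipe: two ports at ~90 Gbit/s = 22.5 GB/s of 4K reads
$pipeIops = 22.5 * 1GB / 4KB # ~5.9M IOPS vs. 920K IOPS from two local drives
"Per-host pipe: {0:N1}M IOPS" -f ($pipeIops / 1e6)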

Enabling iSER

We run our performance tests over RDMA today, so we need to enable it in VSAN too. For that purpose, go to Configuration and click the Network option; the RDMA-friendly NICs will be listed there. Now, enable iSER: select an adapter, click Modify, check the “Enable iSER for this Interface (experimental)” checkbox, and click OK:

Launching

Let’s create a VM

Why do we use an all-NVMe datastore? Why did we go with all these weird-looking VM parameters? Well, there’s a good reason for that… Wait, why don’t you check out the S2D testing script? There, we describe why we do everything we do in great detail! So… The VM!

– 4x vCPU
– RAM 7GB
– Disk0 (Type SCSI) – 25GB (“system” OS Windows Server 2016)
– Disk1 (Type SCSI) – 256GB (Hyper-V Virtual Disk 256GB “data” VHDX)
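
Here’s a hedged PowerShell equivalent of that VM. Names and paths are hypothetical, and S: is wherever the StarWind device from step (4) got mounted:

# Generation 2 VM: 7 GB RAM, 25 GB "system" disk on local storage
New-VM -Name 'SW-LOAD-01' -Generation 2 -MemoryStartupBytes 7GB -NewVHDPath 'C:\VMs\SW-LOAD-01\system.vhdx' -NewVHDSizeBytes 25GB
Set-VMProcessor -VMName 'SW-LOAD-01' -Count 4

# Plus a fixed 256 GB "data" VHDX on the StarWind virtual device
New-VHD -Path 'S:\SW-LOAD-01-data.vhdx' -SizeBytes 256GB -Fixed
Add-VMHardDiskDrive -VMName 'SW-LOAD-01' -ControllerType SCSI -Path 'S:\SW-LOAD-01-data.vhdx'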

NOTE. Even though we used only fixed virtual disks today, we completely filled them with random garbage using dd.exe before running each test. It’s a good idea to do so whenever you create a VM virtual disk or adjust its size.
So, here are the dd.exe launch parameters:
dd.exe bs=1M if=/dev/random of=\\?\Device\Harddisk1\DR1 --progress

Picking test utility launching parameters

To pick the right number of I/O threads and outstanding I/Os, we created a VM and pinned it to Host #1. Then, we measured the StarWind virtual device performance under the 4K 100% random read pattern with varying numbers of I/O threads and outstanding I/Os. At some point, the performance hit the ceiling and saturated. That was the point with the optimal number of outstanding I/Os and I/O threads we were looking for!

Here are the DiskSPD launch parameters for threads=1 and outstanding I/O=1,2,4,8,16,32,64,128,256:

diskspd.exe -t1 -b4k -r -w0 -o1 -d60 -Sh -L #1 > c:\log\t1-o1-4k-rand-read.txt
timeout 10
diskspd.exe -t1 -b4k -r -w0 -o2 -d60 -Sh -L #1 > c:\log\t1-o2-4k-rand-read.txt
timeout 10
diskspd.exe -t1 -b4k -r -w0 -o4 -d60 -Sh -L #1 > c:\log\t1-o4-4k-rand-read.txt
timeout 10
diskspd.exe -t1 -b4k -r -w0 -o8 -d60 -Sh -L #1 > c:\log\t1-o8-4k-rand-read.txt
timeout 10
diskspd.exe -t1 -b4k -r -w0 -o16 -d60 -Sh -L #1 > c:\log\t1-o16-4k-rand-read.txt
timeout 10
diskspd.exe -t1 -b4k -r -w0 -o32 -d60 -Sh -L #1 > c:\log\t1-o32-4k-rand-read.txt
timeout 10
diskspd.exe -t1 -b4k -r -w0 -o64 -d60 -Sh -L #1 > c:\log\t1-o64-4k-rand-read.txt
timeout 10
diskspd.exe -t1 -b4k -r -w0 -o128 -d60 -Sh -L #1 > c:\log\t1-o128-4k-rand-read.txt
timeout 10
diskspd.exe -t1 -b4k -r -w0 -o256 -d60 -Sh -L #1 > c:\log\t1-o256-4k-rand-read.txt
timeout 10
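
The same ladder rolled into a loop, if copy-pasting nine nearly identical lines offends you (a PowerShell sketch with the same flags and paths as above):

foreach ($qd in 1, 2, 4, 8, 16, 32, 64, 128, 256) {
    & .\diskspd.exe -t1 -b4k -r -w0 "-o$qd" -d60 -Sh -L '#1' > "c:\log\t1-o$qd-4k-rand-read.txt"
    Start-Sleep -Seconds 10
}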

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 4k random read (DiskSPD)

Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 4k random read (DiskSPD)
threads=1 threads=2 threads=4 threads=8
IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms)
QD=1 2492 10 0.40 5633 22 0.35 11282 44 0.35 18910 74 0.42
QD=2 5626 22 0.35 11207 44 0.36 18905 74 0.42 43223 169 0.37
QD=4 10986 43 0.36 18880 74 0.42 43516 170 0.37 58555 230 0.54
QD=8 18846 74 0.42 44058 172 0.36 58500 229 0.55 91117 356 0.70
QD=16 44364 173 0.36 59697 233 0.54 109040 426 0.59 170011 664 0.75
QD=32 58398 228 0.55 106900 418 0.60 185946 726 0.69 226463 885 1.12
QD=64 103042 403 0.62 165834 648 0.77 231081 903 1.11 229460 896 2.23
QD=128 159476 623 0.80 226960 887 1.12 226674 885 2.56 232828 909 4.39
QD=256 161282 630 1.58 245380 958 2.08 230685 901 4.44 237039 926 8.63

Performance: Hyper-V Virtual Disk 256GB (RAW) – 4k random read (FIO)

Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 4k random read (FIO)
threads=1 threads=2 threads=4 threads=8
IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms)
QD=1 2425 9 0.40 5541 22 0.35 11119 43 0.35 18602 73 0.42
QD=2 5400 21 0.36 10994 43 0.35 18657 73 0.42 41587 162 0.38
QD=4 10863 42 0.36 18681 73 0.42 41816 163 0.37 60120 235 0.53
QD=8 18455 72 0.42 41947 164 0.37 58510 229 0.54 97073 379 0.65
QD=16 41099 161 0.38 58285 228 0.54 99181 387 0.63 171917 672 0.73
QD=32 59526 233 0.53 101795 398 0.62 174465 682 0.71 221960 867 1.10
QD=64 101346 396 0.61 165704 647 0.75 229765 898 1.02 239446 935 2.04
QD=128 157785 616 0.74 196404 767 1.18 237533 928 1.98 244690 956 4.01
QD=256 173926 679 1.13 218647 854 2.06 238649 932 3.92 244988 957 8.05

Mini-conclusion

Now, let’s try to “interpret” the numbers. The test Hyper-V virtual disk on top of the StarWind virtual device peaks between 221,960 and 231,081 IOPS. The latency, in its turn, varies from 1.02 through 1.12 ms. In our setup, we reach this performance with 4 and 8 I/O threads. Remember our VM has 4 vCPUs? That’s why we used the following test utility launch parameters: threads=4 and Outstanding I/O=64.

Entering (Hyper?)Space

Setting up tools

Now that we know the optimal test utility parameters, let’s set up these utilities and do some real measurements. As usual, we used DiskSPD v2.17 and Fio v3.5 today. We ran the tests under the following patterns:

– 4k random write
– 4k random read
– 64k random write
– 64k random read
– 8k random 70%read/30%write
– 1M sequential read

Now, take a look at the test utilities’ launch parameters under threads=4, Outstanding I/O=64, time=60 sec:

DiskSPD
diskspd.exe -t4 -b4k -r -w100 -o64 -d60 -Sh -L #1 > c:\log\4k-rand-write.txt
timeout 10
diskspd.exe -t4 -b4k -r -w0 -o64 -d60 -Sh -L #1 > c:\log\4k-rand-read.txt
timeout 10
diskspd.exe -t4 -b64k -r -w100 -o64 -d60 -Sh -L #1 > c:\log\64k-rand-write.txt
timeout 10
diskspd.exe -t4 -b64k -r -w0 -o64 -d60 -Sh -L #1 > c:\log\64k-rand-read.txt
timeout 10
diskspd.exe -t4 -b8k -r -w30 -o64 -d60 -Sh -L #1 > c:\log\8k-rand-70read-30write.txt
timeout 10
diskspd.exe -t4 -b1M -s -w0 -o64 -d60 -Sh -L #1 > c:\log\1M-seq-read.txt
FIO
[global]
numjobs=4
iodepth=64
loops=1
time_based
ioengine=windowsaio
direct=1
runtime=60
filename=\\.\PhysicalDrive1

[4k rnd write]
rw=randwrite
bs=4k
stonewall

[4k random read]
rw=randread
bs=4k
stonewall

[64k rnd write]
rw=randwrite
bs=64k
stonewall

[64k random read]
rw=randread
bs=64k
stonewall

[OLTP 8k]
bs=8k
rwmixread=70
rw=randrw
stonewall

[1M seq read]
rw=read
bs=1M
stonewall

Testing 4-node StarWind VSAN cluster performance, now with RDMA “boost” (“booze”?)

By this point, everything should already be considered, calculated, and set up. The bears have stopped hwemkpi each other, set their balalaikas aside, and are impatiently waiting for the testing results (sorry… still drinking vodka, of course!), so let’s jump to the real stuff now and run the performance tests on the “data” VHDX under the mentioned patterns. We started with one VM, cloned it to another host, and measured the total cluster performance. Boring as hell, we know! We kept going until the overall performance reached the saturation point. Note that every VM had its own StarWind virtual device beneath it to keep its private “data” VHDX intended for tests on top of it (%%EDITOR%%: I hwemkpi hate this German guy for his sentences like that!!! I won’t touch this one!!!).

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 4k random write (MB/s)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 4k random read (IOPS)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 4k random read (MB/s)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 64k random write (IOPS)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 64k random write (MB/s)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 64k random read (IOPS)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 64k random read (MB/s)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 8k random 70%read/30%write (IOPS)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 8k random 70%read/30%write (MB/s)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 1M seq read (IOPS)

Performance: Hyper-V Virtual Disk 256GB (RAW) over StarWind virtual device – 1M seq read (MB/s)

Now, let’s check CPU usage. It’s a vital metric for hyper-converged (HCI) environments, and here’s why: with the CPU overwhelmed by I/O, your production VMs won’t have any CPU cycles left for the actual production workload and will perform just as fast as Nutanix does… We mean, they will be getting HWEMKPI SLOW. So, it’s a good idea to keep an eye on CPU utilization. Always!

Just as in the previous part, 20 VMs completely overwhelmed the processor, leaving no free CPU cycles. But, unlike S2D, StarWind knows what NUMA is and how to juggle vodka bottles… balance IOPS!

4k random write 4k random read
DiskSPD FIO DiskSPD FIO
IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms)
1x VM 177665 694 1.44 173648 678 1.40 219539 858 1.17 222999 871 1.04
2x VM 270620 1057 1.92 267878 1046 1.81 389960 1524 1.34 389267 1521 1.18
3x VM 422926 1652 1.84 411585 1608 1.76 572611 2237 1.35 567432 2217 1.20
4x VM 477518 1865 2.16 468635 1831 2.04 679861 2655 1.51 685616 2678 1.29
5x VM 547099 2138 2.35 519350 2029 2.34 805367 3145 1.60 786108 3071 1.44
6x VM 597689 2335 2.59 566775 2214 2.61 913999 3570 1.69 870687 3401 1.59
7x VM 605544 2365 2.98 589840 2304 2.94 1002352 3915 1.79 975060 3809 1.64
8x VM 637843 2490 3.24 623829 2437 3.18 1101700 4302 1.87 1076433 4205 1.71
9x VM 675248 2638 3.44 680838 2660 3.28 1189049 4644 1.95 1095348 4279 1.95
10x VM 694318 2713 3.74 614587 2401 4.15 1289541 5037 2.03 1181806 4616 1.99
11x VM 590147 2306 4.87 605746 2366 4.66 1249911 4882 2.32 1191692 4655 2.21
12x VM 544036 2125 5.76 588898 2300 5.24 1288026 5032 2.45 1237834 4835 2.30
13x VM 658859 2572 5.21 681175 2661 4.85 1420980 5552 2.45 1260059 4922 2.44
14x VM 594762 2323 6.30 708258 2767 5.04 1430015 5586 2.55 1304074 5094 2.57
15x VM 738033 2883 5.81 598029 2336 6.53 1624657 6346 2.45 1356875 5300 2.63
16x VM 631882 2468 6.88 629494 2459 6.79 1542500 6025 2.69 1363315 5326 2.81
17x VM 764657 2987 6.08 607914 2375 7.34 1671774 6530 2.68 1382280 5400 3.00
18x VM 623631 2436 8.37 640978 2504 7.65 1621549 6334 2.92 1394228 5446 3.17
19x VM 580753 2269 8.94 599373 2341 8.13 1580498 6174 3.14 1486086 5805 3.09
20x VM 563615 2202 10.03 640818 2503 8.05 1625181 6348 3.22 1502052 5868 3.23
(all columns: threads=4, Outstanding I/O=64)
64k random write 64k random read
DiskSPD FIO DiskSPD FIO
IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms)
1x VM 30719 1920 8.32 29357 1835 8.71 71925 4495 3.56 68037 4252 3.75
2x VM 35942 2247 14.74 29063 1817 20.01 146732 9170 3.49 137073 8567 3.70
3x VM 49620 3101 18.96 40547 2535 31.57 216787 13549 3.54 205857 12866 3.70
4x VM 53278 3329 25.41 58319 3645 23.23 244405 15276 4.49 247396 15463 4.26
5x VM 58501 3657 30.28 56110 3508 28.11 274774 17172 4.91 257264 16079 5.42
6x VM 60715 3795 28.39 58488 3656 38.34 287040 17940 5.64 259447 16216 6.30
7x VM 56691 3543 35.14 52480 3281 40.80 267516 16719 6.77 258902 16182 7.13
8x VM 64051 4003 38.19 49431 3090 52.24 291110 18194 7.44 281409 17589 7.42
9x VM 66741 4169 42.13 59436 3716 45.23 305519 19094 7.77 299082 18694 8.13
10x VM 66104 4132 40.17 63581 3975 43.44 325308 20392 8.19 280611 17540 9.43
11x VM 69838 4362 43.02 61381 3837 55.29 283300 17705 10.40 260878 16306 11.59
12x VM 58682 3666 56.65 61797 3864 58.24 270717 16924 11.77 273521 17096 11.41
13x VM 67566 4223 54.57 63786 3989 57.77 320598 20037 10.83 296445 18529 11.66
14x VM 69114 4320 58.47 62340 3898 69.26 305537 19096 12.69 316043 19754 11.67
15x VM 86412 5401 50.94 65673 4107 64.54 377051 23566 11.31 282180 17638 13.88
16x VM 69413 4338 63.69 68255 4268 69.41 321642 20103 13.45 293096 18320 14.26
17x VM 77851 4866 60.41 69922 4372 68.83 347779 21736 13.34 288132 18010 15.53
18x VM 77392 4837 65.80 73575 4600 69.55 316601 19788 15.40 301079 18820 15.74
19x VM 76773 4798 69.20 67439 4217 79.54 278377 17399 18.38 265520 16597 19.37
20x VM 79916 4995 68.77 65798 4115 84.51 280275 17517 18.91 298986 18689 17.68
(all columns: threads=4, Outstanding I/O=64)
8k random 70%read/30%write 1M seq read
DiskSPD FIO DiskSPD FIO
IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms) IOPS MB/s Latency (ms)
1x VM 186603 1458 1.37 185455 1449 1.23 4700 4700 54.46 4636 4636 55.11
2x VM 320694 2505 1.63 316839 2475 1.47 8245 8245 63.33 8234 8234 62.82
3x VM 467022 3649 1.67 461037 3602 1.52 11087 11087 71.31 10787 10787 73.78
4x VM 548402 4284 1.86 544844 4257 1.66 12026 12026 93.41 11901 11901 94.38
5x VM 640572 5004 2.01 620714 4849 1.88 13807 13707 95.66 12964 12964 101.06
6x VM 726064 5672 2.13 685582 5356 2.09 14636 14636 106.98 13155 13155 119.48
7x VM 775710 6060 2.32 755371 5902 2.22 14022 14022 135.35 13954 13954 137.19
8x VM 828378 6471 2.50 817673 6388 2.37 14913 14913 137.04 14945 14945 149.29
9x VM 899963 6932 2.59 799647 6247 2.88 15880 15880 152.79 16866 16866 140.90
10x VM 952964 7446 2.74 871426 6808 2.87 16428 16428 174.23 15302 15302 180.34
11x VM 908937 7098 3.18 860388 6722 3.34 14922 14922 212.31 15327 15327 211.85
12x VM 919749 6645 3.44 884467 6910 3.54 14302 14302 226.58 14867 14867 235.34
13x VM 987880 7718 3.42 821790 6421 4.21 15061 15061 252.17 17341 17341 205.60
14x VM 1000869 7819 3.64 791240 6182 4.82 17938 17938 256.29 19034 19034 218.33
15x VM 1129929 8828 3.55 936292 7315 4.18 18856 18856 247.28 15716 15716 277.63
16x VM 1028207 8033 4.06 918017 7172 4.58 15494 15494 355.60 16255 16255 298.75
17x VM 1093492 8543 4.13 949885 7421 4.73 16838 16838 329.99 15982 15982 350.59
18x VM 1085031 8477 4.40 927383 7246 5.21 15952 15952 360.95 16162 16162 335.02
19x VM 1036030 8094 4.80 937047 7321 5.51 14342 14342 382.47 17232 17232 337.11
20x VM 1029678 8044 5.09 899355 7027 6.15 14682 14682 458.68 19350 19350 340.29
(all columns: threads=4, Outstanding I/O=64)

Measuring single Intel SSD DC P3700 2TB performance in Windows Server environment

Now, let’s take a look at the single NVMe drive performance. For that purpose, we ran the same set of tests on a single Intel SSD DC P3700 2TB in a Windows Server 2016 “bare metal” environment. At this step, we got theoretical values for the underlying datastore performance. This gives us kind of a reference to judge whether running StarWind VSAN on that datastore makes or ruins Hyper-V cluster performance.

To avoid turning this study into fcop rocket science, we made several presumptions:

1. All 8 NVMe drives in the datastore are available for reading. So, the overall datastore read performance should be 8x that of a single Intel SSD DC P3700 2TB.

2. All 8 drives are available for writing. Yet, since each block is replicated to the partner disk, the overall performance should be (IOPS-Write-one-disk * N) / 2, where N stands for the number of disks involved in writing (8 for this setup).

3. Under the 8k random 70%read/30%write pattern, performance will be (IOPS-Read-one-disk * N * 0.7) + ((IOPS-Write-one-disk * N * 0.3) / 2) in the best case. Again, N is the number of NVMe drives in the datastore. To double-check the arithmetic, see the sketch right after this list.
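
If you want to double-check that arithmetic without a calculator, here’s a quick PowerShell sketch using the single-drive DiskSPD numbers from the table below:

# Presumptions (1) and (2) applied to the single-drive DiskSPD results
$n = 8           # NVMe drives in the datastore
$read4k = 445540 # single-drive 4k random read IOPS
$write4k = 418860 # single-drive 4k random write IOPS
"4k read ceiling: {0:N0} IOPS" -f ($read4k * $n)       # 3,564,320
"4k write ceiling: {0:N0} IOPS" -f ($write4k * $n / 2) # 1,675,440 – replication halves it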

DiskSPD
Columns: single Intel SSD DC P3700 2TB (Windows Server 2016); theoretical values for 8x Intel SSD DC P3700 2TB; measured maximum (Hyper-V Virtual Disk over StarWind HA IMG); ratio of measured performance to the theoretical value.
IOPS MB/s IOPS MB/s IOPS MB/s %
4k random write 418860 1636 1675440 6544 764657 2987 45.64
4k random read 445540 1740 3564320 13920 1671774 6530 46.90
64k random write 31192 1950 124768 7800 86412 5401 69.26
64k random read 38513 2407 308104 19256 377050 23566 122.38
8k random 70%read/30%write 230318 1800 1844355 7505 1129929 8828 61.26
1M seq read 2548 2548 20384 20384 18856 18856 92.50
(threads=4, Outstanding I/O=64)
Fio
Columns: single Intel SSD DC P3700 2TB (Windows Server 2016); theoretical values for 8x Intel SSD DC P3700 2TB; measured maximum (Hyper-V Virtual Disk over StarWind HA IMG); ratio of measured performance to the theoretical value.
IOPS MB/s IOPS MB/s IOPS MB/s %
4k random write 310104 1211 1240416 4845 708258 2767 57.10
4k random read 422274 1650 3378192 13200 1502052 5868 44.46
64k random write 31205 1950 124820 7802 73575 4600 58.94
64k random read 37069 2317 296552 18536 316043 19754 106.57
8k random 70%read/30%write 193268 1510 1964120 14843 949885 7421 48.36
1M seq read 2500 2500 20000 20000 19350 19350 96.75
(threads=4, Outstanding I/O=64)

Commencing return mission

Since we’re done with the measurements, it’s high time to present you with one more mission report.

Under nearly all patterns, the Hyper-V Virtual Disk over the StarWind virtual device exhibits close-to-linear performance growth with up to 6 VMs running in the cluster. Things change once the 7th VM is created. With StarWind over iSCSI (TCP), we managed to go up to 4 VMs in the cluster, and the 5th one caused a slowdown. And, before we go further, here are the expected performance values under each pattern one more time:

What do we expect? (IOPS)
DiskSPD Fio
4k random write 1675440 1240416
4k random read 3564320 3378192
64k random write 124768 124820
64k random read 308104 296552
8k random 70%read/30%write 1844355 1964120
1M seq read 20384 20000

Now, let’s take a closer look at the performance metrics we got this time and compare some apples to some oranges. The maximum 64K random read performance reached 377K IOPS, or 23 GB/s. We had 310K IOPS (19 GB/s) under the same pattern for StarWind over iSCSI (TCP). Performance under the 1M seq read pattern reached an impressive 19K IOPS (18 GB/s); with StarWind & TCP, we managed to get 18K IOPS (17.5 GB/s). Performance under all these patterns saturated after adding the 7th VM to the cluster – no major growth from adding extra I/O-generating VMs!

As for the other results, everything looks surprisingly good except for some “questionable” patterns. For instance, under 64K random write, VSAN performance reached 86K IOPS (5 GB/s) and stopped growing noticeably once the 7th VM joined the cluster; previously, with TCP, we had 140K IOPS (8.5 GB/s). Under the 8k random 70%read/30%write pattern, 1.1 M IOPS (8.6 GB/s) was the ceiling, while with TCP it was 751K IOPS (5.7 GB/s). Under 4k random write, we reached 764K IOPS (3 GB/s); with TCP under the same 4K random write pattern, it was 539K IOPS (2 GB/s). Performance under these patterns kept growing up to 10 VMs engaged.

The only pattern where performance kept going up even after 20 VMs was 4K random read. There, we reached 1.67 M IOPS (6.5 GB/s) with 17 VMs on board. And… with StarWind over iSCSI (TCP) we had 1.7M IOPS, or 6.6 GB/s. Sure, we could go beyond 20 VMs, but, say, was there a point in doing that? The performance gain became smaller and smaller under this pattern… To make things worse, it stopped growing completely under the other patterns at the same time! So, should we do something else? Let’s just look at the numbers for the performance gain… Initially, while running from 1 through 9 machines in the cluster, the average performance gain was 121K IOPS/VM. Next, from 10 through 20 VMs, we got only 39K IOPS/VM. This time, the CPU also got overwhelmed: its resource was completely consumed by the I/O-generating VMs, leaving nothing for anybody else. Sure, there was proper load balancing, but there was no compute power left in the test. More VMs? Nope! The juice isn’t worth the squeeze!

Now, let’s put everything together. The table below provides the data you need to know about this vodka-powered SDS’s performance. We list the percentage of the theoretical values we could reach with VSAN over iSER (RDMA) and VSAN over iSCSI (TCP).

StarWind/RDMA StarWind/TCP
DiskSPD Fio DiskSPD Fio
4k random write 45% 57% 32% 39%
4k random read 47% 44% 47% 40%
64k random write 69% 59% 113% 112%
64k random read 122% 106% 101% 94%
8k random 70%read/30%write 61% 48% 41% 36%
1M seq read 92% 96% 80% 92%

Green. There’s a major performance “boost”.

Red. There’s a noticeable slowdown.

Black. Nothing changed except some minor fluctuation.

When we were testing StarWind over iSCSI (TCP) during our previous home run, it looked like a Russian bear broke in and started swinging a 10-inch hairy brown fkem in front of some scared dudes with tiny 2-inch ones… However, it’s all in the comparison (you’ll see it all in the next article, where we’ll try to finalize things and pick a winner, because, you know, performance isn’t everything you should care about! Stay tuned!). But… Taking all the results into account and thoroughly analyzing the performance across various patterns over RDMA, we can sum up that the same bear took out a second fkem, charged with RDMA now, just to give a good fggrvjtqcv to everyone (double penetration for S2D?). Sounds weird? Those are hwemkpi drunken bears! What did you expect? Well, FINALLY the fcop Russians did manage to create something exceptionally good. Yes, Russians kicked cuu again! But… We can tell you this game is not about how big your bat is… It’s about how good you are at handling it! Please wait patiently for our upcoming “Summary”, as you’re probably going to be surprised. Soon. Very soon!
