HCI Software performance increases

With many of the hyper-converged platforms on the market, one of the key propositions is that the value of the solution is in the software, with the hardware being basic off-the-shelf components available from just about anyone.  Since all the high-end features are driven exclusively by software, performance should improve with most releases as further optimizations land.

In this particular case, I wanted to put the Choice Nutanix lab gear to the test.  I set up a test workload to see the performance delta working through the last year-plus of major releases. Originally I just threw some pictures on Twitter, but that generated follow-up conversations about how I arrived at the numbers, so I figured a follow-up post was worth the time.

Testing methodology

Specifically, I set up the following test VMs.  The purpose of these VMs was not to prove how fast the Nutanix hardware is, but rather to show the delta in speed.  It's worth noting that I don't claim to be a SQL super guru, but these tests serve my purpose of generating stress against a system.  The settings I used were as follows (a sample invocation is sketched after the table):

VM Name    | VM Size                              | Software Configuration
test-io01  | 12 vCPU (2×6), 64 GB RAM, 200 GB HDD | IOMeter: 12 workers, 150 GB test file, IO queue of 4, full random data, 16 KB, 50% read / 50% random
test-io02  | 2 vCPU (2×1), 4 GB RAM, 40 GB HDD    | IOMeter: 2 workers, 2.5 GB test file, IO queue of 1, repeating data, 8 KB, 25% read / 75% random
test-io03  | 2 vCPU (2×1), 4 GB RAM, 2 TB HDD     | IOMeter: 2 workers, 100 GB test file, IO queue of 2, repeating data, 1 MB, 100% read / 0% random, plus 1 TB of cold data
test-sql01 | 8 vCPU (2×4), 16 GB RAM, 40 GB HDD   | SQLIO -kW -fsequential -t8 -o8 -b8 -LS, 25 GB test file
test-sql02 | 8 vCPU (2×4), 16 GB RAM, 40 GB HDD   | SQLIO -kW -frandom -t8 -o8 -b8 -LS, 25 GB test file
test-sql03 | 6 vCPU (1×6), 24 GB RAM, 40 GB HDD   | SQLIO -kW -fsequential -t8 -o8 -b8 -LS, 25 GB test file
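
For reference, each SQLIO VM boils down to a single command line run against a pre-created test file.  A minimal sketch of the test-sql01 invocation is below; the file path, run duration, and param file name are my assumptions for illustration, not the exact values used in the lab:

    rem sqlio_param.txt - one line per test file: <path> <threads> <affinity mask> <size in MB>
    rem e.g.  E:\sqlio\testfile.dat 8 0x0 25600

    rem 8 threads, 8 outstanding IOs, 8 KB sequential writes, latency reporting, 5-minute run
    sqlio -kW -fsequential -t8 -o8 -b8 -LS -s300 -Fsqlio_param.txt

test-sql02 simply swaps -fsequential for -frandom.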

General testing involved the following on each VM (a rough batch wrapper for steps 2 through 4 is sketched after the list):

  1. Create the test IO file on each machine with a manual run
    1. On the file server, manually copy over 1 TB of data to force data placement outside of the SSD tier
  2. Execute all workloads and wait >20 minutes for steady state to be achieved
    1. After code upgrades, wait until the cluster has finished rebuild mode and is 100% healthy
  3. Restart each workload so counters are based on steady-state performance
  4. Wait ~5 minutes before sampling performance statistics
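
A rough batch wrapper for steps 2 through 4 might look like the sketch below.  It assumes the IOMeter access specs are saved as workload.icf and leans on the stock Windows timeout and taskkill tools; treat it purely as an illustration of the flow rather than the exact tooling used:

    rem kick off the IOMeter workload from a saved configuration (batch mode)
    start "" Iometer.exe /c workload.icf /r warmup_results.csv

    rem step 2: let the cluster settle into steady state (>20 minutes)
    timeout /t 1260

    rem step 3: kill the worker process and relaunch so counters begin at steady state
    taskkill /im Dynamo.exe /f
    start "" Iometer.exe /c workload.icf /r steady_results.csv

    rem step 4: wait ~5 minutes before sampling performance statistics
    timeout /t 300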

All of the code upgrades completed in under 30 minutes.  Because each code upgrade reboots the CVMs one at a time, data redundancy correction needs to occur afterward, and it took under 60 minutes for the extent protection to finish cleaning up.

Baseline: NOS 3.1.3.3


I elected to start with the 3.1 series.  Aggregate performance was around 17K IO/s and 700 MB/s.

[Screenshot: upgrade 01]

The view from the applications:

IO 1: [screenshot upgrade 02]
IO 2: [screenshot upgrade 03]
IO 3: [screenshot upgrade 04]
SQL 1: [screenshot upgrade 06]
SQL 2: [screenshot upgrade 07]
SQL 3: [screenshot upgrade 08]

Upgrade to NOS 3.5.5


The next major release was in the 3.5 series.  This release added a new HTML5 interface, deduplication in flash (as an option), Windows Server 2012 R2 and vSphere 5.5 support, replication compression, and general performance improvements.  Post upgrade, the system was up to 975 MB/s at 25K IO/s.  I'm not 100% sure, but I believe this was the release where the oplog was spread across both SSDs instead of using just one.

[Screenshot: upgrade 09]

The view from the applications:

IO 1: [screenshot upgrade 10]
IO 2: [screenshot upgrade 11]
IO 3: [screenshot upgrade 12]
SQL 1: [screenshot upgrade 13]
SQL 2: [screenshot upgrade 14]
SQL 3: [screenshot upgrade 15]

Upgrade to NOS 4.0.3.1


The next major release was the 4.0 series.  This release added Prism Central support, web-based code upgrades, block-aware data placement, replication factor of 3, capacity-tier deduplication (as an option), and beta AWS replication.  Post upgrade, the speed dropped a little (presumably from additional tracking overhead for one of the new features), down to 24K IO/s and 850 MB/s.

[Screenshot: upgrade 16]

The view from the applications:

IO 1: [screenshot upgrade 17]
IO 2: [screenshot upgrade 18]
IO 3: [screenshot upgrade 19]
SQL 1: [screenshot upgrade 20]
SQL 2: [screenshot upgrade 21]
SQL 3: [screenshot upgrade 22]

Upgrade to NOS 4.1.1.4


The next release was the 4.1 series.  The big changes in this release were data-at-rest encryption, metro availability (synchronous replication), and integrated hypervisor upgrades from Prism.  There were obviously some efficiency improvements as well, as the workload now reached 28K IO/s at 925 MB/s.

[Screenshot: upgrade 23]

The view from the applications:

IO 1: [screenshot upgrade 24]
IO 2: [screenshot upgrade 25]
IO 3: [screenshot upgrade 26]
SQL 1: [screenshot upgrade 27]
SQL 2: [screenshot upgrade 28]
SQL 3: [screenshot upgrade 29]

Upgrade to NOS 4.1.2.1


The current public code as of this writing is 4.1.2.1, which is a maintenance release.  Performance increased slightly over the general 4.1 code.

[Screenshot: upgrade 30]

Summary


In the end, the same physical hardware went from generating 17K IO/s and 700 MB/s to 28K IO/s at 950+ MB/s.  While I forgot to capture screenshots for proof, the Controller VM CPU usage is actually *lower* in NOS 4.1 handling 28K IO/s than it was in NOS 3.1 at 17K IO/s.
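
Put another way, that is roughly a 65% increase in IOPS (28K / 17K ≈ 1.65) and roughly a 36% increase in throughput (950 / 700 ≈ 1.36) out of the same cluster, purely through software updates.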

Not bad for free upgrades.
