AHV templating

When staring at an empty Nutanix AHV (or CE) cluster, a lot of first-time customers ask ‘now what?’.  It’s a fair question; you have a clean, new environment, and want to make use of it.

If you’re migrating over existing virtual machines from some other form of infrastructure, you’re typically keeping them as-is, other than installing the basic virt-io drivers.  If you’re looking at this as an opportunity to clean house and upgrade components, this is a good time to build your templates properly, in a manner that lets you repeat the process as often as necessary.

For this example, we’ll start with Windows Server 2016.  It’s fairly new and most people haven’t gotten a system for it worked out yet, so it’s typically a good point to try this out.

I’m demoing this with the evaluation media, but if you have access to the real thing it should work as well.  You’ll also want to download a copy of the Nutanix virt-io driver ISO from the website.

I’m assuming you have either a valid AHV or CE cluster available to use.  My examples are on a single-node CE cluster running on a Skull Canyon NUC.



HCI Software performance increases

With many of the hyper-converged platforms on the market, one of the key propositions is that the value of the solution is around the software, with the hardware being off the shelf basic components available from just about anyone.  Since all the high end features are driven exclusively by software, overall improvements in performance should occur with most releases as further optimizations occur.

In this particular case, I wanted to put the Choice Nutanix lab gear to the test.  I set up a test workload to see the performance delta across the last year-plus of major releases.  Originally I just threw some pictures on Twitter, but that generated follow-up conversations on how I arrived at the numbers I did, so I figured a follow-up post was worth the time.

Specifically, I set up the following:

Testing methodology

The purpose of these VMs was not to prove how fast the Nutanix hardware is, but rather to show the delta in speed between releases.  It’s worth noting that I don’t claim to be a SQL super guru, but these tests serve my purpose well: generating stress against a system.  The settings I used were:

VM Name    | VM Size       | Software Configuration
-----------|---------------|-----------------------
test-io01  | 12 vCPU (2×6) | IOMeter, 12 workers; 150 GB test file; IO queue of 4; full random data; 16KB, 50% read, 50% random
test-io02  | 2 vCPU (2×1)  | IOMeter, 2 workers; 2.5 GB test file; IO queue of 1; repeating data; 8KB, 25% read, 75% random
test-io03  | 2 vCPU (2×1)  | IOMeter, 2 workers; 100 GB test file; IO queue of 2; repeating data; 1MB, 100% read, 0% random; 1TB cold data
test-sql01 | 8 vCPU (2×4)  | SQLIO -kW -fsequential -t8 -o8 -b8 -LS, 25GB test file
test-sql02 | 8 vCPU (2×4)  | SQLIO -kW -frandom -t8 -o8 -b8 -LS, 25GB test file
test-sql03 | 6 vCPU (1×6)  | SQLIO -kW -fsequential -t8 -o8 -b8 -LS, 25GB test file
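
For anyone who wants to reproduce the SQLIO runs, here is a rough Python sketch that assembles the command line for each SQL test VM from the flags listed above.  The flags themselves come straight from the settings; the test file path is a hypothetical placeholder, and I haven't listed a run duration, so none is included here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlioTest:
    """One SQLIO test definition, mirroring the settings listed above."""
    vm_name: str
    access_pattern: str      # "sequential" or "random"
    threads: int = 8         # -t8
    outstanding_io: int = 8  # -o8
    block_kb: int = 8        # -b8


def sqlio_args(test: SqlioTest, test_file: str) -> list:
    """Build the SQLIO argument list for one test VM."""
    return [
        "sqlio",
        "-kW",                       # write test
        f"-f{test.access_pattern}",  # sequential or random access
        f"-t{test.threads}",         # number of worker threads
        f"-o{test.outstanding_io}",  # outstanding I/Os per thread
        f"-b{test.block_kb}",        # I/O size in KB
        "-LS",                       # record latency using the system timer
        test_file,                   # hypothetical path to the 25GB test file
    ]


TESTS = [
    SqlioTest("test-sql01", "sequential"),
    SqlioTest("test-sql02", "random"),
    SqlioTest("test-sql03", "sequential"),
]

if __name__ == "__main__":
    for test in TESTS:
        # "testfile.dat" is a placeholder, not the file used in the original runs.
        print(test.vm_name, " ".join(sqlio_args(test, "testfile.dat")))
```

Keeping the definitions in one list makes it easy to rerun the identical workload after each upgrade, which is the whole point when the number you care about is the delta.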


Removing a node from a Nutanix Cluster

With Nutanix, you have a cluster of nodes that typically represents a pool of storage.  As nodes age out, you might want to pull one of them out of the current cluster, usually to attach it to a different cluster.  It might be to send part of the current cluster to DR for replication, to a nearby data center for metro-cluster, to rotate older nodes out for test/dev, or to pull the node out of service entirely.  Regardless, the process to follow is identical.

Hypervisor preparation

Before pulling a node out, you’ll want to take care of the basics.  Does your hypervisor cluster have enough resources to sustain a node being removed and still meet your N+1/N+2 etc. requirements?  Do you have enough capacity at the storage level to sustain removing a node?  If you can’t answer these questions, you probably shouldn’t proceed past this point.
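
As a back-of-the-envelope way to answer the storage question, here is a minimal Python sketch.  It is a deliberate simplification, not how Nutanix itself calculates rebuild capacity: it assumes `used_tib` is the physical usage (replica copies included), ignores any reserved or overhead space, and treats the worst case as losing the largest surviving nodes.  The function name and all numbers are hypothetical.

```python
def can_remove_node(node_capacity_tib, used_tib, remove_index, failures_to_tolerate=1):
    """Return True if the data still fits after removing one node and then
    losing `failures_to_tolerate` more nodes (N+1 -> 1, N+2 -> 2).

    node_capacity_tib: physical capacity of each node, in TiB
    used_tib: physical space currently consumed across the cluster, in TiB
    remove_index: position of the node being pulled out
    """
    remaining = [cap for i, cap in enumerate(node_capacity_tib) if i != remove_index]

    # Worst case: the largest surviving nodes are the ones that fail next.
    survivors = sorted(remaining)
    if failures_to_tolerate:
        survivors = survivors[:-failures_to_tolerate]

    return used_tib <= sum(survivors)


if __name__ == "__main__":
    # Hypothetical four-node cluster with 10 TiB per node and 18 TiB used.
    print(can_remove_node([10, 10, 10, 10], used_tib=18, remove_index=0))
```

If this returns False for your numbers, free up space or add capacity before going any further.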

You’ll want to evacuate any VMs / resources from the host (don’t forget templates and such that may not be moved by DRS).  Evict the host from vCenter or Virtual Machine Manager control so that, from a hypervisor management standpoint, the unit is standalone.  At this point it’s still an active part of the Nutanix storage cluster.  For vSphere, you have to shut down the controller VM temporarily to put the host in maintenance mode.  Once the host is removed from vCenter, take it out of maintenance mode and power the CVM back on, then wait for the cluster to hit steady state again before you proceed.

On the main homepage, you’ll want to see this before you proceed



Beware the crutch

A discussion point

I get to work on projects deploying new infrastructure fairly regularly, where the topic of “best practices” is often cited by both vendor engineers and customers, normally with well-meaning intentions.  Where it becomes dangerous is when those citing it don’t understand why it’s a best practice, or how it being “best” is measured.  Before I explain why I think this is dangerous, let’s lay out what this means.

http://dictionary.reference.com (because if it’s on the internet, it must be true, right?)


The problem

Generally speaking, these definitions seem to describe the mental model that people have conceived of how a best practice works.  Wikipedia’s definition even calls out my exact gripe, which I’ve highlighted in orange.

At least in my little part of the IT realm, vendor best practice documents are also typically CYA material.  They are as much “you’re least likely to screw this up and/or complain” as they are truly a good idea, which is understandable when you pay attention to who does the writing and who the target audience is.  That this happens, while annoying, makes sense.

The issue arises when people don’t want to think through a design or try to understand the complicated portions of an install.  Rather than using brain power, they simply cite “best practices” as the holy grail of how to do something.  Or worse, they make a choice that is being questioned and they hide behind the “best practice” like it’s a magical shield.

If someone asks you why you’re putting virtual machine disks on RAID5 with FAST Cache on an EMC VNX, you should be able to explain how caching offsets the RAID5 write penalty.  If someone asks why you only have two 10Gb NICs in Hyper-V, you should be able to explain how you use multiple virtual adapters and QoS to maintain traffic behavior.  If you’re installing XenDesktop using PVS and deploying the virtual machines on vSphere, you should be able to explain how VMXNET3 offers lower overhead by being paravirtualized compared to the E1000.
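
To make the RAID5 example concrete, here is a minimal Python sketch of the usual rule-of-thumb arithmetic: each front-end write on RAID5 costs four back-end I/Os (read data, read parity, write data, write parity), while a read costs one.  The cache model is my own crude assumption for illustration, treating a fixed fraction of I/O as never reaching the spindles; it is not how any particular array's cache actually behaves, and the numbers are hypothetical.

```python
RAID5_WRITE_PENALTY = 4  # read data, read parity, write data, write parity


def frontend_iops(backend_iops, write_fraction, cache_hit_fraction=0.0,
                  write_penalty=RAID5_WRITE_PENALTY):
    """Estimate the host-visible IOPS a RAID group can sustain.

    backend_iops: raw IOPS the disks can deliver
    write_fraction: share of host I/O that is writes (0..1)
    cache_hit_fraction: share of host I/O absorbed by cache (0..1, assumed model)
    """
    if not 0 <= write_fraction <= 1:
        raise ValueError("write_fraction must be between 0 and 1")
    if not 0 <= cache_hit_fraction < 1:
        raise ValueError("cache_hit_fraction must be at least 0 and below 1")

    read_fraction = 1 - write_fraction
    # Back-end I/Os generated by one host I/O that misses the cache.
    cost_per_miss = read_fraction + write_penalty * write_fraction
    return backend_iops / ((1 - cache_hit_fraction) * cost_per_miss)


if __name__ == "__main__":
    # Hypothetical: 1000 back-end IOPS, 50% writes, with and without cache.
    print(frontend_iops(1000, 0.5))
    print(frontend_iops(1000, 0.5, cache_hit_fraction=0.5))
```

Being able to walk through something like this is the difference between knowing the why and just citing the document.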

There is nothing wrong with using best practices.  But if you can’t explain the why, you’re just using them as a crutch.  Then you’re at risk of someone kicking that crutch out from underneath you, either accidentally or purposefully, and you end up looking like this guy

Remember, a “best practice” is just the most common option.  If it was the only valid option, it wouldn’t be a best practice: that’s what they call a requirement.

When the basics just won’t do

The Problem

Things break in IT.  This is a fact, just something you learn to live with.  Often, figuring out what broke (and getting details about how badly it’s broken) is as simple as reading the appropriate log file or Windows event log.

In the case of XenDesktop, I ran into an issue that I think is worth sharing.  There is a client I interact with on a regular basis, and they were running into an issue.  Their primary MCS random pool had sessions that worked fine normally, but occasionally users whose session had been running fine for hours would fail to reconnect after a purposeful disconnect.  While a lot of potential items were in play (end point, hypervisor, XD code, etc.), nothing lined up properly and there didn’t appear to be any reliable repeatability.  The only particular commonality seemed to be that the issue only affected reconnections, not initial session startup.  Some days, no issues would pop up; other days it would happen to more than one person.  In a word: weird.

After talking to Citrix support, they essentially didn’t see anything wrong with the configuration (not a huge shock) and decided it sounded like a network problem.  As seems to be the usual outcome in these cases, they were innocent until proven guilty.  If we couldn’t catch it live so they could run a CDF trace… basically you’re on your own. Since this often affected overnight users when no one was around… Yay!

In the past I’d read through http://blogs.citrix.com/2012/07/23/troubleshooting-xendesktop-brokering-process-2/ but never had a chance to try it out.  Since nothing overly useful was normally visible in the logs, this seemed worth a shot.  If I browsed through the connection broker log using PowerShell, I’d often find a ‘ConnectionTimeout’ entry that accompanied the failed reconnect, but no explanation as to why.
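
If you’d rather sift through the broker log offline, here is a small Python sketch that pulls the ‘ConnectionTimeout’ rows out of a CSV export of that log.  The column names (`ConnectionFailureReason`, `BrokeringTime`, `MachineName`) and the sample rows are assumptions for illustration; check them against whatever your own export actually contains.

```python
import csv
import io


def timeout_entries(csv_text, reason_column="ConnectionFailureReason"):
    """Return the rows whose failure reason is 'ConnectionTimeout'.

    csv_text: contents of a CSV export of the connection broker log
    reason_column: name of the failure-reason column (assumed, adjust as needed)
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row.get(reason_column) == "ConnectionTimeout"]


if __name__ == "__main__":
    # Hypothetical sample export, not real log data.
    sample = (
        "BrokeringTime,MachineName,ConnectionFailureReason\n"
        "2014-01-01 02:10,DOMAIN\\VDI-001,None\n"
        "2014-01-01 03:25,DOMAIN\\VDI-002,ConnectionTimeout\n"
    )
    for row in timeout_entries(sample):
        print(row["BrokeringTime"], row["MachineName"])
```

This only tells you when and where the timeouts happened, which is exactly the limitation I ran into: the entry is there, but the reason isn’t.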

Creating the tenant

In the previous steps, a tenant cloud was created and the resources exist to be assigned, but the tenant user provisioning hasn’t occurred yet.

The basic steps we’re going to take are:

  1. Create some basic VMs to convert to template VHDX
  2. Create appropriate VM templates to size the machine and allow the user to deploy from it
  3. Create a user role appropriate for this tenant and grant it access to resources

Create the template components

The rapid needs user will need Windows 8.1 and 2012 R2 templates made available.  In the prior steps the ISO was simply copied around to the systems for installation; at this point we’d like to take advantage of sharing the ISOs from the VMM library.

First, each Hyper-V host needs to be delegated access to the VMM library servers for CIFS.  This is done through Active Directory, on the properties of the computer object.

Our VMM service is already (from the way we went through the installer) running as a domain service account, so we don’t need to change that.  Finally, all the Hyper-V computer accounts were put into a group and granted read access to the VMM library shares.

Using VMM, two template machines were created with basic specs, and Windows was installed on each VM.  Each VM was shut down at the configuration screen prior to making any local changes.  A new VM template was created from each, and the sample VM was destroyed when it got imported into VMM as a template.

For these tenant-specific systems, the configuration was tweaked to pre-specify the network they can see, and the Hyper-V compatibility flag was set so they will only be provisioned on Hyper-V hosts.

Create the user role

Finally, we’re ready to create the user role.  A group was created to assign rights through.

Access was scoped specifically to the cloud in question and a quota established.

Next, the specific network that the tenant is being restricted to is allocated, and the pre-existing resources in the cloud (including some VMs that were made manually earlier) are specified as resources the user can interact with.

Finally, we specify permissions specific to this particular cloud (global has none specified).

The tenant administrator account is now able to log in and provision a copy of either template.  The templates as-is do not auto-configure, install any roles, etc., but that’s fine in this instance.  Smarter templates and more full-fledged services will be created down the road to replace the basic templates.