VMworld 2012: INF-BCO2655: VMware vSphere Fault Tolerance for Multiprocessor Virtual Machines

Session abstract

This session will describe exciting new developments in implementing VMware vSphere® Fault Tolerance (VMware FT) for multiprocessor virtual machines. This new technology allows continuous availability of multiprocessor virtual machines with literally zero downtime and zero data loss, even surviving server failures, while staying completely transparent to the guest software stack, requiring absolutely no configuration of in-guest software. In this technical preview, we will outline the virtues of VMware FT, provide a detailed look at the new technology enabling VMware FT for multiprocessor virtual machines, offer guidance on how to plan and configure your environments to best deploy these capabilities, examine performance data and showcase a live demo of the technology in action.

The session

This technical preview session covered how VMware plans to get Fault Tolerance working on virtual machines with more than 1 vCPU (1 vCPU is the current limit in ESXi 5.1).

The session started by pointing out that VMware can already offer protection at every level through techniques such as High Availability, (Storage) vMotion, (Storage) DRS and Site Recovery Manager (SRM). The catch is that these can be noticeable: you might see 1 or 2 ping timeouts depending on your environment. With Fault Tolerance on multiprocessor virtual machines, VMware wants to reduce (or remove) this.

VMware calls this idea “Continuous Availability”:

  • Zero downtime.
  • Zero data loss.
  • No loss of TCP connections.
  • Completely transparent to guest software:
    • No dependency on Guest OS applications.
    • No application specific management and learning.

Fault Tolerance was introduced in 2009 with the release of vSphere 4.0, was upgraded with 4.1 in 2010, and received some more improvements with vSphere 5.0. The problem was always that it was limited to 1 vCPU, and therefore many people couldn’t use it on important virtual machines.

Fault Tolerance currently works by using the vLockstep protocol (which keeps both virtual machines in sync) over a dedicated 1GbE network for FT logging. At the moment both virtual machines use shared VMDKs, which reside on your shared storage.

With the new release of Fault Tolerance all this will change:

  • vLockstep is “dead”: it is now replaced with the “SMP FT protocol”.
  • A dedicated 10GbE FT logging network is needed.
  • Both virtual machines now have their own VMDKs.

Basically, when you enable FT on a virtual machine you are creating 2 complete machines, each with its own VMDKs, rather than 2 virtual machines sharing 1 set of VMDKs as in the current release. The VMDKs will be split across separate datastores (according to the demo you can choose which datastore), but there is still one shared datastore, which they call the “Tie Break Datastore”. This acts as a backup system when FT logging fails to do the job it’s supposed to do.
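
The differences between the two designs can be summed up in a small data structure. This is only a sketch based on what was said in the session: the class `FtDesign`, its field names and the helper `network_ok` are made up for illustration and are not part of any VMware API or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FtDesign:
    """Hypothetical summary of an FT design as described in the session."""
    name: str
    sync_protocol: str
    logging_link_gbe: int  # speed of the dedicated FT logging network, in GbE
    shared_vmdks: bool     # True: primary and secondary share the same VMDKs


# Values taken from the session notes above; not an official specification.
CURRENT_FT = FtDesign("uniprocessor FT", "vLockstep", 1, True)
SMP_FT = FtDesign("multiprocessor FT (tech preview)", "SMP FT protocol", 10, False)


def network_ok(design: FtDesign, available_link_gbe: int) -> bool:
    """Return True if the available dedicated link is at least as fast as the design needs."""
    return available_link_gbe >= design.logging_link_gbe


if __name__ == "__main__":
    # Check both designs against an environment that only has a 1GbE FT logging link.
    for design in (CURRENT_FT, SMP_FT):
        print(design.name, "on a 1GbE link:", network_ok(design, 1))
```

With only a 1GbE link available, the check passes for the current design and fails for the multiprocessor one, which is the practical impact of the 10GbE requirement listed above.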

After the technical information the presenter showed a demo using a vCenter virtual machine with 4 vCPUs and 16GB of RAM. He showed information using esxtop and then rebooted the host that was running the primary virtual machine. As expected this worked fine and the machine stayed online without losing its TCP connections.

This was a really great session (and demo) and I am sure that once this is released people will want to use it. The only problem I can see is the need for a 10GbE FT logging network, which isn’t something everyone already has.

Niels Engelen
Works in pre-sales with an interest in anything virtual and cloud. Certified as a VMware Certified Professional 4 and 5. Niels also received the VMware vExpert award (2012-2016) and is an ex-PernixPro.