Disaster recovery: the Zerto vision for a virtual world

Several times in my career I have faced a system issue that needed a "time machine" to solve. Usually a restore from backup was enough, provided a backup had been scheduled. In other situations, if and when the company I worked for could afford a traditional DR solution, I was lucky enough to rebuild the system from its mirror. Many times, though, I had to rebuild the whole system from scratch, losing data that had not been considered critical enough to back up.


Then virtualization came. Snapshots were a godsend: the main thing was to schedule them, and even if they could have an impact on performance, the pros far outweighed the cons.

After some time, when virtualization became the norm rather than the exception, snapshots started to be annoying. In recent years a backup was no longer enough for a critical production system – too many hours of data could be lost. And snapshots were no longer adequate either: they were either taken frequently, with an impact on performance, or only occasionally, between one backup and the next, in which case a lot of data could be lost.

At VMworld 2012 I visited Zerto's booth. I was amazed by their solution: it was almost synchronous, not snapshot-based, needed no dedicated alternative datacenter (either hot or warm) and, above all, it was hypervisor-based.


Over the years the product has evolved until, today, it even covers hybrid-hypervisor environments. So it is no longer just hardware agnostic, but hypervisor agnostic too.

We recently upgraded to the latest version, 4.0, a major release.

Before reviewing the new version, I have to say that the upgrade process took very little time and very few operations – it was a real walk in the park even in a complex environment like ours.

First of all, the architecture. We'll be talking about VRAs (Virtual Replication Appliances), the ZVM (Zerto Virtual Manager), the ZCM (Zerto Cloud Manager) and VPGs (Virtual Protection Groups), just to name a few acronyms.

At the lowest layer are the VRAs: small Linux appliances, one per hypervisor host, that take care of every single protected VM – its pointers, its network data, where its storage lives, and so on.

Just above sits the ZVM: it manages all the VRAs, applies the user's settings through an intuitive (and nice) GUI, and connects to its paired ZVM at the recovery site.


For cloud providers, or simply in environments where more than one ZVM is needed, you'll have to install a ZCM: an orchestrator for the ZVMs and, above all, a tool that provides cloud multi-tenancy. It exposes a self-service interface for the end user (the ZSSP) and handles the installation of the Cloud Connector – a proxy that acts as the paired ZVM for the customer while guaranteeing complete privacy, showing only that customer's organization and keeping all the other tenants separated.
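Just to keep the acronyms straight, here is a minimal sketch of how the pieces relate to each other. It is purely illustrative: the class and field names are mine, not Zerto's object model or API.

```python
from dataclasses import dataclass, field

# Illustrative model only: class and field names are mine, not Zerto's API.

@dataclass
class VRA:
    """Virtual Replication Appliance: one small Linux VM per hypervisor host."""
    host: str
    protected_vms: list = field(default_factory=list)

@dataclass
class VPG:
    """Virtual Protection Group: VMs that are protected (and recovered) together."""
    name: str
    vms: list
    journal_history_hours: int = 24   # how far back you can rewind

@dataclass
class ZVM:
    """Zerto Virtual Manager: one per site, paired with the recovery-site ZVM."""
    site: str
    vras: list
    vpgs: list
    paired_zvm: "ZVM | None" = None

# A ZCM would sit above several ZVMs, adding multi-tenancy and the ZSSP
# self-service portal, with a Cloud Connector acting as the customer-facing pair.
protected = ZVM("production", vras=[VRA("esxi-01", ["web01", "db01"])],
                vpgs=[VPG("crm", ["web01", "db01"])])
recovery = ZVM("dr-site", vras=[VRA("esxi-dr-01")], vpgs=[])
protected.paired_zvm = recovery
```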


Some of the benefits we have experienced over these years, and not just "marketing words":

  • RTO of minutes (hence the name Zerto – Zero RTO), RPO of seconds
  • Native multitenant architecture
  • HW agnostic
  • Full vCloud Director compatibility and integration
  • Zero impact on production VMs
  • Full automation of failover, failback and test
  • Partial recovery

The main difference I found compared with the other solutions on the market was the RTO/RPO timing (low enough to call it not only DR but BC too) and the absence of snapshots, thanks to a journal in which every modification written to a protected vDisk is reported to the replicated one and consolidated, one by one, once it falls outside the chosen history window.
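To picture the journal mechanism, here is my own rough simplification (not Zerto's implementation): every write to a protected vDisk is shipped to a journal at the recovery site, and only when an entry falls outside the configured history window is it consolidated into the replica, so any point in time inside the window can be rebuilt without a single snapshot.

```python
import time

# My own simplification of journal-based replication, not Zerto's implementation.

class JournaledReplica:
    def __init__(self, history_seconds):
        self.history_seconds = history_seconds
        self.journal = []   # (timestamp, block_id, data), oldest first
        self.replica = {}   # consolidated copy of the protected vDisk

    def ship_write(self, block_id, data, ts=None):
        """Called (near-synchronously) for every write on the protected vDisk."""
        self.journal.append((ts or time.time(), block_id, data))

    def consolidate(self, now=None):
        """Fold entries older than the history window into the replica disk."""
        now = now or time.time()
        still_in_window = []
        for ts, block_id, data in self.journal:
            if now - ts > self.history_seconds:
                self.replica[block_id] = data    # oldest writes become the baseline
            else:
                still_in_window.append((ts, block_id, data))
        self.journal = still_in_window

    def rewind(self, point_in_time):
        """Rebuild the disk image as it was at any point inside the window."""
        image = dict(self.replica)
        for ts, block_id, data in self.journal:
            if ts <= point_in_time:
                image[block_id] = data
        return image
```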

Compared to a traditional solution the difference is even wider: the possibility to recover a single VM rather than a whole storage volume, no need for a mirrored datacenter, and the resulting cost savings.

Now, what this solution is missing, from my point of view:

  • Since version 3.5 Zerto has offered extended recovery, a middle way between DR and backup. I'd like to reduce the number of vendors for many reasons, so if backup could be covered by Zerto I would be satisfied. But today we can't use it for backup for two main reasons: it doesn't deduplicate, and restore from backup isn't integrated into vCloud Director the way DR recovery is, so the customer wouldn't be autonomous.
  • From a CSP point of view, another missing feature is the possibility to connect to customers residing on another CSP.
  • It is a virtual-only solution, though this is not a limitation for our needs.
  • It does not replicate IDE disks, and some virtual network appliances are based on IDE disks – WatchGuard is one of them.
  • It doesn't replicate vShield Edge rules, neither the org ones nor the vApp ones. The first case can be worked around by replicating those rules by hand before any disaster occurs, but the vApp ones cannot, since the vApp is created only when a failover starts.

My opinion, anyway, is that this is a real enterprise solution, where the benefits far outweigh the disadvantages. A special mention goes to the support team. They are highly professional, responsive at any hour of the day (and night) and, if a WebEx is needed, they won't leave you until the issue is solved. No sentences like "I'll call you back once my senior engineer has evaluated the issue": that senior engineer will be online and ready for your issue.

Be aware, though, that Zerto is a great tool, but it is a TOOL, not a DR plan. You need a plan that defines, in an objective way, when to declare a disaster and which procedures must be followed; Zerto will be one part of that plan. Don't make the mistake some of our customers did.

I'm sure I have forgotten some important feature – I hope Zerto will forgive me. The comments section below welcomes any suggestions and corrections.

V2D: VMware Validated Designs

In the past few days I've been wondering how hard it would be to build a new datacenter from scratch. I was thinking about the lower layer, the hardware, but even more about the upper layers: the virtualization platform and, above it, all the related services.


I assumed that the hardware part could be addressed more or less easily thanks to plenty of reference models, from vendor designs to best practices, since it is a relatively static environment – yes, new technologies appear every day, but they are often improvements of existing ones. The software layer, however, looked like the real challenge: all the mutual implications in terms of compatibility between all the versions made me think this would be the hardest part to get right.

Assuming VMware as the virtualization platform, hardware isn't hard to choose, as long as it is on the HCL.
I'm talking only about the "build your own" case; this doesn't apply to converged or hyperconverged infrastructures.
Once all the servers, storage and network devices are cabled together, the first requirement is simple: I'd like the latest version of vSphere. Even that is easy. The difficulty comes afterwards: which versions of the surrounding components will be compatible? Until now we had several matrices to consult, and the only way to be sure everything worked together was to set up a test environment.


Then the guys at VMware had an idea: set up those test environments for us. And V2D was born.
Once you – or your vendor, or your hosting provider – complete the hardware layer, you can pick a version of vSphere (the latest, but not necessarily) and, thanks to VMware Software Manager, a free tool, you get as output the versions of all the related components that are compatible with it. It doesn't simply read and digest all those matrices for you: through a nice GUI it provides the list and the download links, together with new-release detection and, last but not least, the integrity check that is usually skipped.
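Conceptually, the lookup it performs is something like the sketch below. The product names and version numbers are placeholders I made up for illustration, not VMware's actual interoperability data.

```python
# Toy illustration of an interoperability lookup; the product names and
# versions below are placeholders, not VMware's actual matrix.
INTEROP_MATRIX = {
    "vSphere X.0": {"vCenter Server": "X.0 U1", "NSX": "Y.2", "vROps": "Z.1"},
    "vSphere X.5": {"vCenter Server": "X.5 U3", "NSX": "Y.1", "vROps": "Z.0"},
}

def compatible_stack(vsphere_version):
    """Return the component versions validated against the chosen vSphere release."""
    try:
        return INTEROP_MATRIX[vsphere_version]
    except KeyError:
        raise ValueError(f"no validated stack recorded for {vsphere_version!r}")

print(compatible_stack("vSphere X.0"))
```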

But this is only the first step, just automation. The real topic of this post is the availability of templates, or blueprints, called V2D.
These were widely covered in several sessions at the last VMworld (SDDC5440, SDDC5609) and in the Italian VMUG session by Andrea Siviero, in addition to a post by Kelly Dare. A nice video can also be viewed here.

These designs are the result of VMware test environments, guaranteeing that all the operational procedures, the architecture and the application versions are compatible and work correctly with one another.
Think of the designs as models, core structures. The core is designed for you; on top of it you can choose to apply services, not necessarily VMware-branded (I'm thinking of Zerto for DR, for example, or Veeam for backup), deployed following those alternative vendors' setup guides.

Currently, two of these templates are available.

It's a kind of statement from VMware: "if you follow this template, you have my assurance that it works, and it works well".

That's not all: whenever you need an update or upgrade, the designs themselves are updated too, so you keep the same compliance as long as you follow that architecture.

That said, being helped by a VMware engineer isn't a bad idea: you get a double check of the design and the removal of any mistake you might make. But you no longer need a full team as you did before, and you can upgrade or update on your own while keeping VMware compliance.

I will come back to this topic, since it is too broad to be covered in a single post. For now, I just wanted to pin down some useful links.

Tintri: the VM-Aware Storage

We are used to thinking of storage as monolithic entities: black boxes, inaccessible (only support can get in), managed by hardcore specialists.
The Tintri concept is somewhat different.

Let's start by saying that this is a virtual-only storage: don't even think about it for physical workloads, it would lose all its charm.
I'll talk about the hybrid solution; the T880 is the top device among Tintri's hybrid arrays.

What's different from the others? The magic inside the (white – posh!) box. In practice it's simply a mix of SAS and SSD disks, so what? So, the software is the difference.
Some of you might say: "Well, then why ship a complete box instead of just the software?" Fair point. The reason they decided to bundle hardware and software is "compliance": offering their own mix of hardware with their software lets them state exactly what the performance will be, guarantee there are no internal conflicts and, last but not least, provide a single point of support.
Let’s have a look at the benefits.
I think the main benefit is performance. Anyone used to traditional storage will be amazed: the latencies and IOPS are something traditional storage simply cannot deliver.
Setup – racking a box takes no more than 30 minutes, the console configuration 5, the remote configuration 10. In less than an hour you'll have your new datastore up and running.
Dedup and compression – somewhere between 1.5x and 2x in the real world, performed inline with no loss of performance.
QoS – one of the reasons for the title. Tintri allows QoS per single VM: if you need just one VM within a group to be fast, it is enough to work directly on that VM, and should you decide that speed is no longer necessary, you simply move the QoS sliders back. This also means VIP customers and new marketing levers, if you're a CSP like the company I work for.
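A rough sketch of what per-VM QoS amounts to, with hypothetical names of my own (not Tintri's API):

```python
from dataclasses import dataclass

# Hypothetical sketch of per-VM QoS; names are mine, not Tintri's API.

@dataclass
class VmQos:
    vm_name: str
    min_iops: int = 0   # guaranteed floor
    max_iops: int = 0   # throttle ceiling, 0 = unlimited

class VmAwareDatastore:
    def __init__(self):
        self.qos = {}   # per-VM policies, keyed by VM name

    def set_qos(self, vm_name, min_iops=0, max_iops=0):
        """Tune one VM without touching LUNs, volumes or the VMs around it."""
        self.qos[vm_name] = VmQos(vm_name, min_iops, max_iops)

    def clear_qos(self, vm_name):
        """Speed no longer needed: just drop this VM's policy."""
        self.qos.pop(vm_name, None)

store = VmAwareDatastore()
store.set_qos("vip-customer-db", min_iops=5000)   # only this VM gets the floor
store.clear_qos("vip-customer-db")                # and only this VM is touched
```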

 


TGC, a.k.a. Tintri Global Center – an orchestrator of all the features of all the VMstores installed. It is not only a single pane of glass to control the storage, but also a tool for better global analytics, with an overview of all the VMs and the overall performance of the whole system.
Storage management becomes unbelievably easier: all the time previously spent managing LUNs and inflexible blocks can be allocated elsewhere.
Footprint – a T880 fills 4U, its power consumption is roughly that of a hairdryer, and its cooling needs are more or less those of a couple of blades.
Now the unpopular notes. If you are less than 80% virtualized, it's not the best choice. If your budget is quite limited, you should look at traditional, even if slower, products. If you need performance, have no budget limits and your environment is fairly stable, look at all-flash arrays.

Finally, I'd like to spend a few words on the quality of support. During our PoC (which I strongly advise you to ask for) we had an issue with a disk – of course, the unit wasn't new. I was impressed by how little time elapsed between the ticket being opened, the call and WebEx we received, and the solution.

All in a flash, which is what "Tintri" means in old Irish. Good job guys, and "keep it simple"!

Hypervisor convergence: Nutanix paradigm

This morning I had the pleasure of attending a general demo by Alberto Filisetti about what Nutanix offers in the hyperconvergence market.
Nutanix bases its datacenter on "building blocks", each of which simply includes some servers and some storage, both standard, all managed by a "magic" piece of software.
Their team had the idea of bringing this model to the enterprise virtualization market. The idea was so successful that they raised about 400 million dollars across four funding rounds, and they are going public shortly.
All this happened in the USA. Nutanix landed in Europe in 2012 and was quickly successful there too. The VMware VSAN concept existed in an embryonic state inside Nutanix almost five years before VMware's product (note that EMC was VMware's main shareholder).
Today Nutanix holds more than 50% of the hyperconverged market, and in the Gartner Magic Quadrant it sits in the Leaders quadrant – high on vision and high on ability to execute.
The Nutanix "block" includes very standard computing hardware (Supermicro, Lenovo, Dell) managed by the real added value: the management and orchestration software, also reachable via API, on which Nutanix guarantees reliability – although perhaps "orchestrator" is not entirely the right word. Why standard hardware? For predictability of performance and to ensure support from well-known vendors.
The reason for this choice: latency is high when data resides on non-local storage. With this solution, all reads and writes happen locally.
The base unit, then, is a 2U "block" containing two or four servers, or "nodes". This way a block can be up and running just 20 minutes after being mounted in the rack.
As written above, inside a block you'll find not only compute but also capacity: traditional disks and SSDs aggregated into a single file system, connected through a 10Gbps ToR switch.
On every node runs a Nutanix Controller VM (CVM), the real brain of the whole system. It resides on SSD and hosts the Nutanix software, serving all the I/O operations for the hypervisor and for all the VMs running on that host. It is hypervisor agnostic, since it sits at a higher level.
Among the hypervisors, KVM is one of the least used because it is hard to manage. Nutanix has a solution for this case too: they developed a hypervisor based on KVM, Acropolis, that simplifies its management. Just a note: Acropolis is only an option, it isn't mandatory for the system.
You'll find a very good and up-to-date resource covering the whole Nutanix architecture in the Nutanix Bible.
All the nodes are independent – from three up to as many as you like – and can be added one at a time. Every node adds not only compute (CPU and RAM) but also storage and I/O, since each one brings its own controller. Compared to external storage, then, these nodes have the advantage of growing in width, whereas an external array keeps its original controllers as a funnel.
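To make the "growing in width" point concrete, here is a back-of-the-envelope comparison with made-up per-node figures (purely illustrative, not Nutanix specs):

```python
# Back-of-the-envelope comparison with made-up per-node figures: scale-out
# (a controller in every node) versus an external dual-controller array.
NODE = {"cores": 24, "ram_gb": 256, "capacity_tb": 20, "iops": 15_000}

def scale_out(nodes):
    """Every added node brings CPU, RAM, capacity *and* another controller."""
    return {k: v * nodes for k, v in NODE.items()}

def scale_up(nodes, array_iops=60_000):
    """External array: capacity grows, but I/O stays behind the same controllers."""
    totals = {k: v * nodes for k, v in NODE.items() if k != "iops"}
    totals["iops"] = array_iops   # the original controllers remain the funnel
    return totals

for n in (3, 8, 16):
    print(n, "nodes:", scale_out(n)["iops"], "vs", scale_up(n)["iops"], "IOPS")
```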
Geographical and multi-directional DR is also included, with a minimum RPO of one hour; for BC, latency between the sites cannot exceed 5 ms, or you can use a VMware stretched cluster.
The main intention of Nutanix is to make the datacenter invisible: first making storage invisible and, eventually, making the whole virtualization and cloud layer invisible too.
The first step, hiding storage, is achieved thanks to a fabric (Lenovo) that distributes the file system and protects it. At the beginning it was mainly used for VDI solutions because of the low latencies; today VDI represents only 30% of the business, thanks also to application certifications (e.g. SAP, Splunk, MS Exchange).
Finally, virtualization and cloud invisibility: Acropolis acts as a higher layer above the hypervisors, while Prism manages all the hypervisors' controllers and even offers per-VM management, though in this case only for the servers running Acropolis. The wonder of this application is the one-click operation to add any heterogeneous hypervisor!
The same platform is also released as a Community Edition, of course without support.
Now, to sum up what's inside the block: independent servers connected through two to four 10Gb ports, plus redundant and management ports. Adding a node means adding resources, both compute and storage (growth in three directions). This growth can also happen with different hardware, in order to protect older investments.
Every software upgrade (for bug fixes, for customer requests and suggestions, or for better performance, usually 20 to 80% more) is performed with one click, with no interruptions and no vMotions, because the upgrade is applied to one CVM at a time using redundant automation. Alternatively it can be done manually, by uploading an image that is then processed by the same automation.
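The rolling, non-disruptive side of that upgrade can be pictured roughly like this – my simplification of the idea, not Nutanix's actual upgrade logic:

```python
# Rough sketch of a rolling, one-CVM-at-a-time upgrade; my simplification,
# not Nutanix's actual upgrade logic.
def rolling_upgrade(cvms, new_version, log=print):
    for cvm in cvms:
        log(f"redirecting I/O away from {cvm} to the remaining controllers")
        log(f"upgrading {cvm} to {new_version} and health-checking it")
        log(f"restoring the I/O path through {cvm}")
        # Only now does the next controller start, so the cluster never loses
        # more than one CVM at a time and user VMs keep running untouched.

rolling_upgrade(["cvm-1", "cvm-2", "cvm-3"], "new-release")
```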
Let's have a look at the data flow, e.g. in a three-node system: ESXi writes to the traditional local disks and the data is replicated to the neighbouring nodes two or three times according to the configured policy, with very low latency. If and when a disk fails, the metadata rebuilds the data with no management intervention and no need to wait for a full rebuild.
Reads take place on SSD and, in case of a vMotion, the data is re-localized on the new server. This data locality allows 15-20k IOPS.
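In very simplified terms (again my own sketch, not Nutanix code), that write path with a configurable replication factor looks something like this:

```python
import random

# Simplified picture of the write path; my sketch, not Nutanix code.

def write_block(local_node, all_nodes, block, replication_factor=2):
    """Write locally first (data locality), then copy to RF-1 neighbour nodes."""
    placements = [local_node]   # the local copy keeps later reads fast
    neighbours = [n for n in all_nodes if n != local_node]
    placements += random.sample(neighbours, replication_factor - 1)
    return placements           # where the block now lives

nodes = ["node-1", "node-2", "node-3"]
print(write_block("node-1", nodes, b"...", replication_factor=2))
# If a disk on node-1 fails, the neighbour copies are used to rebuild the data
# transparently; after a vMotion, hot data is simply re-localized on the new host.
```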
Conclusions – why choose Nutanix:
  1. size, which means low power consumption and little rack space;
  2. time to install;
  3. fast and easy upgrades;
  4. data protection at two or three levels and, in the DR case, geographically;
  5. highly predictable performance;
  6. hypervisor agnostic, managed through a single GUI;
  7. highly specialized and professional support, covering not only the software but also the underlying hardware.
A few more considerations: should we need more storage but no more compute, they provide a so-called "passive node": very low CPU and RAM, with high-capacity, low-performance storage, attached to the system over the network.
Other hardware can coexist, although in that case it won't benefit from all the Nutanix software optimizations.
Finally, I would underline once more that a cluster can be made up of different hypervisors – one node on KVM, one on ESXi, one on Hyper-V – all controlled by Prism.
I can’t wait to have a test unit to verify all this!

VCAP or VCIX?

During the last VMUG UserCon in Milan I had the pleasure to meet Andy Cary, VCI Program Enablement Lead at VMware.
Since I began studying for my VCAP-CIA months ago and it "disappeared" in the meantime, I was wondering which was the better option: switching to VCAP-DCV or waiting for VCIX (still DCV, since the Cloud path no longer includes vCloud Director – vRA took its place, and I'm really not comfortable with it).


His advice was to go ahead with VCAP, for two reasons:

  • first of all, VCIX will not see the light before the new year, and even then it will go through a testing/beta period;
  • second, if you pass even just one of the two VCAP exams (DC Administration or DC Design) you'll earn a certification. That exam will be converted into "half" of VCIX when it is released, so no time will be wasted: you'll only have to pass the other exam to earn VCIX.

So, back to my old lab for VCAP-DCA; maybe I'll invest some money in Ravello Systems. There is no chance of running anything at home, if I care about my marriage. I could borrow some unused gear from our datacenter, but that would waste even more time.


I had the chance to test Ravello Systems by deploying AutoLab, and I must say they're amazing: I needed support for a small problem and had it solved in minutes.

Moreover, there are tons of study guides on the net; in my opinion the most complete among the free ones is Paul Grevink's VDCA550 Study Guide, plus Chris Wahl's study sheet.

But the imperative is: lab, lab, lab – and don't forget the blueprint!

VMUG Italia UserCon 2015 feelings

On November 12th the most important Italian VMware user event was held in Milan.


It was the third one, and the most successful of the three.
It is especially dear to me because I gave my first public talk on such an important occasion.
But I certainly wasn't the best part: the VMUG IT team did an awesome job.
They managed to invite people like (I'm following the list from http://www.vmug.it/):

Chad Sakac – President, Global Systems Engineering at EMC – Blogger at virtualgeek.typepad.com

Vaughn Stewart – VP, Enterprise Architect at Pure Storage – Blogger at vaughnstewart.com

Calvin Zito  – Also known as the “HP Storage Guy” or the “Go to guy for all things HP Storage” – Blogger at Around the Storage Block

Duncan Epping – Chief Technologist at VMware CTO Office, Storage & Availability – Blogger at Yellow-Bricks.com – VMUGIT UserCon regular guest – "the STAR", from my point of view

Lee Dilworth  – Principal Systems Engineer at VMware

Mariano Maluf – VMUG International President, the man pushing the buttons in the VMUG control room.


I must be honest: I followed only a few of the scheduled sessions, since I was nervous about mine, but I'd like to report the overall feelings from the event.

Attendees: everyone I talked to felt they were attending a really important event, and all of them were excited about the international speakers. Most were there on their own initiative, some taking a day off work. I believe they went back home satisfied.

Sponsors: since I was hosted by one of them, I can report directly that, at least for Tintri, they found themselves in a tech vault, talking to hands-on and well-prepared professionals, so they showed their best. Walking around the other booths, anyway, I picked up more or less the same thoughts and behaviours: they knew that UserCon wasn't the place for small talk and big numbers, but it was a good opportunity to introduce themselves to the tech heart of every company.

Speakers: I personally feel dazzled when it comes to Duncan Epping – seeing him talk with the same passion every time is incredible, so maybe I can't be unbiased. His session reminded me of one of the best ones he presented at the last VMworld; after all, he can be considered the natural father of VSAN. Andrea Siviero, another great professional in my opinion, explained in a simple way how to design complex scenarios through VMware Validated Designs. As for me, I won't talk about myself, except to say that my passion for the Tintri solution is real and backed by numbers. Unfortunately I wasn't able to follow the other sessions.

Staff: I already knew Giuseppe Guglielmetti, a passionate professional, and I had already met Andrea Mauro, an incredible blogger-reviewer-organizer-consultant-…superman! But it was the first time with the others. My beliefs were confirmed: they are a strong team, and I feel proud to be a member!

Good job guys, keep it up!

Ready – Steady – GO!

So, did we really need a new blog and a new blogger? I can't say; I just hope to bring a little of my professional experience to our large virtual community.

And, to start in the best way, I will not bore you readers with useless sentences; I'll close this first post in a few words. I'm not a native English speaker, so please forgive any errors I may make, and I'll do my best to keep this blog interesting. If I don't, feel free to be blunt and tell me.

Let’s go!