Network-heavy, scale-out workloads place considerable stress on today's rapidly growing virtualized and cloud datacenters. Emerging software-defined networking (SDN) technologies aim to mitigate the potential bottlenecks that can precipitate at the network layer in such environments, as well as simplify network design.
Despite advances in both hardware and software, significant challenges and optimization opportunities remain at the network layer. Specifically, incorporating topological and traffic-matrix awareness into VM placement decisions holds tremendous promise for increasing application performance.
This whitepaper explores the benefits of incorporating network topology and traffic-matrix information in VM placement decisions, as well as the challenges preventing organizations from doing so today.
Each virtualization wave has upended traditional best practices surrounding physical infrastructure. The mobility of workloads, brought forth by server and storage virtualization, taxes local and storage networks in ways that bare metal non-distributed application delivery simply did not. This new wave has presented an inevitable threat to performance.
Software-Defined Networking (SDN) technologies such as VMware NSX and Nuage Networks have introduced an SDN control plane from which virtual networks can be provisioned, configured, and secured with the same ease as provisioning and configuring a virtual machine.
SDN has delivered tremendous value to organizations by reducing time-to-provision. However, as with the provisioning of VMs, the ability to provision does not equate to the ability to control.
Network-heavy, scale-out workloads that typify today's distributed virtualized and cloud environments still present significant challenges for both Virtualization and Network Architects, as well as for the environments they design. When designing topologies, Architects must consider how to best achieve each of the following:
In truth, these problems are complex and are related less to a priori network design and more to ad hoc network intelligence and how workloads utilize it. Specifically, how can network intelligence inform VM placement decisions such that latency is minimized and performance is amplified?
Traditional 3-Tier networks (Core, Distribution, and Access) are suboptimal for today's heavily virtualized environments because virtualized traffic matrices have shifted from predominantly North-South traffic to mainly East-West traffic. This traffic comprises both data transfer between the interdependent VMs of multi-tier workloads and virtual machine migrations (vMotions).
2-Tier network designs collapse the Core and Distribution Tiers into a single Spine Tier, yielding the 2-Tier Spine-Leaf configuration illustrated below. Furthermore, 2-Tier networks feature ultrafast Ethernet and switching at the Spine accommodating rates of 40Gbps up to 100Gbps. Leaf devices are typically cheaper commodity devices with lower speeds.
On a 3-Tier Network, workload data must often travel all the way up the network topology, only to travel back down to its Leaf destination, introducing latency. Every additional switch "hop" also introduces potential packet loss. These performance threats explain the observed trend away from legacy 3-Tier Networks and toward more appropriate 2-Tier Networks.
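The difference in path length can be made concrete with a small sketch. The two topologies below are hypothetical and deliberately minimal (one host pair, one path each); the function name `switch_hops` and all device names are assumptions introduced for illustration, not part of any product.

```python
from collections import deque

def switch_hops(links, src, dst):
    """Count the switches on the shortest path between two hosts (BFS)."""
    if src == dst:
        return 0
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([(src, 0)])
    seen = {src}
    while queue:
        node, edges = queue.popleft()
        if node == dst:
            # A host-to-host path with N edges crosses N - 1 switches.
            return edges - 1
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, edges + 1))
    return None  # no path between the two hosts

# Hypothetical 3-Tier path: hosts sit under different Distribution switches.
three_tier = [
    ("host-a", "access-1"), ("access-1", "dist-1"), ("dist-1", "core"),
    ("core", "dist-2"), ("dist-2", "access-2"), ("access-2", "host-b"),
]

# Hypothetical 2-Tier path: hosts sit under different Leaf switches.
spine_leaf = [
    ("host-a", "leaf-1"), ("leaf-1", "spine-1"),
    ("spine-1", "leaf-2"), ("leaf-2", "host-b"),
]

print(switch_hops(three_tier, "host-a", "host-b"))  # 5
print(switch_hops(spine_leaf, "host-a", "host-b"))  # 3
```

In this simplified model, the worst-case East-West path crosses five switches on the 3-Tier design and three on the Spine-Leaf design.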
The reluctance of some organizations to transition from legacy networks to 2-Tier designs is often justified by the high cost of re-architecting. Although each organization must weigh the investment in re-architecting against the cost of maintaining a legacy network, the reality of increased East-West traffic presents a real challenge for all.
Even if the Spine is built of fast non-blocking devices with enough capacity, delivering the last mile of bandwidth through Leaf switching may reveal unexpected bottlenecks. More often than not, the actions related to network issues lie outside of the network domain.
Erickson et al. (2014) demonstrated that by introducing increasing degrees of network awareness to VM placement decisions, application performance (as measured by throughput or completion cycle) could improve by as much as 70% compared to random placement. Though seemingly intuitive, this work shows that incorporating network considerations into VM placement decisions deserves attention.
Consider a typical 3-tier application, the cornerstone of many Web services. This workload consists of three distinct components, each residing on its own VM:
The Web tier receives requests from end users and invokes logic within the application server, which performs SQL queries on the DB to fulfill them. The query output is passed back through the application server to the Web server, which presents it to end users in a clean format. The amount of data exchanged between tiers depends on the nature of the application and the end-user request, and can be very taxing at peak times.
Our 3-tier Web application example, while common, is very simple. In practice, distributed workloads can consist of dozens of tiers and, when scaled horizontally, hundreds or even thousands of VMs working in unison.
Commonly used workload distribution tools such as VMware® DRS and Citrix® XenServer® Workload Balancer push these workloads apart, across the local network, which often introduces latency. To mitigate this latency, Architects employ common network localization tactics: dedicated clustering and affinity rules.
Dedicated clustering is the practice of confining application tiers – most often databases – to a dedicated, low-density, high-performance cluster. Consider the following situation. In order to guarantee DB performance, administrators create separate DB clusters where a small number of high-performing database VMs run on powerful servers and fast storage devices. Since the DB VMs are isolated, performance is stellar and protected from interference. The same could be done with application servers, and then a special cluster could run load-balanced web servers whose numbers can scale appropriately with demand.
This option bears a significant flaw. When demand peaks, and a large number of clustered Web server VMs begin sending a lot of traffic to an app server, their immediately-connected network device can become saturated. Additionally, the app server queries pass through the network device on the DB cluster, which also becomes saturated. Although the Spine has plenty of capacity, the application will experience high latencies as slow Leaf switches struggle to accommodate the traffic.
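A back-of-the-envelope calculation shows how a Leaf can saturate while the Spine stays idle. Every number below is a hypothetical assumption chosen for illustration (40 Web VMs, 0.5 Gbps per VM at peak, a 10 Gbps Leaf uplink, a 100 Gbps Spine link); none comes from a measured environment.

```python
def link_utilization(vm_count, gbps_per_vm, link_gbps):
    """Offered load divided by link capacity; values above 1.0 mean saturation."""
    return (vm_count * gbps_per_vm) / link_gbps

# Hypothetical peak: 40 Web VMs each sending 0.5 Gbps toward the app tier.
leaf = link_utilization(vm_count=40, gbps_per_vm=0.5, link_gbps=10.0)
spine = link_utilization(vm_count=40, gbps_per_vm=0.5, link_gbps=100.0)

print(f"Leaf uplink utilization:  {leaf:.0%}")   # 200%
print(f"Spine link utilization:   {spine:.0%}")  # 20%
```

Under these assumptions the same 20 Gbps of offered traffic oversubscribes the Leaf uplink twofold while using only a fifth of the Spine link.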
Affinity rules are settings that establish a relationship between two or more VMs and hosts. In the 3-tier application example, we could define a simple affinity rule which requires each application tier VM (Web, Application, Database) to always reside on the same host. This strategy would leverage virtual switches (vSwitches) to constrain communication between each application tier to the host itself.
If frequently-talking VMs run on the same host, their packets will never cross the host boundary and will avoid going to LAN switches at all. While this solution seems convenient, it also bears inevitable pitfalls – especially at scale.
First, this approach requires a priori knowledge of VM communication pathways. With perfect information, the administrator may define and build an entire inventory of affinity rules across the environment. However, large organizations run hundreds of applications.
Furthermore, VM communication behavior is transient. VM-A may communicate with VM-B heavily for some duration, but then switch from VM-B to VM-C or VM-B to VM-E and VM-F for an extended period after that. Even an affinity rule that groups these machines together creates unnecessary constraints on the datacenter, segmenting it and detracting from the dynamic capabilities virtualization offers in the first place.
Is manual affinity definition truly scalable? The second pitfall of affinity rules yields a simple answer: Not really. Isolating VMs in this fashion can work, provided that demand on the workload does not overwhelm the assigned compute, storage, and internal network. As soon as demand on the workload peaks, the compute and storage infrastructure forces you to separate each tier of the load as far as possible from one another onto separate devices.
Once the tiers are separated, the potential for network latency is reintroduced. As is evident, static affinity rules seem to be a solution, but cannot truly scale for performance. Conversely, dedicated clustering accepts that at peak times, latency is inevitable.
These two tactics represent opposing strategies, neither of which is resilient under peak demand conditions. Dedicated clustering dictates that it is best to provide application tiers with lush compute and storage resources, at the risk of overloading the network. Affinity rules avoid traversing the network at the risk of overwhelming local compute and storage.
Which is better – latency due to CPU ready queues and storage I/O, or that due to congested Leaf switches? Neither. The goal is to minimize all of it. The challenge is that doing so in an environment of hundreds of applications—each with dozens of tiers, distributed across thousands of VMs sharing hundreds of hosts, data stores, and network devices—is a very difficult problem to solve. When one considers the unpredictability of demand, the difficulty is amplified. Is there a solution?
There is no static solution that can effectively control this tradeoff. Heeding the work of Erickson, as well as the challenges described herein, it is evident that latency can be minimized if and only if VM placement decisions are made with complete knowledge of the following:
Turbonomic's patented algorithm abstracts the datacenter and all its networked availability zones as a market of resource buyers and sellers. The Economic Scheduling Engine maps the end to end relationships between discrete resources in the IT stack with a holistic understanding of the supply chain from physical resource supply to end user service delivery. Workloads self-manage in real time, shopping for the best overall price of all the resources they need to perform, resulting in automated actions that continually optimize performance and efficiency. This vendor agnostic, extensible algorithm easily accommodates new entities – for example, containers – because as they arise, Turbonomic simply places the new entity within its existing supply chain.
Turbonomic has introduced two new entities into its Market algorithm to control this complex tradeoff. Leveraging Flow Collector output, Turbonomic discovers the network traffic-matrix and dynamically defines each group of "chatty" communicating VMs as an entity called a vPod.
In the previously illustrated traffic matrices, each grouping of communicating VMs would be grouped as a vPod. VMs 1, 3, 5, 7, and 9 would be one vPod, and VMs 2 and 6 would be another vPod. VM 4 and VM 8 are not a vPod, as they are not communicating at the moment.
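The grouping described above can be sketched as a connected-components search over observed flows. This is only an illustration: the source does not describe how Turbonomic actually derives vPods, and the function `discover_vpods` and the flow pairs below are hypothetical, chosen so that the result matches the VM groupings named in the text.

```python
def discover_vpods(vms, flows):
    """Group VMs that are linked, directly or transitively, by active flows.

    vms   -- iterable of VM identifiers
    flows -- iterable of (vm_a, vm_b) pairs currently exchanging traffic
    VMs with no active flow are left out of every group.
    """
    neighbors = {vm: set() for vm in vms}
    for a, b in flows:
        neighbors[a].add(b)
        neighbors[b].add(a)

    vpods, seen = [], set()
    for vm in sorted(neighbors):
        if vm in seen or not neighbors[vm]:
            continue  # already grouped, or not communicating right now
        group, stack = [], [vm]
        seen.add(vm)
        while stack:  # depth-first walk of one connected component
            current = stack.pop()
            group.append(current)
            for peer in neighbors[current]:
                if peer not in seen:
                    seen.add(peer)
                    stack.append(peer)
        vpods.append(sorted(group))
    return vpods

# Hypothetical flow records consistent with the example in the text.
flows = [(1, 3), (3, 5), (5, 7), (7, 9), (2, 6)]
print(discover_vpods(range(1, 10), flows))
# [[1, 3, 5, 7, 9], [2, 6]]
```

VMs 4 and 8 appear in neither group because they have no active flows. Rerunning the function on a fresh set of flow records would regroup the VMs, which is the dynamic behavior that static affinity rules lack.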
The vPod construct eliminates the need to inventory static affinity rules, and provides maximum flexibility of VM migrations when communication between vPod members is low. When demand on a vPod is high, the vPod migrates as a unit, consuming resources from an entity called a dPod.
A dPod is a set of resource providers located close together (physically) on the network – i.e. a group of hosts and storage residing under the same TOR Leaf switch.
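As a minimal sketch of this definition, resource providers can be grouped by the Leaf switch they attach to. The inventory format, the function `build_dpods`, and all device names are assumptions made for illustration.

```python
from collections import defaultdict

def build_dpods(attachments):
    """Group resource providers by the TOR Leaf switch they sit under.

    attachments -- iterable of (provider_name, leaf_switch_name) pairs
    """
    dpods = defaultdict(list)
    for provider, leaf in attachments:
        dpods[leaf].append(provider)
    return {leaf: sorted(members) for leaf, members in dpods.items()}

# Hypothetical inventory of hosts and datastores with their Leaf switches.
inventory = [
    ("host-01", "leaf-a"), ("host-02", "leaf-a"), ("datastore-01", "leaf-a"),
    ("host-03", "leaf-b"), ("datastore-02", "leaf-b"),
]
print(build_dpods(inventory))
# {'leaf-a': ['datastore-01', 'host-01', 'host-02'],
#  'leaf-b': ['datastore-02', 'host-03']}
```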
Using topological probabilities, Turbonomic defines four levels of flow, each more expensive within the Market than the one before it:
Intra-dPod Flow (Host to Host Migration)
Cross-dPod Flow (Cluster to Cluster Migration)
Cross-Cloud Flow (Private to Public Migration)
By pricing higher level flows more expensively, the Market forces vPods to converge locally unless no better options are available. The purchase evaluates both the flow and migration to the destination dPod, as well as a full consideration of the resource set (CPU, memory, storage, ballooning, swap state, ready queuing, etc.) that real time workload demand requires for consumption.
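The effect of tiered flow pricing can be illustrated with a toy cost comparison. The prices, the function `cheapest_dpod`, and the candidate names are hypothetical; the source does not publish Turbonomic's pricing functions. The sketch also collapses the full resource set into a single resource price per candidate and covers only the flow levels named in the list above.

```python
# Hypothetical flow prices: each level costs more than the one before it.
FLOW_PRICE = {
    "intra-dpod": 1.0,
    "cross-dpod": 5.0,
    "cross-cloud": 25.0,
}

def cheapest_dpod(candidates):
    """Pick the destination with the lowest combined flow and resource price.

    candidates -- list of (dpod_name, flow_level, resource_price) tuples
    """
    best = min(candidates, key=lambda c: FLOW_PRICE[c[1]] + c[2])
    return best[0]

# Local dPod has headroom: its low resource price keeps the vPod local.
quiet = [
    ("dpod-local", "intra-dpod", 3.0),    # total 4.0
    ("dpod-remote", "cross-dpod", 3.0),   # total 8.0
    ("cloud-east", "cross-cloud", 1.0),   # total 26.0
]
print(cheapest_dpod(quiet))  # dpod-local

# Local dPod is congested: its resource price rises until moving is cheaper.
congested = [
    ("dpod-local", "intra-dpod", 8.0),    # total 9.0
    ("dpod-remote", "cross-dpod", 3.0),   # total 8.0
    ("cloud-east", "cross-cloud", 1.0),   # total 26.0
]
print(cheapest_dpod(congested))  # dpod-remote
```

With these assumed prices, the vPod leaves its local dPod only once local resource prices climb high enough to outweigh the extra cost of a higher-level flow.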
Dynamically defined vPods self-manage the tradeoff between network, compute, and storage, migrating themselves to the most economic dPod, which will simultaneously maximize workload performance and resource utilization.
When communication between vPod members subsides, the vPod disaggregates until demand drives it back together. By dynamically localizing workload flows, the desired tradeoff is always attained, where application performance is assured while physical resources are consumed as efficiently as possible.
Network-heavy, scale-out workloads typical in today's virtual and cloud environments have driven a rapid increase in East-West network traffic. This traffic consists of both VM migration data and communication data passing between multi-tier application VMs. Static management tactics such as dedicated clustering and affinity rules force organizations to choose between network latency and compute/storage latency, a tradeoff which cannot be controlled statically.