Solving the Cloud Resource Management Problem with IT Automation

Businesses that use private or public clouds often struggle to manage cloud resources that must be spun up and down as business needs change. If you've ever forgotten to turn off your air conditioning while away on vacation and been hit with a surprise power bill at the end of the month, you understand the problem of forgetting to spin down virtual resources. Leaving your cloud resources spun up, like leaving your air conditioning on, can create massive costs for your organization.

To put this into perspective, an IT Director we spoke with at a recent conference told us that her organization spent $15,000 more than it had budgeted in just one month on its public cloud because of machines that were never spun down.

While the potential for accidentally incurring costs is a drawback of cloud computing, its benefits for the modern business are extensive. Improved accessibility and scalability allow organizations to adapt more easily to changing business needs, and the flexibility of having resources available in the amount you need, when you need them, is indispensable. According to a joint survey of 1,300 companies in the U.S. and U.K. by the Manchester Business School, Vanson Bourne, and Rackspace, 88 percent of cloud users reported cost savings and 56 percent of respondents said cloud services helped them boost profits.

The Challenge

Despite the benefits of the cloud, managing and monitoring virtual and cloud resources poses a significant challenge for businesses.

Example 1

A telecommunications service provider's accounting department may have many file processes to run in order to reconcile billing before the end of the month. Its file batch run is scheduled to start at 2:00pm, and running the necessary programs and processes will require spinning up multiple machines. At the same time, the IT department has a critical SLA, requiring a large amount of computing resources, that must be met by 3:00pm. Without any automated management tools in place, this puts the business in a precarious position. Given the shared infrastructure of the cloud, it becomes very easy for departments to have diametrically opposed computing needs that come into conflict.

When this happens, one of two outcomes is likely. First, the business may not realize there is contention over computing resources, resulting in frequently missed SLAs and other deadlines. Alternatively, the business may become aware of the contention and rely on manually spinning resources up and down to try to execute processes successfully. This second option introduces the risk of human error: users could forget to spin down, or turn off, an instance altogether. That instance would continue incurring costs until someone spun the machine down.

With strictly on-premises computing, IT departments must worry about things like running out of storage space, having enough air conditioning units, and high electricity bills for their physical machines. As businesses grow, IT departments have to scale to match that growth, usually with an increase in physical machines and the resources to maintain those machines. On the other hand, when businesses shrink, or offices change size or location, physical machines prove inflexible, because they are both expensive and difficult to ship, set up, or store.

Virtual/cloud resources effectively solved these problems by making it extremely easy to scale up or down in a matter of seconds. Despite this, cloud/virtual resources did not solve the problem of resource mismanagement. The reality is that too often people applied the same management principles to virtual machines as they did to physical machines, winding up with virtual machine sprawl. VM sprawl is the proliferation of VMs that are left running and forgotten, without providing any benefit to the business. Virtual machine sprawl can often be worse than physical sprawl because, unlike physical machines that visibly clutter an office or server room, virtual machines are "invisible resources" that you never actually see or touch. And when VM sprawl goes unchecked, the drain on finite resources becomes an ever-expanding sinkhole for the business.

Because cloud resources are not infinite, IT departments must make choices about how to use the resources they have to accomplish both day-to-day operations and long-term projects. Idle resources add up over time, resulting in higher costs to the business as well as missed SLAs and critical deadline failures that impact the bottom line. And when these resources are eventually depleted, IT is forced to make a difficult choice between buying more resources and simply doing less.

Example 2

An e-commerce retailer needs to make sure that critical business transactions are processed on time and without error. Since the nature of today's business world is 24/7, any production outage can mean a huge loss of revenue. E-commerce retailers often experience very dramatic peaks and valleys in demand throughout different times of the day or year. For a typical e-commerce organization, the ultimate peak is going to be either Black Friday or Cyber Monday when transaction processing and traffic to the website are both at a critical high. To deal with this kind of peak, the IT organization would probably turn to virtual/cloud machines to avoid buying servers that would just sit around during the other low-traffic days of the year. Cloud computing gives organizations the elasticity they need to meet their peak demands without having to buy and store infrastructure that will often remain idle.

But just because the organization has these virtual and cloud machines does not mean it is out of the woods yet. Even with this extra computing power, the IT organization needs to balance resources to ensure that critical business workflows for transaction processing are executed and that routine processes such as database imports or file transfers are completed. Without any automated management or monitoring, it's very easy for the e-commerce organization to slip into a situation where critical workflows and jobs are delayed or fail because resources are mismatched or spread too thin.

"Line-of-business leaders everywhere are bypassing IT departments to get applications from the cloud (also known as software as a service, or SaaS) and paying for them like they would a magazine subscription. And when the service is no longer required, they can cancel that subscription with no equipment left unused in the corner," says Gartner VP & Fellow Daryl Plummer.

The Solution: Resource Provisioning Through Automation

Example:

With Workload Automation for virtual/cloud computing, you can better handle the peaks and valleys by optimizing resource usage across the organization. For example, just as a smart thermostat system raises and lowers the temperature in your house according to your activities, an intelligent automation solution can spin up more machines when more computing power is needed to complete jobs, and spin down machines during idle times. Instead of experiencing low job success rates or having to rely on time-consuming manual management, automation seamlessly adds and removes computing power to match computing supply with business demand. The ability to automatically provision resources and deprovision capacity means unexpected demands can be accommodated and companies don't have to spend money on idle assets.

Innovative tools like Smart Queue and Managed Queue (components of ActiveBatch IT Automation) are pushing the boundaries of traditional Workload Automation by providing a way for organizations to provision and deprovision resources across virtual and cloud systems. This form of machine learning means organizations can accommodate unexpected demands on resources as well as minimize costs of idle systems. Users can set broad infrastructure parameters that automatically create and utilize resources as needed in order to ensure the reliable execution of tasks and workflows.

These out-of-the-box automation capabilities allow organizations to:

  • Optimize IT spending
  • Reduce the need for manual intervention
  • Minimize the risk of SLA and deadline breaches
  • Ensure workflows have access to computing resources when needed
  • Provision servers based on past usage and forecasted workloads
  • And more…

Simplify cloud computing with a single point of control and prebuilt integrations for virtual/cloud providers.

Here's How:

To understand how an IT Automation solution can introduce these benefits, it's important to first introduce the idea of ActiveBatch Queues. There are four main types of Queues:

  • Execution Queue
  • Generic Queue
  • Managed Queue
  • Smart Queue

Execution Queue

In ActiveBatch, the Execution Queue is the path to the machine where the ActiveBatch Agent is installed. The Execution Queue points to a specific system where the job will run. Another type of queue, the Generic Queue, consists of one or more Execution Queues and acts like a virtual queue so that any job you assign to run on the Generic Queue can also run on any one of the Execution Queues associated with this Generic Queue.

Generic Queue

The purpose of the Generic Queue is to make available more than one machine that jobs can run on so that in the event that a system goes down, the surviving systems can pick up the load to ensure the job is successfully dispatched. Additionally, the Generic Queue is useful in getting jobs up and running as soon as they trigger. By using the workload balancing algorithm, if a server has high memory or CPU utilization, ActiveBatch will send jobs to other systems in the Generic Queue where greater resources are available, thereby increasing the probability of job success. If a machine has maxed out (reached its job limit), the Generic Queue would allow the job to run on another available machine so the job doesn't have to wait in line for the maxed out machine to be ready.
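
The dispatch behavior described above can be illustrated with a minimal sketch. This is not ActiveBatch code or its actual balancing algorithm; the ExecutionQueue class, its fields, and the pick_queue function are hypothetical names chosen for illustration, and "load" is assumed here to be the higher of CPU and memory utilization.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionQueue:
    """Hypothetical stand-in for one machine behind a Generic Queue."""
    name: str
    online: bool
    running_jobs: int
    job_limit: int
    cpu_utilization: float     # 0.0 to 1.0
    memory_utilization: float  # 0.0 to 1.0

    def has_capacity(self) -> bool:
        # A machine can take a job only if it is up and below its job limit.
        return self.online and self.running_jobs < self.job_limit

    def load(self) -> float:
        # Assumption: the busier of CPU and memory represents the load.
        return max(self.cpu_utilization, self.memory_utilization)


def pick_queue(queues: list[ExecutionQueue]) -> Optional[ExecutionQueue]:
    """Return the least-loaded machine that is up and below its job limit."""
    candidates = [q for q in queues if q.has_capacity()]
    if not candidates:
        return None  # every machine is down or maxed out, so the job waits
    return min(candidates, key=lambda q: q.load())


queues = [
    ExecutionQueue("app-01", True, 4, 4, 0.35, 0.40),   # at its job limit
    ExecutionQueue("app-02", False, 0, 4, 0.00, 0.00),  # system is down
    ExecutionQueue("app-03", True, 1, 4, 0.80, 0.55),   # high CPU utilization
    ExecutionQueue("app-04", True, 2, 4, 0.30, 0.45),
]

chosen = pick_queue(queues)
print(chosen.name if chosen else "no machine available")
```

In this sample data, app-01 is skipped because it has reached its job limit, app-02 because it is down, and app-04 is preferred over app-03 because its utilization is lower.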

Managed Queue

In order to get the best return on your cloud/virtual machine investment, you must optimize the resources you are using and minimize or eliminate idle resources. Setting parameters on requested machines and retention period, as well as identifying which provider and instances you wish to run workflows on, creates a good starting framework. Managed Queue is a subset of Generic Queue that allows users to set parameters for Smart Queue to then dynamically provision virtual and cloud machines. For example, a user could establish the number of machines in the Managed Queue by setting the "Requested Machines" property to the number of Amazon instances they want to create.

Smart Queue

After you have a parameter framework set, you can utilize intelligent automation like Smart Queue to further enhance your resource management strategy. Smart Queue is an advanced capability that builds on the settings created in Managed Queue. Smart Queue allows users to increase or decrease the headroom for VMs/cloud systems by setting a minimum and maximum on the number of systems allowed. "Queue Idle Time" tackles the issue of invisible resources running in the background by setting a time frame after which idle systems are spun down. In addition, settings such as "VM Lookahead Time" can be used to automate the provisioning of systems in advance.
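
To make the interaction of these settings concrete, here is a rough sketch of how a minimum, a maximum, an idle time, and a look-ahead window could drive scaling decisions. It does not reflect how Smart Queue is implemented; ScalingPolicy, machines_to_start, machines_to_stop, and the jobs_per_machine figure are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical parameters modeled on the settings described above."""
    min_machines: int
    max_machines: int
    queue_idle_minutes: int  # spin down a machine once idle this long
    lookahead_minutes: int   # window the caller uses to count upcoming jobs


def machines_to_start(policy: ScalingPolicy, running: int, jobs_waiting: int,
                      jobs_due_within_lookahead: int,
                      jobs_per_machine: int) -> int:
    """Machines to add now, counting work due inside the look-ahead window."""
    demand = jobs_waiting + jobs_due_within_lookahead
    needed = -(-demand // jobs_per_machine)  # ceiling division
    # Clamp the target between the configured minimum and maximum.
    target = max(policy.min_machines, min(policy.max_machines, needed))
    return max(0, target - running)


def machines_to_stop(policy: ScalingPolicy,
                     idle_minutes_by_machine: dict[str, int],
                     running: int) -> list[str]:
    """Machines idle past the threshold, without dropping below the minimum."""
    expired = sorted(
        (name for name, idle in idle_minutes_by_machine.items()
         if idle >= policy.queue_idle_minutes),
        key=lambda name: idle_minutes_by_machine[name],
        reverse=True,  # longest-idle machines are stopped first
    )
    removable = max(0, running - policy.min_machines)
    return expired[:removable]


policy = ScalingPolicy(min_machines=1, max_machines=5,
                       queue_idle_minutes=15, lookahead_minutes=30)

# 6 jobs waiting plus 5 due soon, 3 jobs per machine: 4 machines needed.
print(machines_to_start(policy, running=2, jobs_waiting=6,
                        jobs_due_within_lookahead=5, jobs_per_machine=3))

# vm-b has been idle only 5 minutes, so it keeps running.
print(machines_to_stop(policy, {"vm-a": 20, "vm-b": 5, "vm-c": 45}, running=3))
```

The maximum caps spending when demand spikes, while the minimum keeps some capacity warm so that the first job after a quiet period does not wait for a machine to boot.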

Similar to how a smart thermostat learns home occupants' comfort temperature as well as the times occupants are home to adjust the temperature accordingly, Smart Queue uses historical and predictive workload analysis in order to match computing resources to business needs. For example, if a user has a plan or multiple jobs that are running longer than expected, Smart Queue can automatically spin up other machines to help that job or plan complete in the targeted time frame.

The SLA critical path priority can be compared to the way in which presidential motorcades are directed through the streets of a city. Instead of closing down every single street and causing massive traffic woes for the public, security calculates when the President's car is coming through and closes intersections five to ten minutes before the car will pass. This ensures the motorcade gets to its destination safely and on time, while minimizing the inconvenience to drivers.

Automation is helping to streamline IT processes and reduce business costs by not only managing the execution of workflows but also managing and monitoring the resources that go into making those workflows run.

Three Steps to Better Cloud Management

Take Stock of Resources

Today's IT environment contains a complex array of heterogeneous applications, databases, and platforms. When using the cloud, organizations need a strategy for how many resources they require and how the resources they are allotted can be used most efficiently.

Automated Management

One of the biggest obstacles to better resource management is the need for manual intervention. Tools like Managed Queue and Smart Queue utilize scheduling analytics to automatically provision servers based on past usage and forecasted workloads. Analytics can pull data from past workflow performance, resource availability and capacity, and upcoming schedules to make predictions about possible execution times and durations of workflows, thereby reducing time spent on manual intervention.
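
As a simple illustration of forecasting from past usage, the sketch below predicts each workflow's duration as the average of its previous runs and then estimates how many machines are needed to finish before a deadline. The function names, workflow names, and sample durations are hypothetical, and a plain average is an assumption made for brevity; a real scheduling analytics engine would likely use a more sophisticated model.

```python
import math
from statistics import mean


def forecast_minutes(past_durations_minutes: list[float]) -> float:
    """Predict the next run's duration as the average of past runs."""
    return mean(past_durations_minutes)


def machines_needed(total_work_minutes: float,
                    minutes_until_deadline: float) -> int:
    """Machines required to finish in time, assuming work splits evenly."""
    if minutes_until_deadline <= 0:
        raise ValueError("deadline has already passed")
    return max(1, math.ceil(total_work_minutes / minutes_until_deadline))


# Hypothetical run history, in minutes, for two workflows.
history = {
    "billing_reconcile": [50, 60, 70],
    "db_import": [20, 30, 40],
}

total_forecast = sum(forecast_minutes(runs) for runs in history.values())
print(machines_needed(total_forecast, minutes_until_deadline=60))
```

With these sample numbers the forecast is 90 minutes of work against a 60-minute deadline, so the sketch asks for two machines instead of letting a single machine miss the deadline.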

Resource Parameters

In order to ensure resources are efficiently allocated to specific processes so that SLAs are not breached, organizations need a way to monitor resource usage. Setting parameters on maximum and minimum VMs, configuring look-ahead times for VMs to spin up, and stopping VMs when they are idle and no longer needed for jobs are just some of the ways IT Automation is helping organizations manage their resources. Additionally, creating a framework of alerts & notifications can assist IT organizations in gaining greater control and visibility over their resource usage. With effective cloud resource management, businesses can achieve a more flexible IT infrastructure that is better able to respond to dynamic business needs while minimizing the risk of resource waste.

Why Automation?

According to research by Gartner, most large organizations have more than three Workload Automation tools implemented in their environment. With the ever-expanding array of software and technologies most businesses use today, managing these different automation tools just adds to the complexity. Instead of moving forward faster with new and innovative technologies, IT is held back by a disjointed automation framework that is brittle and unyielding to change. What is needed is an automation solution that can provide a single point of control for managing these disparate technologies and bridging the gaps between them.

Key Components of an Intelligent Workload Automation Solution

  • Broad Framework for Parameters
  • "Smart Learning" Capabilities
  • Reactive and Predictive Analytics
  • Robust Event Architecture
  • Automated Management of Machine Idle Time
  • Support for Virtualization Systems like VMware or Hyper-V and cloud offerings

ActiveBatch®

An Architectural, Layered Approach to Automation

Relying on scripting or platform-specific scheduling systems results in a fragmented, elemental approach that lacks governance and is neither scalable nor designed to accommodate change. In order to remain competitive and manage multiple systems and technologies, a modern architectural strategy is needed. ActiveBatch provides organizations with an architectural approach to IT Automation that consolidates silos of automation within a single framework, giving businesses the IT agility they need in today's technology-driven world.

ActiveBatch is redefining IT Automation with its innovative Integrated Jobs Library, which provides hundreds of templated Job Steps for key technologies and applications like SAP, Informatica, Microsoft System Center Suite, and more. The templated, drag-and-drop Job Steps can be assembled into powerful workflows, including workflow logic, which shortens implementation time and reduces the risk of scripting errors.

Modern IT Automation solutions like ActiveBatch recognize that most organizations have used legacy schedulers, which rely heavily on a script-based approach. ActiveBatch protects your script investment with a script vault, or content library, that gives users one central location for storing and editing all existing and future scripts. Additionally, capabilities like lifecycle management and revision history give users greater control and visibility by showing who last edited a script and when, while also providing an option to restore previous versions.