Operations Management Suite (OMS): Alert Management

Introduction

OMS provides the capability to generate alerts based on a pre-defined log search. In addition to alert notifications, you can also configure remediation tasks for the alerts you define.

In addition to its native Alert and Alert Remediation capabilities, OMS can also integrate with several external monitoring systems. In this chapter, we will cover the following topics within OMS Log Analytics:

  • 3rd Party Alert Management
  • OMS Alert and Alert Remediation

3rd Party Alert Management

Microsoft Operations Management Suite (OMS) offers a rich set of capabilities when it comes to monitoring and alerting. In addition to its native alerting capability based on pre-defined search queries, OMS is also capable of collecting alerts from 3rd party monitoring systems such as:

  • Microsoft System Center Operations Manager
  • Nagios
  • Zabbix

Some OMS features, such as the Power BI connector, also ship with native alerts. To view all alerts stored in your OMS workspace, you may use a simple search query such as "Type=Alert".
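For example, to see how many alerts each connected source has contributed over the selected time range, you could group the stored alerts by their originating system. This is a simple illustrative query using the same measure syntax as the examples later in this chapter:

Type=Alert | measure count() as Count by SourceSystem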

Working with System Center Operations Manager Alerts

OMS Alert Management Solution Overview

The Alert Management Solution in OMS is built specifically for System Center Operations Manager (SCOM). When enabled, it collects the alerts generated by the SCOM management groups that are connected to your OMS workspace.

Powered by search and coupled with the out-of-box views, this solution provides a quick and easy method for SCOM administrators and operators to locate and search related alerts based on specific queries. It can also assist with alert tuning, incident remediation, and root cause analysis. In large organisations where there are multiple SCOM management groups, the Alert Management solution provides a single pane of glass to manipulate alert data across all SCOM management groups residing within your organisation. This scenario would be difficult to achieve in a traditional SCOM environment.

Like other solutions, the Alert Management solution can be located in the OMS Solutions Gallery. From there, you can Add (enable) it to your OMS workspace, as shown in Figure 1.

FIGURE 1. ALERT MANAGEMENT SOLUTION IN SOLUTIONS GALLERY

Once you have successfully onboarded your SCOM management groups onto OMS and have enabled the Alert Management solution, this solution will start collecting all the alerts generated from the onboarded SCOM management groups. A tile titled "Alert Management" will also be added to the Overview page, as shown in Figure 2.

FIGURE 2. ALERT MANAGEMENT TILE IN THE OVERVIEW PAGE

Clicking the "Alert Management" tile from the Overview page will take you to the Alert Management Dashboard page. As shown in Figure 3, this page presents you with a number of views relating to SCOM alerts which are based on your chosen scope and time range. You may further drill down by clicking on any specific areas from within the dashboard.

FIGURE 3. ALERT MANAGEMENT DASHBOARD

By default, the data presented on the Alert Management Dashboard page is based on the last 1 day, with the scope set to "GLOBAL". Both the time range and the scope can be changed here and data on the page will be updated accordingly. The time range and scope settings are shown in Figure 4.

FIGURE 4. ALERT MANAGEMENT SOLUTION TIME RANGE SETTING

When there are two or more SCOM management groups connected to the OMS workspace, the data shown on the Alert Management Dashboard is scoped to all management groups by default (the GLOBAL setting). If you would only like to see data from a particular management group, you can select the management group of your choice using the Scope option, as shown in Figure 5.

FIGURE 5. ALERT MANAGEMENT SOLUTION SCOPE SETTING

Searching Alerts

Similar to most of the other solutions in OMS, the Alert Management solution is powered by search. Users can search alert data using search queries that begin with Type=Alert.

For example, to search 'All active alerts (logged over the last 7 days)' as shown in Figure 6, use the following query:

Type=Alert AlertState!=Closed SourceSystem=OpsManager

FIGURE 6. SEARCHING ALL ACTIVE SCOM ALERTS

In SCOM, each alert is an instance of the Microsoft.EnterpriseManagement.Monitoring.MonitoringAlert class. The properties of this class are documented on MSDN at https://msdn.microsoft.com/en-us/library/microsoft.enterprisemanagement.monitoring.monitoringalert.aspx

Table 1 below lists the OMS alert data fields and corresponding SCOM MonitoringAlert class properties:

OMS Alert Data Fields          SCOM Alert Object Properties
AlertId                        Id
AlertName                      Name
AlertDescription               Description
SourceDisplayName              MonitoringObjectDisplayName
SourceFullName                 MonitoringObjectFullName
AlertState                     ResolutionState
AlertPriority                  Priority
AlertSeverity                  Severity
ResolvedBy                     ResolvedBy
TimeRaised                     TimeRaised
TimeLastModified               LastModified
LastModifiedBy                 LastModifiedBy
TimeResolved                   TimeResolved
TicketId                       TicketId
AlertContext                   Context
RepeatCount                    RepeatCount
ManagementGroupName            ManagementGroup
SourceSystem                   (no corresponding property; always "OpsManager" for SCOM alerts)

TABLE 1. OMS ALERT DATA FIELDS VS. SCOM ALERT OBJECT PROPERTIES

Unlike SCOM, OMS does not implement health models (stateful representations of object and application health); OMS only stores and processes alert data. When importing alert data from SCOM, only a subset of alert properties is uploaded into OMS. All the OMS alert data fields listed in Table 1 are searchable and can be used when constructing your alert search queries.

Since OMS can integrate with multiple monitoring systems, the SourceSystem field records where each alert originated; for SCOM alerts, this value is always set to "OpsManager".

Note: Searching data in OMS is explained in detail in "Chapter 2: Searching and Presenting Data in Log Analytics" in this book.

OMS also includes a number of common search queries for Alert data and these queries can be seen in Figure 7.

FIGURE 7. BUILT-IN SEARCH QUERIES FOR ALERTS

In addition to the built-in search queries, you may also use some of the following sample queries to start your journey of managing SCOM alerts using OMS.

Alerts raised during the past 1 day, grouped by management group:

Type=Alert SourceSystem=OpsManager TimeRaised>NOW-1DAY | measure count() as Count by ManagementGroupName

This query (shown in Figure 8) can be used when comparing the overall health of your SCOM management groups.

FIGURE 8. SEARCH RESULT FOR 'ALERT RAISED DURING THE PAST 1 DAY GROUPED BY MANAGEMENT GROUPS'

Top 5 Noisiest Alert Sources during the past 7 days:

Type=Alert SourceSystem=OpsManager TimeRaised>NOW-7DAY | measure count() as Count by SourceFullName | top 5

A very common requirement for SCOM administrators is to identify the top offenders when it comes to alerts generated. The systems returned from this search (shown in Figure 9) not only consume more resources in SCOM than other systems, but they could also be the unhealthiest systems within your environment.

FIGURE 9. SEARCH RESULT FOR 'TOP 5 NOISIEST ALERT SOURCE DURING THE PAST 7 DAYS'

Top 10 Noisiest Alerts during the past 7 days:

Type=Alert SourceSystem=OpsManager TimeRaised>NOW-7DAY | measure count() as Count by AlertName | top 10

This is a useful query to use during alert tuning and for root cause analysis. A sample output of the query is shown in Figure 10.

FIGURE 10. SEARCH RESULT FOR 'TOP 10 NOISIEST ALERTS DURING THE PAST 7 DAYS'

Alerts raised during the past 1 day that do not have a Ticket ID assigned:

Type=Alert SourceSystem=OpsManager TimeRaised>NOW-1DAY TicketId!=*

When an automated alert handling process is implemented for SCOM (i.e. using the SCSM alert connector or custom Orchestrator runbooks), SCOM alerts are generally updated with the Ticket ID after the incident has been logged in the ticketing system. Alerts without a value in the TicketID field could indicate an issue with the automated alert handling process that SCOM administrators would need to investigate further. The output of this query is shown in Figure 11.

FIGURE 11. SEARCH RESULT FOR 'ALERT RAISED PAST 1 DAY WITH NO TICKET ID ASSIGNED'

All active alerts with repeat count greater than 0 sorted by repeat count (from high to low):

Type=Alert SourceSystem=OpsManager AlertState!=Closed RepeatCount>0 | sort RepeatCount desc

This is another useful query for alert tuning and root cause analysis, and Figure 12 shows an example of its output.

FIGURE 12. SEARCH RESULT FOR 'ALL ACTIVE ALERTS WITH REPEAT COUNT GREATER THAN 0'

Count Alert ResolvedBy grouped by User Name:

Type=Alert SourceSystem=OpsManager AlertState=Closed | measure count() as Count by ResolvedBy

In SCOM, alert objects have a property called "IsMonitorAlert", which is set to TRUE if the alert is generated by a monitor (as opposed to a rule). This field is not available in OMS alert data. However, by using this query, you can still identify who is closing your SCOM alerts. By clicking on a user ID to drill down further, you may identify monitor-generated alerts that have been manually closed by a person rather than automatically resolved by the system when the error condition that generated the alert was corrected.

As SCOM administrators know, this is a bad practice: the underlying monitor remains in an error state and will not raise another alert unless the monitor is reset or the condition recovers and then recurs. Additionally, alerts resolved by "Auto-resolve" indicate that SCOM closed them automatically, still unresolved, because they had not been updated for too long. In Figure 13 you can see that the majority of alerts in our environment have been closed by the 'System' account and a small number have been closed by a specific user account.

FIGURE 13. SEARCH RESULT FOR 'COUNT ALERT RESOLVEDBY GROUPED BY USER NAME'

Populating SCOM Alert Data to OMS

Once the SCOM management group is connected to an OMS workspace, a number of management packs will be downloaded and imported into the SCOM management group automatically. These management packs contain the words "Advisor" or "Intelligence" in their name.

In OMS, the Alert Management solution leverages a rule called "Collect Alert Changes" (located in the "Microsoft System Center Advisor Alert Management" management pack) to collect SCOM alerts and bring them into OMS.

In the exported management pack shown in Figure 14, you can see how the rule is configured to send alert information to a specific SCOM connector.

FIGURE 14. 'COLLECT ALERT CHANGES' RULE CONFIGURATION

You can also examine the configuration of this connector through SCOM PowerShell cmdlets with the following script (output shown in Figure 15):

Get-SCOMConnector -Id 6D481226-6A6D-4315-AC9E-022C35A33B9B | Format-List *

FIGURE 15. RETRIEVING 'ADVISOR DATA CONNECTOR' VIA SCOM POWERSHELL MODULE

In the SCOM Operations console, you can see there is a subscription created for this connector, as shown in Figure 16.

FIGURE 16. 'ADVISOR DATA CONNECTOR' SUBSCRIPTION

As shown in Figure 17, the subscription that is automatically created for the 'Advisor Data Connector' sends all alerts generated in the SCOM management group to OMS.

FIGURE 17. 'ADVISOR DATA CONNECTOR' SUBSCRIPTION CONFIGURATION

IMPORTANT: Do not delete this connector subscription or the Alert Management solution will stop functioning.

Configuring Nagios and Zabbix Alert Collection

In addition to supporting Microsoft System Center Operations Manager alert collection, OMS also supports collecting alerts from Nagios or Zabbix servers running on Linux computers that have the OMS agent installed.

Configuring Nagios Alert Collection

Nagios is an open-source application that offers monitoring and alerting services for servers, switches, applications, and services.

To collect alerts from a Nagios server, the following configuration changes must be made:

  1. Grant the user omsagent read access to the Nagios log file (i.e. /var/log/nagios/nagios.log). Assuming the nagios.log file is owned by the group nagios, you can add the user omsagent to the nagios group:

    sudo usermod -a -G nagios omsagent

  2. Modify the omsagent configuration file (/etc/opt/microsoft/omsagent/conf/omsagent.conf). Ensure the following entries are present and not commented out:

    <source>
      type tail
      # Update path to point to your nagios.log
      path /var/log/nagios/nagios.log
      format none
      tag oms.nagios
    </source>

    <filter oms.nagios>
      type filter_nagios_log
    </filter>

  3. Restart the omsagent daemon using the following command:

    sudo service omsagent restart

Configuring Zabbix Alert Collection

Zabbix is an enterprise open source monitoring solution for networks and applications. It is designed to monitor and track the status of various network services, servers, and other network hardware. Zabbix uses MySQL, PostgreSQL, SQLite, Oracle or IBM DB2 to store data.

To collect alerts from a Zabbix server, you will perform similar steps to those detailed in the previous section for Nagios, except that you will need to specify a Zabbix user name and password in clear text in the omsagent.conf file. This is not ideal because it is a security risk; to help mitigate it, we recommend that you create a dedicated user and grant it only the permissions explicitly required for monitoring. For details on how to configure users and permissions in Zabbix, check the product documentation at https://www.zabbix.com/documentation/2.0/manual/config/users_and_usergroups/permissions

An example Zabbix section of the omsagent.conf configuration file (/etc/opt/microsoft/omsagent/conf/omsagent.conf) looks similar to the following:

<source>
  type zabbix_alerts
  run_interval 1m
  tag oms.zabbix
  zabbix_url http://localhost/zabbix/api_jsonrpc.php
  zabbix_username Admin
  zabbix_password zabbix
</source>
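As an additional, optional mitigation for the clear-text password (assuming the configuration file is owned by the omsagent user, which you should verify on your system), you can restrict read access so other local users cannot see it:

sudo chmod 600 /etc/opt/microsoft/omsagent/conf/omsagent.conf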

OMS Alerting and Alert Remediation

In addition to integrating with third-party monitoring systems such as SCOM, Nagios, and Zabbix, OMS also has its own native alerting capability. OMS Alerting is based on log search; you can create an OMS alert rule from a search query.

The OMS alert rule will run a search query based on a schedule that you have specified. The rule will generate an alert if the condition is met. Natively, you can configure the alert rule to carry out either of the following native actions:

  • Email notification
  • Triggering a webhook

Additionally, if you have connected an Azure Automation account to your OMS workspace, you can also configure the alert rule to trigger an Azure Automation runbook as the alert remediation task. These actions are accessible from the Add Alert Rule configuration page shown in Figure 18.

FIGURE 18. OMS ALERT CONFIGURATION PAGE

Note: Being able to trigger an Azure Automation runbook is a great feature of OMS Alerting. However, in order to utilize this feature, the Azure Automation account that hosts the runbook must be linked to the OMS Log Analytics workspace.

OMS Alert rules also support alert suppression. As shown in Figure 18, when the Suppress Alert check box is ticked, the actions for the rule are disabled for a defined period of time after creating a new alert. This setting allows administrators to have additional time to correct the problem without duplicate alerts being raised.

Triggering Azure Automation Runbooks with OMS Alerts

Similar to creating saved searches and groups, you can create an OMS alert rule from log search results. In the following sections, we will walk through the steps of creating an OMS alerting rule to detect any Windows service stopped events and we will use an Azure Automation runbook as a remediation task to start the service. We will also leverage the two additional Windows services custom fields (WindowsServiceState_CF and WindowsServiceName_CF) for the alert rule that we previously created in the "Custom Fields" section of "Chapter 6: Extending OMS Using Log Search".

Creating Azure Automation Runbook for Alert Remediation

Before creating the OMS alert rule, we must create the Azure Automation runbook for alert remediation. This runbook will be triggered by the OMS alert rule when alerts are raised; in our example, it will use remote WMI calls to start any stopped services whose Startup Type is set to Automatic.

Note: Since the runbook will connect to your Windows agent remotely using WMI, you will need to provision one or more Hybrid Workers that are able to connect to your Windows agents. You will also need to create an Azure Automation credential asset to store the server administrative user name and password. This credential will be used by the runbook when connecting to the Windows agents via remote WMI.

We will create a PowerShell script runbook called OMSAlert-RestartWinService. The full source code of this runbook can be found at the location given in the note below.

Note: This script can be downloaded from the book's GitHub repository at https://github.com/insidemscloud/OMSBookV2 in the \Chapter 7 directory; the file name is OMSAlert-RestartWinService.ps1.
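For reference, the following is a minimal sketch of the general approach, not the published script itself. It assumes each search result record exposes the Computer field and the WindowsServiceName_CF custom field (holding the service display name), and that the payload follows the classic OMS alert webhook format:

# Minimal sketch of an OMS alert remediation runbook (not the full
# OMSAlert-RestartWinService.ps1 script from the GitHub repository).
param (
    [Parameter(Mandatory = $false)]
    [object] $WebhookData
)

# The OMS alert rule passes the alert details as a JSON string in RequestBody.
# Runbook remediation uses the field name 'SearchResults' (see the later note on naming).
$RequestBody = ConvertFrom-Json -InputObject $WebhookData.RequestBody
$Records     = $RequestBody.SearchResults.value

# Credential asset used for remote WMI connections (created in the next step)
$Cred = Get-AutomationPSCredential -Name 'ServerAdminDefaultCred'

foreach ($Record in $Records) {
    $Computer    = $Record.Computer
    $ServiceName = $Record.WindowsServiceName_CF

    # Look up the service on the affected computer via remote WMI
    $Service = Get-WmiObject -Class Win32_Service -ComputerName $Computer -Credential $Cred `
                             -Filter "DisplayName='$ServiceName'"

    if ($Service -and $Service.StartMode -eq 'Auto' -and $Service.State -ne 'Running') {
        Write-Output "Starting service '$ServiceName' on $Computer"
        $Service.StartService() | Out-Null
    }
    else {
        Write-Output "Skipping service '$ServiceName' on $Computer (not Automatic or already running)"
    }
}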

There are a number of ways you can create this runbook: manually from the Azure portal, using the Azure Automation Add-On for PowerShell ISE, by automating the creation process with an ARM template, or simply by using PowerShell. You can refer to "Chapter 3: Process Automation" in this book for details on how to create runbooks.
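If you go down the PowerShell route, a minimal sketch using the AzureRM.Automation module might look like the following; the script path, resource group, and automation account names are placeholders:

# Import the runbook definition from a local .ps1 file
Import-AzureRmAutomationRunbook -Path 'C:\Scripts\OMSAlert-RestartWinService.ps1' `
    -Name 'OMSAlert-RestartWinService' -Type PowerShell `
    -ResourceGroupName 'MyResourceGroup' -AutomationAccountName 'MyAutomationAccount'

# Publish the draft so it can be started and used by alert rules
Publish-AzureRmAutomationRunbook -Name 'OMSAlert-RestartWinService' `
    -ResourceGroupName 'MyResourceGroup' -AutomationAccountName 'MyAutomationAccount'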

Once the PowerShell runbook is created, you must make sure you also publish it. When you click the "Start" button in the Azure portal, you should see that the runbook has one optional input parameter called "webhookdata", as shown in Figure 19.

FIGURE 19. AZURE AUTOMATION RUNBOOK FOR OMS ALERT REMEDIATION

Before you start creating the OMS alert, you need to make sure an Azure Automation credential asset called "ServerAdminDefaultCred" is also created (shown in Figure 20). The runbook will utilize this credential to connect to your servers.

FIGURE 20. CREATE CREDENTIAL ASSET FOR THE REMEDIATION RUNBOOK
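If you prefer to create this credential asset with PowerShell instead of the portal, a minimal sketch (again using the AzureRM.Automation module; the user name, password, resource group, and account names are placeholders) is shown below:

# Build a PSCredential object and store it as an Automation credential asset
$Password   = ConvertTo-SecureString 'ReplaceWithRealPassword' -AsPlainText -Force
$Credential = New-Object System.Management.Automation.PSCredential ('CONTOSO\svc-oms-admin', $Password)

New-AzureRmAutomationCredential -Name 'ServerAdminDefaultCred' -Value $Credential `
    -ResourceGroupName 'MyResourceGroup' -AutomationAccountName 'MyAutomationAccount'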

Creating the OMS Alert Rule

Since we already published the remediation runbook in the Azure Automation account, we can now create the OMS alert rule.

You can follow the steps listed below to create an OMS alert rule.

  • In the OMS portal, perform a log search using the search query that you wish to use for the alert rule. In this case, we will use:

    Type=Event EventLog=System EventID=7036 WindowsServiceState_CF=stopped

  • On the log search result page, click on the Alert icon shown in Figure 21.

FIGURE 21. CREATING OMS ALERT RULE FROM LOG SEARCH RESULT

  • In the "Add Alert Rule" page, fill the following information:
    • Name: Windows Services Stopped Alert
    • Description: This rule detects service stopped events on Windows computers.
    • Severity: Critical
    • Search Query: leave it as default (Use current search query)
    • Time Window: 5 Minutes (Supported range is 5 minutes to 24 hours)
    • Alert Frequency: 5 Minutes (Supported range is 5 minutes to 24 hours)
    • Generate Alerts based on: Number of results
    • Number of Results: Greater than 0 (Supported range is 0 to 10000)
    • Suppress alerts: unticked
    • Email Notification: Yes
      • Subject: OMS Alert Windows services stopped
      • Recipients: enter one or more email addresses separated by semicolons
    • Webhook: No
    • Runbook: Yes
      • Select a runbook: OMSAlert-RestartWinService
      • Run on: Hybrid worker
      • Select a hybrid worker: Select the most appropriate hybrid worker group for your environment.
  • Your configuration should look similar to our example shown in Figure 22 and when you're ready, click Save to create the OMS alert rule.

FIGURE 22. CREATING OMS ALERT RULE "WINDOWS SERVICES STOPPED ALERT"

  • Click OK on the confirmation page to complete the process.

Note: When creating OMS alerts, make sure the value you enter for the time window matches the alert frequency; otherwise you may start seeing duplicate alerts because some log records may be picked up more than once.

The OMS alert rule will automatically create a webhook for the remediation runbook. This webhook is used by the OMS alert rule to trigger the runbook.

As shown in Figure 23, the webhook name starts with "OMS Alert Remediation" and the expiration date is set to 1 year after the creation date. The webhook URL is not exposed to you so you cannot use this webhook in other places.

FIGURE 23. AZURE AUTOMATION RUNBOOK WEBHOOK CREATED BY OMS ALERT RULE
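If you want to inspect these automatically created webhooks outside the portal, you can list them with the AzureRM.Automation module. A small sketch with placeholder names follows; note that the webhook URI is only ever returned at creation time, so it is not exposed here either:

# List webhooks attached to the remediation runbook and show their expiry
Get-AzureRmAutomationWebhook -RunbookName 'OMSAlert-RestartWinService' `
    -ResourceGroupName 'MyResourceGroup' -AutomationAccountName 'MyAutomationAccount' |
    Select-Object Name, IsEnabled, ExpiryTime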

Since you have enabled email notification, when the alert is triggered OMS will email the recipients with the following information:

  • Alert name
  • Alert description
  • Alert severity
  • OMS workspace name
  • Remediation Runbook name and Job ID
  • Search interval start time
  • Search interval duration
  • Search query
  • Link to view the search results
  • Top 10 results

The output of this notification email is shown in Figure 24.

FIGURE 24. OMS ALERT NOTIFICATION EMAIL

In addition to the notification email, you can also check the runbook job output from within the Azure portal. As you can see in Figure 25, the runbook is designed to check the service Startup Type before starting the service. It will only start the services that have their Startup Type set to Automatic. It will skip any services that are set to Manual or Disabled.

FIGURE 25. REMEDIATION RUNBOOK JOB OUTPUT

We can also verify the affected service has been started back up by searching OMS. For example, as shown in Figure 25, the SQL Browser service was stopped and detected by OMS. After this detection, the remediation runbook started the service back up. When we search OMS, we can see both the stopped and running events (as shown in Figure 26).

FIGURE 26. VERIFYING RESTART SERVICE REMEDIATION RUNBOOK USING OMS SEARCH

Using OMS Alert webhook remediation

Webhooks have become very popular and widely adopted over recent years. Many applications support webhooks, such as GitHub, Slack, Office 365 Groups, Azure Automation, and MyGet.

Furthermore, you can easily create your own webhook-enabled web services with Azure Functions, using your favorite programming language (such as PowerShell).

As discussed previously, when creating OMS alert rules, you also have the option to trigger a webhook when the alert is raised.

By default, the OMS alert rule sends a JSON payload to the webhook that contains the following data:

  • WebhookName
  • RequestBody
    • WorkspaceId
    • AlertRuleName
    • SearchQuery
    • SearchResult
    • SearchIntervalStartTimeUtc
    • SearchIntervalEndtimeUtc
    • AlertThresholdOperator
    • AlertThresholdValue
    • ResultCount
    • SearchIntervalInSeconds
    • LinkToSearchResults
    • Description
  • RequestHeader

Webhook remediation also gives you the option to include your own custom JSON payload. When you specify a custom JSON payload, it replaces the default payload. Most of the fields listed above under the RequestBody section can be used in your custom JSON payload if you wish. You may also hardcode any additional data into the custom JSON payload (e.g. to include an authorization key for the target system).

The following fields shown in Table 2 can be included in the custom JSON payload:

Parameter                     Variable                                Description
AlertRuleName                 #alertrulename                          Name of the alert rule.
AlertThresholdOperator        #thresholdoperator                      Threshold operator for the alert rule (Greater than or Less than).
AlertThresholdValue           #thresholdvalue                         Threshold value for the alert rule.
LinkToSearchResults           #linktosearchresults                    Link to the Log Analytics log search that returns the records from the query that created the alert.
ResultCount                   #searchresultcount                      Number of records in the search results.
SearchIntervalEndtimeUtc      #searchintervalendtimeutc               End time for the query in UTC format.
SearchIntervalInSeconds       #searchinterval                         Time window for the alert rule.
SearchIntervalStartTimeUtc    #searchintervalstarttimeutc             Start time for the query in UTC format.
SearchQuery                   #searchquery                            Log search query used by the alert rule.
SearchResults                 N/A (set "IncludeSearchResults":true)   Records returned by the query in JSON format. Limited to the first 5,000 records.
WorkspaceID                   #workspaceid                            ID of your OMS workspace.

TABLE 2. WEBHOOK CUSTOM JSON PAYLOAD SUPPORTED FIELDS

For example, if you wish to include only the alert name, result count, search results, and a custom authorization key in the custom JSON payload, you may use the following:

{
  "alertname":"#alertrulename",
  "resultcount":"#searchresultcount",
  "IncludeSearchResults":true,
  "authorizationkey":123456789
}

When using the webhook remediation feature, before saving the OMS alert rule you also have the ability to test the webhook using the Test webhook button, as shown in Figure 27.

FIGURE 27. WEBHOOK REMEDIATION SUPPORTS CUSTOM JSON PAYLOAD

Metric Measurement Alerts

The example alert rule we created earlier detects Windows service stopped events and is based on the number of records returned from the search query; it generates its alerts using the "Number of results" option. This option works well with event-based records. However, if you need to generate alerts when a certain performance counter value has exceeded a threshold, the "Number of results" approach has limitations and may not work well.

In this case, you can configure the alert rule to generate an alert based on the "Metric measurement" option shown in Figure 28.

FIGURE 28. GENERATE ALERT BASED ON METRIC MEASUREMENT

Using the "Metric measurement" option, we can use a search query that has an aggregate clause (measure command) and an interval clause (interval command).

The measurement command (measure) requires you to group on a field for measurement, e.g. measure avg(CounterValue) by Computer or measure max(CounterValue) by CounterPath.

The interval command (interval) requires you to specify an aggregation interval, e.g. interval 30minutes or interval 2hours.

Note: The OMS search query syntax is well documented on the Microsoft Azure documentation site. If you need help building your alert search query, make sure you check it out here: https://docs.microsoft.com/en-us/azure/log-analytics/log-analytics-search-reference

To demonstrate the process, we will create a metric measurement OMS alert rule for the logical disk % Free Space for SQL servers. The alert rule will generate an alert for each logical disk when the % Free Space is below 20%.

We will use a computer group called "SQL Servers". This group was created using the following search query:

Type=Perf ObjectName=SQL* | measure count() by Computer

Note: Since the group membership is based on any computers that are sending SQL-related performance data to OMS, as a prerequisite you first need to configure OMS to collect at least one SQL performance counter where the object name starts with "SQL".

The query below is used for the alert rule:

Type=Perf ObjectName=LogicalDisk CounterName="% Free Space" InstanceName!=_Total Computer IN $ComputerGroups[SQL Servers] | measure avg(CounterValue) by CounterPath interval 5minutes

This query is scoped to the computers in the "SQL Servers" group. It aggregates the counter value by counter path in 5 minute intervals. Since it is possible that a computer may have multiple logical disks, by aggregating the counter value by counter path, we are able to generate alerts for each individual logical disk because the counter path is unique for every instance.

The query also excludes any "_Total" instances using the "InstanceName!=_Total" clause because we are interested in individual logical disks, not the total value.

The OMS alert rule is created using the following parameters:

  • Name: SQL Server Logical Disk Low % Free Space Alert
  • Description: This alert detects logical disk low % free space on SQL servers.
  • Severity: Critical
  • Time Window: 15 Minutes
  • Alert frequency: 15 Minutes
  • Generate alert based on: Metric measurement
  • Aggregate Value: Less than 20
  • Total Breaches: Greater than 1
  • Suppress Alerts: Enabled
  • Suppress Alerts for: 12 hours
  • Email Notification: Yes
  • Subject: OMS Alert SQL Server Logical Disk Low % Free Space
  • Recipients: <enter your email address>
  • Webhook: No
  • Runbook: No

In Figure 29 you can see how we have configured our Metric Measurement alert to detect low logical disk space on servers in our SQL Servers group.

FIGURE 29. CREATING METRIC MEASUREMENT ALERT BASED ON PERFORMANCE DATA

When the alert is generated, you will receive an email similar to the one in Figure 30 for each offending logical disk.

FIGURE 30. METRIC MEASUREMENT ALERT NOTIFICATION EMAIL

As shown in Figure 30, the I:\ drive on server CONFIGMGRPSS01 only has approximately 8.68% free space. Since this is below the configured threshold of 20%, an alert is generated. When we log on to the server, as shown in Figure 31, it is clear that the I:\ drive is running out of space, validating that the alert is functioning as expected.

FIGURE 31. OFFENDING LOGICAL DISK

When you are creating the "Metric measurement" based alert rules, you need to make sure the aggregation interval you have specified in the search query is not greater than the alert time window and frequency. In this case, we have set the aggregation interval to 5 minutes and the alert frequency and time window to 15 minutes. Therefore, when the alert rule runs, 3 data points are returned for each logical disk.

With the "Metric measurement" option, you can also choose to configure the alert trigger to be based on either Total Breaches or Consecutive Breaches as shown in Figure 32.

FIGURE 32. TOTAL BREACHES OR CONSECUTIVE BREACHES OPTIONS

If you select Total Breaches, the alert is generated when the total number of breaches across the returned data points exceeds the configured threshold (in this case, greater than 1).

If you select the Consecutive Breaches option, the alert is only generated when the specified number of consecutive data points have breached the threshold. This option helps you eliminate noise and false alerts because the alert is only generated when the alert condition has been met continuously for a period of time.

Note: Based on the nature of Total Breaches and Consecutive Breaches options, you must make sure the number of total breaches or consecutive breaches does not exceed the number of data points returned from the search query.

In our example shown in Figure 29, we have also configured the alert rule to be suppressed for 12 hours (after the alert is generated). This is to ensure the administrators will have enough time to correct the low disk space condition on the offending server without having to deal with duplicate alerts.

CAUTION: When you have enabled the alert suppression, OMS does not just suppress the alert on the offending instances, but it suppresses the entire alert rule. Therefore, while the alert rule is suppressed, any new alert conditions will not be detected. In the previous example, if another server has reached the low disk space condition while the rule is suppressed, the alert will NOT be generated. Use this feature carefully!

Accessing OMS alerts in log search

When an alert is generated, in addition to the configured notifications and remediation actions, OMS also saves the alert as a record under the Alert log type. Therefore, you can easily access the historical alerts using the following query:

Type=Alert SourceSystem=OMS

The alert record not only contains details of the alert itself, it also contains the Azure Automation runbook job ID if runbook remediation was configured for the alert rule. You can use the value in the RemediationJobId field to check the runbook job status. This can be achieved in PowerShell using the AzureRM.Automation module; the script below is an example of how you can retrieve the OMS alert remediation runbook job details.

Note: This script can be downloaded from the book's GitHub repository at https://github.com/insidemscloud/OMSBookV2 in the \Chapter 7 directory; the file name is OMSAlert-RestartWinService.ps1.
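As an illustration (not the script referenced in the note above), the following sketch uses the AzureRM.Automation cmdlets with placeholder names to check a remediation job by its ID:

# Sign in and identify the Automation account hosting the remediation runbook
Login-AzureRmAccount
$ResourceGroup     = 'MyResourceGroup'          # placeholder
$AutomationAccount = 'MyAutomationAccount'      # placeholder
$RemediationJobId  = '00000000-0000-0000-0000-000000000000'  # from the alert record's RemediationJobId field

# Retrieve the job status
Get-AzureRmAutomationJob -ResourceGroupName $ResourceGroup `
    -AutomationAccountName $AutomationAccount -Id $RemediationJobId |
    Select-Object RunbookName, Status, StartTime, EndTime

# Retrieve the job output streams
Get-AzureRmAutomationJobOutput -ResourceGroupName $ResourceGroup `
    -AutomationAccountName $AutomationAccount -Id $RemediationJobId -Stream Any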

Managing Alert Rules Lifecycle

After the alert rules are created, you can modify them by browsing to the Settings page and Alerts tab as shown in Figure 33.

FIGURE 33. MODIFYING EXISTING ALERT RULES

On the Alerts page, you have the option to delete the rule using the X icon and you can modify the rule using the pen icon.

You can also disable/enable the rule using the On / Off switch.

When an alert rule is created, in addition to the rule itself, the following objects may also be created:

  • A saved search under the "Alert" category (shown in Figure 34).
  • If Runbook remediation is configured, a webhook for the Azure Automation runbook.

FIGURE 34. SAVED SEARCH CREATED BY THE ALERT RULE

Note: When the alert rule is deleted, the saved search and the runbook webhook are not deleted. To keep things tidy, you will need to delete them manually.

Managing OMS Alerts on Mobile Devices

Microsoft has developed an OMS app for mobile devices. It is available on iOS, Android, and Windows Phone and can be downloaded from their respective app stores. If the version running on your mobile device is up to date, you can configure the app to send push notifications when OMS alert rules have generated alerts. This alert push feature is shown in Figure 35.

FIGURE 35. OMS ALERT PUSH NOTIFICATION ON MOBILE DEVICE

From within the OMS app, you can access historical alerts from the alerts view as shown in Figure 36:

FIGURE 36. ACCESSING HISTORICAL ALERTS IN THE OMS MOBILE APP

You can also access a dashboard for the particular alert if you tap into an alert from the alerts page (shown in Figure 37):

FIGURE 37. ALERT DASHBOARD IN THE OMS MOBILE APP

The OMS mobile app can be configured to only notify on alerts with a certain severity. You can configure this setting by going to the Settings page and then tapping Change Alert Settings, as shown in Figure 38.

FIGURE 38. CHANGE ALERT SETTINGS IN THE OMS MOBILE APP

Managing OMS Alerts using the Alert API

OMS Log Analytics provides several REST APIs for developers to integrate their solutions with OMS. One of these APIs is the Alert REST API. This API allows you to create and manage alerts in OMS.

Coupled with the OMS Log Search API, the Alert API allows you to create, edit, and delete schedules against an existing saved search. It also allows you to create, edit, and delete actions (alert remediations) for OMS alerts.

The OMS Log Analytics Alert API is fully documented on the Microsoft documentation site here: https://docs.microsoft.com/en-us/azure/log-analytics/log-analytics-api-alerts

Rafaela Brownlie, a Microsoft Premier Field Engineer has also published a good article on how to enable and disable OMS alerts using the OMS Log Analytics Alert API via Azure Automation. You can find this article at https://blogs.msdn.microsoft.com/canberrapfe/2016/11/16/maintenance-mode-for-oms-alerts/

OMS Alert Remediation Runbook Design Guidelines

As previously discussed, when using the Azure Automation runbook remediation, the OMS alert rule creates a webhook for the runbook.

When developing runbooks to be used for the OMS alert remediation, make sure you use the following guidelines:

  1. The only required input parameter should be named "webhookdata" and it should be configured as an optional parameter.
  2. The sample script shown below demonstrates how to retrieve information from the input JSON payload webhookdata. You may use it as a reference when developing your OMS alert remediation runbooks.

Note: This script can be downloaded from the book's GitHub repository at https://github.com/insidemscloud/OMSBookV2 in the \Chapter 7 directory; the file name is WebhookDataHandlerSample.ps1.
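For orientation only (this is not the WebhookDataHandlerSample.ps1 script itself), a bare-bones handler might start like the following; it also allows for the field-naming difference described in the note that follows:

# Minimal sketch of a webhookdata handler for an OMS alert remediation runbook
param (
    [Parameter(Mandatory = $false)]
    [object] $WebhookData
)

# The alert details arrive as a JSON string in RequestBody
$RequestBody = ConvertFrom-Json -InputObject $WebhookData.RequestBody

# Runbook remediation passes 'SearchResults'; custom webhook remediation passes 'SearchResult'
if ($RequestBody.SearchResults) {
    $SearchResult = $RequestBody.SearchResults
}
else {
    $SearchResult = $RequestBody.SearchResult
}

# The individual records are typically found under the 'value' property
$Records = $SearchResult.value
Write-Output "Received $($Records.Count) record(s) from the OMS alert"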

Note: As mentioned in the comment from the script above, the search result field is named differently in the webhookdata JSON payload between the runbook remediation and the custom webhook remediation. When triggered by the runbook remediation, the search result field is passed into the runbook as "SearchResults". When triggered by the custom webhook method, it is called "SearchResult". This is a known bug at the time of writing. This behavior may change in the future.

OMS Alert Webhook Remediation vs. Runbook Remediation

Since Azure Automation natively supports webhooks, and runbook remediation itself uses a webhook to trigger the runbook, an alternative when creating OMS alert rules is to use custom webhook remediation to trigger your remediation runbooks instead of the native runbook remediation option.

Comparing the two methods, the custom webhook option is more feature-rich and flexible, delivering the following advantages:

Additional information is passed into the runbook

When using custom webhooks, the following information is included in the default JSON payload, which is not available when using runbook remediation:

  • Workspace Id
  • Alert rule name
  • Search query
  • Search Interval Start Time UTC
  • Search Interval End Time UTC
  • Alert Threshold Operator
  • Alert Threshold Value
  • Result Count
  • Search Interval in Seconds
  • Link to Search Result
  • Description

In some cases, the inclusion of the search query can be particularly useful. When using runbook remediation, you will most likely rely on the search results to retrieve the information you need in the runbook. However, if your OMS alert rule detects a condition where the number of results is 0 (i.e. it detects a missing event), your search results will be empty. Because the custom webhook passes the search query used by the OMS alert rule, you can at least extract some information from the query itself (the Event ID, for example).

Ability to manually maintain the webhook

When the OMS alert rule creates the webhook for the remediation runbook, the webhook expiry date is set to 1 year from the creation date. The webhook URL is not visible to you.

By using the custom webhook remediation, you are responsible for creating the webhook. Therefore, you will know the webhook URL and you can set the webhook to have a longer validity period.
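For example (a sketch using the AzureRM.Automation module with placeholder names), you could create your own webhook with a five-year validity period and capture its URL at creation time:

# Create a webhook for the remediation runbook with a longer expiry
$Webhook = New-AzureRmAutomationWebhook -Name 'OMSAlert-RestartWinService-Webhook' `
    -RunbookName 'OMSAlert-RestartWinService' -IsEnabled $true `
    -ExpiryTime (Get-Date).AddYears(5) -Force `
    -ResourceGroupName 'MyResourceGroup' -AutomationAccountName 'MyAutomationAccount'

# The webhook URI is only returned at creation time, so record it securely now
$Webhook.WebhookURI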

Ability to share webhooks with multiple alert rules

When using the runbook remediation, if you use the same runbook on multiple OMS alert rules, multiple webhooks will be created. By using the custom webhook remediation, you can share the same webhook with multiple OMS alert rules.

Ability to include additional data by using custom JSON payload

Custom webhook remediation allows you to specify custom JSON payload. This feature gives you the ability to pass additional information.

For example, consider a scenario where you create two OMS alert rules to detect service stopped events for two separate services on two computer groups, and you have developed a single runbook to restart the service as a remediation. If these two groups of servers do not use the same administrator credential, custom webhook remediation lets you specify the administrator credential name via the custom JSON payload, so you do not have to create two separate runbooks with the credential name hardcoded in each.

Ability to use runbooks located in Azure Automation accounts that are not associated with the OMS workspace

In order to use the runbook remediation, the Azure Automation account that is hosting the runbook must be linked to the OMS workspace. The link between an Azure Automation account and an OMS workspace has the following requirements/limitations:

  • The OMS workspace and Azure Automation account must be located in the same Azure subscription, resource group, and region.
  • An Azure Automation account can only be linked to one OMS workspace, and vice versa.

By using the custom webhook remediation, not only do you have the ability to interact with any system that supports incoming webhooks, but you can also trigger runbooks located in Azure Automation accounts that are not linked to the OMS workspace.

Note: Despite the fact that custom webhook remediation is more flexible and feature-rich, it does require more configuration and administration effort. The configuration process is not as straightforward as using runbook remediation. When you are developing your OMS alert rules, choose carefully based on your requirements.

Comparing OMS Alerts with SCOM Alerts

Today, OMS does not replace SCOM. It is more appropriate to view OMS and SCOM as two products that complement each other. From a monitoring and alerting point of view, SCOM is definitely more feature-rich at this stage. OMS Alerting has improved significantly since it was first introduced, and it will continue to improve at a cloud cadence.

If OMS Alerting does not meet your requirements, you can submit your suggestions or report bugs to the OMS Log Analytics User Voice: https://feedback.azure.com/forums/267889-azure-operational-insights

Summary

At the start of this chapter, we introduced you to the OMS Alert Management solution and discussed how to search alerts published by SCOM and how to configure alert collection from Nagios and Zabbix. After that, we explored OMS Alerting and Alert Remediation using Azure Automation runbooks.

We discussed how you can manage OMS Alerting on your mobile devices through the use of the OMS app and we walked you through how to manage alerts using the Alert API. Towards the end of the chapter, we discussed the difference between Alert Webhook Remediation and Runbook Remediation before closing out with a comparison of OMS alerts versus SCOM alerts.