Service Health Monitoring

Background

Monitoring service health goes beyond standard system monitoring, which focuses only on availability, utilization, and the other basic metrics most system administrators attend to in their day-to-day work. Service Health Monitoring involves managing the overall health of a service or an application and all of its underlying infrastructure components.

Service Health

Service health is a combination of Availability, Performance, Alerts, and Analytics. It indicates whether a service is healthy or not. But what does healthy really mean in the context of service health monitoring? Can a service be considered healthy if it is available but performing badly? If the performance is good, does that make the service healthy? And if alerts are flooding in while the service is available and performing well, what does that indicate? The truth is that no single factor determines whether a service is healthy. When monitoring service health, you need a 360° view of the service.

Availability Monitoring

Availability Monitoring determines whether an application, a particular service, or an infrastructure component is up and running. It is the most fundamental requirement when monitoring systems and applications. However, availability monitoring is not useful without the service context and its relationship to Service Level Objectives (SLOs). An SLO defines what the availability should be (for example, as a percentage over time) and how it should be measured.
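To make "percentage over time" concrete, the following is a minimal sketch of measuring availability over a window of probe results and comparing it with an SLO target. The `Probe` shape, the one-minute probe interval, and the 99.5% and 99.9% targets are assumptions made for illustration; they are not IT-Conductor data structures or defaults.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One availability check result (hypothetical shape)."""
    timestamp: int  # seconds since the start of the measurement window
    up: bool        # True when the check found the service reachable


def availability_pct(probes):
    """Percentage of probes in the window that found the service up."""
    if not probes:
        raise ValueError("no probes in the measurement window")
    up_count = sum(1 for p in probes if p.up)
    return 100.0 * up_count / len(probes)


def meets_slo(probes, target_pct):
    """True when the measured availability reaches the SLO target."""
    return availability_pct(probes) >= target_pct


# Assumed data: 1,000 one-minute probes, two of which failed.
probes = [Probe(timestamp=60 * i, up=(i not in (100, 101))) for i in range(1000)]
measured = availability_pct(probes)

print(f"measured availability: {measured:.2f}%")    # 99.80%
print("meets 99.5% SLO:", meets_slo(probes, 99.5))  # True
print("meets 99.9% SLO:", meets_slo(probes, 99.9))  # False
```

The same measurement passes one target and fails the other, which is why the SLO has to state both the target and how availability is measured.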

Service availability can be very different from component availability. In the context of monitoring SAP, you must understand that the application is just one portion of the entire system landscape.

Following the three-tier architecture, monitoring SAP system landscapes involves monitoring the presentation, application, and database layers. Monitoring component availability, on the other hand, only involves keeping an eye on a specific component, independent of the other components within the enterprise infrastructure. Monitoring service availability is therefore more in-depth than monitoring component availability, and more complex too.

For example, SAP ERP, an SAP application, may have uninterrupted availability for end-users while one or more of its underlying infrastructure components, such as application servers, message servers, enqueue servers, and database servers, are down at any given time. How is this possible? Aren't all of these infrastructure components required to be running before you can say that the service is functional? It is possible when the environment has high availability (HA) configured. With an HA configuration, infrastructure components have redundancies so that every single point of failure is covered. This is where the monitoring complexities emerge. With service health monitoring, these requirements can be properly distinguished and managed accordingly.
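The difference between component availability and service availability under redundancy can be sketched as follows. This is a simplified illustration, not IT-Conductor's implementation: the instance names and the rule "a role is covered when at least one of its instances is running" are assumptions, and real HA setups (enqueue replication, for example) have more nuanced failover semantics.

```python
# Hypothetical status snapshot: role -> {instance name: is it running?}
landscape = {
    "application server": {"app01": True, "app02": False},
    "message server": {"ms01": True},
    "enqueue server": {"enq01": False, "enq02": True},
    "database server": {"db01": True, "db02": True},
}


def down_components(status):
    """Names of the individual instances that are currently down."""
    return sorted(
        name
        for instances in status.values()
        for name, up in instances.items()
        if not up
    )


def service_available(status):
    """Assumed rule: the service is up when every role has a running instance."""
    return all(any(instances.values()) for instances in status.values())


def single_points_of_failure(status):
    """Roles that currently depend on exactly one running instance."""
    return sorted(
        role
        for role, instances in status.items()
        if sum(instances.values()) == 1
    )


print("components down:", down_components(landscape))
print("service available:", service_available(landscape))
print("roles with no remaining redundancy:", single_points_of_failure(landscape))
```

In this snapshot two components are down and the service is still available, but three roles are running without a spare. Component-level monitoring alone would report two failures; service-level monitoring adds the facts that users are unaffected and that redundancy is degraded.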

IT-Conductor can monitor the availability of user-defined services and their supporting infrastructure components with more flexibility. Using the platform's parallel processing engine, service availability monitoring is simplified by depicting the correlation between the individual components of the operating system, so that you can see at a glance whether a service is available, in relation to its dependent components as well.

Performance Monitoring

Performance monitoring follows the same tier concept: you need to look at all the layers to ensure that the service performs well. Taking a deep dive into performance monitoring, let's take a look at the image below. In these tiers (services, applications, and infrastructure), you can see that performance is just one of the surrounding pillars. This signifies that service performance monitoring or management extends to the overall state of an environment.

In the context of monitoring performance in SAP systems, IT-Conductor helps enhance performance management in the following areas:

  • End-user experience services can track specific sets of key performance indicators for a custom set of metrics, such as user, transactions, locations, dialog vs HTTP, etc.

  • Service discovery of application-aware components and relationships

  • Composite services assembled from other services and monitors

  • Easy configuration of target availability, response times, and notification when the goals are not met

  • Drill-down into services to discover the root cause

Performance Monitoring also involves measuring performance against a customized set of key performance indicators (KPIs) over a period of time. This requires synchronized data series. However, a single system-wide response time can be a misleading indicator of performance. To put it into perspective, a system may report a 0.5-second response time while some users still complain constantly about bad response times.
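A small worked example shows how this can happen. The numbers below are made up for illustration, and splitting by user location is only one of several possible breakdowns: the system-wide average is 0.5 seconds, yet one group of users sees response times several times worse.

```python
from collections import defaultdict
from statistics import fmean

# Assumed samples: (user location, dialog response time in seconds).
samples = [("HQ", 0.3)] * 90 + [("Branch", 2.3)] * 10

# System-wide average across all dialog steps.
overall = fmean([seconds for _, seconds in samples])

# The same data, averaged per location.
by_location = defaultdict(list)
for location, seconds in samples:
    by_location[location].append(seconds)
per_location = {loc: fmean(values) for loc, values in by_location.items()}

print(f"overall average: {overall:.2f} s")
for location, average in sorted(per_location.items()):
    print(f"{location}: {average:.2f} s")
```

The overall figure looks healthy because the 90 fast steps outweigh the 10 slow ones, which is the situation the complaints come from.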

A better way is to create groups of transactions by any combination of attributes such as TCODE, USER, TASK TYPE, APP SERVER, USER TERMINAL, etc. These groups may represent business processes, such as Sales-Order-to-Cash, with service-level objectives set up for average response times across orders (VA01, VA02, VA03), delivery (VL01N, VL02N, VL03N), shipping (VT01N, VT02N, VT03N), and billing (VF01, VF02, VF03). Then aggregate their KPIs: the default would still be dialog response time, but it could also be database response time, network GUI time, etc. Monitor and manage the groups automatically against their service-level objectives, and trigger actions such as alerts and notifications based on policy-based exceptions. Collect all of this alongside other application, database, and system metrics and events, so that correlation can be used when needed during performance analysis.
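The grouping and aggregation described above can be sketched as follows. The transaction groups come from the Sales-Order-to-Cash example in the text; the response-time records, the SLO targets, and the function names are hypothetical, and this is not how IT-Conductor is configured internally.

```python
from collections import defaultdict
from statistics import fmean

# Transaction groups from the Sales-Order-to-Cash example.
GROUPS = {
    "orders": {"VA01", "VA02", "VA03"},
    "delivery": {"VL01N", "VL02N", "VL03N"},
    "shipping": {"VT01N", "VT02N", "VT03N"},
    "billing": {"VF01", "VF02", "VF03"},
}
TCODE_TO_GROUP = {t: g for g, tcodes in GROUPS.items() for t in tcodes}

# Assumed SLO targets: average dialog response time in seconds.
SLO_SECONDS = {"orders": 1.0, "delivery": 1.0, "shipping": 1.5, "billing": 1.0}

# Assumed records: (TCODE, dialog response time in seconds).
records = [
    ("VA01", 0.5), ("VA02", 0.7), ("VA03", 0.25),
    ("VL01N", 1.5), ("VL02N", 1.0),
    ("VT01N", 1.25),
    ("VF01", 0.5), ("VF03", 0.75),
    ("SE16", 3.0),  # belongs to no group, so it is ignored
]


def group_averages(rows):
    """Average response time per transaction group."""
    buckets = defaultdict(list)
    for tcode, seconds in rows:
        group = TCODE_TO_GROUP.get(tcode)
        if group is not None:
            buckets[group].append(seconds)
    return {group: fmean(values) for group, values in buckets.items()}


def slo_exceptions(averages, slo):
    """Groups whose average response time exceeds their SLO target."""
    return sorted(g for g, avg in averages.items() if avg > slo[g])


averages = group_averages(records)
for group, avg in sorted(averages.items()):
    print(f"{group}: {avg:.2f} s (target {SLO_SECONDS[group]:.2f} s)")
print("SLO exceptions:", slo_exceptions(averages, SLO_SECONDS))
```

With these sample values only the delivery group breaches its target, which points the analysis at one step of the business process rather than at the system as a whole.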

There is a multitude of data to analyze, which makes issues difficult to diagnose when they occur. In troubleshooting, correlation plays a huge part in finding the root cause of a problem.

IT-Conductor makes it easier for you to associate this information through the expandable service grid levels.

Alerts

IT-Conductor manages alerts using policy-based exceptions: alerts can be filtered, time-synchronized, and automatically recovered, and notifications are targeted to the right analyst, which reduces the mean time to repair (MTTR).

Alerts are configured in IT-Conductor using the concept of thresholds and overrides. Once the IT-Conductor (ITC) gateway is set up and the required configurations for the monitored systems are complete, the platform starts receiving different metrics. Using thresholds and overrides, ITC can send alert notifications via e-mail, provided that the user has subscribed to receive notifications for that particular service.
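The general idea of thresholds, overrides, and subscriptions can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the metric names, the system IDs, the e-mail address, and the rule that an override replaces the default threshold for one system. It does not reflect IT-Conductor's actual configuration model or API.

```python
# Assumed default thresholds, applied to every monitored system.
DEFAULT_THRESHOLDS = {
    "cpu_utilization_pct": 90.0,
    "dialog_response_time_s": 1.0,
}

# Assumed overrides: a stricter response-time limit for one system.
OVERRIDES = {("PRD", "dialog_response_time_s"): 0.8}

# Assumed subscriptions: who receives e-mail for which system.
SUBSCRIPTIONS = {"PRD": ["basis-oncall@example.com"], "DEV": []}


def effective_threshold(system, metric):
    """The override wins when one exists, otherwise the default applies."""
    return OVERRIDES.get((system, metric), DEFAULT_THRESHOLDS[metric])


def recipients_for(system, metric, value):
    """Subscribers to notify; empty when there is no breach or no subscriber."""
    if value <= effective_threshold(system, metric):
        return []
    return list(SUBSCRIPTIONS.get(system, []))


# 0.9 s breaches the PRD override (0.8 s) but not the DEV default (1.0 s).
print(recipients_for("PRD", "dialog_response_time_s", 0.9))
print(recipients_for("DEV", "dialog_response_time_s", 0.9))
# A breach on DEV produces no e-mail because nobody subscribed.
print(recipients_for("DEV", "cpu_utilization_pct", 95.0))
```

Separating the threshold decision from the subscription lookup mirrors the two conditions in the text: a metric has to cross its effective threshold, and a user has to be subscribed to that service, before a notification goes out.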

For more information, you may read:

Analytics

With all the information that the platform receives from the monitored systems, we designed and built IT-Conductor to display real-time metrics in the service grid.

In the sample service grid below, you can see how the service grid can be expanded down to the most granular component in your system landscape. This makes troubleshooting hassle-free, since you don't need to log in and out of systems and run scripts or commands just to get the information (metric) that you need.

Moreover, you can preview the historical chart by clicking directly in the service grid.

These charts can be customized depending on how the user wants to consume the information.

Here are some of the reports that you can generate on the platform:

