ALARM SERVICE IN THE OPENCLOVIS SAFPLUS PLATFORM

Introduction

An alarm serves as a notification of a specific event, which may or may not indicate an error. The Alarm Service provides a comprehensive mechanism for defining managed resources, detecting alarms, reporting their status, and managing alarm correlation. This article explores the fundamental concepts of alarms within the OpenClovis SAFplus Platform.

What is the Alarm Service?

An alarm is a notification of a specific event, which may or may not represent an error. The application that observes the event reports it as an alarm. In the ASP, the Alarm Service lets you define the set of managed resources and the alarms that can occur on them, and it supports alarm reporting, alarm clearing, alarm soaking, and alarm correlation. The ASP Alarm service enables applications to notify the north-bound entity about erroneous conditions that occur on managed resources. The service is compliant with the ITU-T X.733 specification and models probable cause, severity, and category in line with it. The ASP Alarm service also provides filters that can be applied to an Alarm, with mechanisms to specify a soaking time, a generation rule, and a suppression rule.

A burst of events may occur, each of which would be reported as an alarm. To handle this, a soaking time is defined for the alarm. The soaking time is the period for which the alarm condition is observed before the alarm is actually reported.

The generation rule and suppression rule are part of the alarm correlation feature. Whether an alarm is reported therefore depends on the status of the other alarms named in these rules. Both rules are evaluated when the alarm is reported. Within a rule, the alarms are combined with an AND or OR relation.

The usage model of the Alarm service is a producer-consumer model. Applications that raise Alarms are the producers, and management applications are the consumers. Any application can raise an Alarm when it detects an erroneous condition on a managed resource. The usage model resembles a publisher-subscriber model because application components and management applications are unaware of each other, and each management application receives Alarms only after subscribing to the Alarm event channel.
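
The soaking time described above can be illustrated with a small sketch. This is not the ASP implementation or API: the class name SoakingFilter, its fields, and the choice to restart soaking whenever the condition disappears are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SoakingFilter:
    """Hypothetical helper: a condition must persist for soak_seconds before it is reported."""
    soak_seconds: float
    first_seen: Optional[float] = None  # time at which the condition was first observed

    def observe(self, condition_present: bool, now: float) -> bool:
        """Return True once the condition has persisted for the full soaking time."""
        if not condition_present:
            self.first_seen = None  # assumption: soaking restarts if the condition goes away
            return False
        if self.first_seen is None:
            self.first_seen = now
        return now - self.first_seen >= self.soak_seconds


# A short glitch at t=3 resets the timer, so only the sample at t=9 passes the filter.
soak = SoakingFilter(soak_seconds=5.0)
samples = [(0.0, True), (2.0, True), (3.0, False), (4.0, True), (8.0, True), (9.0, True)]
decisions = [soak.observe(present, t) for t, present in samples]
```

Passing the time in as an argument, rather than reading a clock inside the filter, keeps the sketch deterministic and easy to test.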

Alarm Characteristics

The mandatory parameters of the X.733 specification are incorporated as attributes of an Alarm. The category, probable cause, and perceived severity are the mandatory parameters of an Alarm notification. There are five Alarm categories, and the probable cause parameter identifies the likely cause of the Alarm within its category. The severity parameter defines six severity levels: cleared, indeterminate, warning, minor, major, and critical. The following table maps each category to its list of probable causes.

Communication	Quality of service	Processing Error	Equipment	Environmental
Loss of signal	Response time excessive	Storage capacity problem	Power problem	Temperature unacceptable
Loss of frame	Queue size exceeded	Version mismatch	Timing problem	Humidity unacceptable
Framing error	Bandwidth reduced	Corrupt data	Processor problem	Heating/ventilation/cooling system problem
Local node transmission error	Retransmission rate excessive	CPU cycles limit exceeded	Dataset or modem error	Fire detected
Remote node transmission error	Threshold crossed	Software error	Multiplexer problem	Flood detected
Call establishment error	Performance degraded	Software program error	Receiver failure	Toxic leak detected
Degraded signal	Congestion	Software program abnormally terminated	Transmitter failure	Leak detected
Communications subsystem failure	Resource at or nearing capacity	File error	Receive failure	Pressure unacceptable
Communications protocol error		Out of memory	Transmit failure	Excessive vibration
LAN error		Underlying resource unavailable	Output device error	Material supply exhausted
DTE-DCE interface error		Application subsystem failure	Input device error	Pump failure
		Configuration or customization error	I/O device error	Enclosure door open
			Equipment malfunction
			Adapter error
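
As a rough illustration of how these attributes fit together, the sketch below models the six severity levels, the five categories, and an excerpt of the table as plain Python data. The names Severity, Category, PROBABLE_CAUSES, and is_valid_cause are hypothetical and are not ASP identifiers, and only two probable causes per category are copied from the table.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Severity(Enum):
    CLEARED = "cleared"
    INDETERMINATE = "indeterminate"
    WARNING = "warning"
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


class Category(Enum):
    COMMUNICATION = "Communication"
    QUALITY_OF_SERVICE = "Quality of service"
    PROCESSING_ERROR = "Processing Error"
    EQUIPMENT = "Equipment"
    ENVIRONMENTAL = "Environmental"


# Excerpt of the table above: two probable causes per category.
PROBABLE_CAUSES: Dict[Category, FrozenSet[str]] = {
    Category.COMMUNICATION: frozenset({"Loss of signal", "Loss of frame"}),
    Category.QUALITY_OF_SERVICE: frozenset({"Response time excessive", "Queue size exceeded"}),
    Category.PROCESSING_ERROR: frozenset({"Storage capacity problem", "Version mismatch"}),
    Category.EQUIPMENT: frozenset({"Power problem", "Timing problem"}),
    Category.ENVIRONMENTAL: frozenset({"Temperature unacceptable", "Humidity unacceptable"}),
}


def is_valid_cause(category: Category, probable_cause: str) -> bool:
    """Check that a probable cause belongs to the given category (within this excerpt)."""
    return probable_cause in PROBABLE_CAUSES[category]
```

A check such as is_valid_cause(Category.EQUIPMENT, "Power problem") could be run when an alarm profile is defined, so that a probable cause is never paired with the wrong category.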

Alarm State Definition

An Alarm is triggered after an erroneous condition is detected. From the time the Alarm is raised until it is reported, it goes through the following states:

Alarm Raised State: The Alarm is in the raised state when the application, after detecting the erroneous condition, raises the Alarm.
Alarm Soaked State: After the Alarm is soaked for a certain period of time, it enters the Alarm soaked state.
Alarm Generated State: After the Alarm satisfies the generation rule, it enters the Alarm generated state. If the Alarm does not satisfy the generation rule because of some dependent Alarm, it remains in the soaked state until the dependent Alarms are generated.
Alarm Masked State: If the Alarm is masked by the masking logic, it enters the Alarm masked state. This happens in either of the following cases:
- An Alarm is already raised on a managed resource higher in the hierarchy, in which case the Alarm raised on the lower resource is masked.
- The suppression rule is satisfied for the Alarm.
Alarm Reported State: If the Alarm is not masked, it enters the Alarm reported state.

Raising an Alarm on a Managed Resource

The managed resources that raise Alarms are modeled as Alarm Managed Objects in the ASP. A managed resource can be a hardware or a software resource. The Alarms on a managed resource correspond to the probable cause attribute of its Alarm managed object. An application can raise an Alarm on a managed resource through the ASP Alarm service when it detects a deviation from normal operation.

The Alarm service also allows the application to pass additional context information when raising the Alarm. After the Alarm is successfully reported, the Alarm Manager returns a unique Alarm handle for the raised Alarm. This handle is useful for service-impacting Alarms, when the failure is reported to the Availability Management Framework (AMF) service.

Alarm Filters

The filtering mechanism provided by the Alarm service prevents spurious Alarms from flooding the Alarm Manager. It also applies constraints that an Alarm must satisfy before it is reported. The filters fall into four kinds: soaking, masking, generation rules, and suppression rules.

Soaking: Soaking is the time duration for which the erroneous condition must persist before it is reported as an Alarm. Soaking allows an Alarm to be monitored for a specific time period.
Generation and Suppression Rules: An application can specify generation and suppression rules to model dependencies between Alarms.
- A generation rule specifies the following:
  - The Alarm that needs to be generated.
  - Set of dependent Alarms.
  - The condition that has to be satisfied for the Alarm to be generated. This condition is specified as a logical relationship between the dependent Alarms.
- A suppression rule specifies the following:
  - The Alarm that needs to be suppressed.
  - Set of dependent Alarms.
  - The condition that needs to be met for the Alarm to be suppressed. This condition is specified as a logical relationship between the dependent Alarms.
Alarm Masking: The Alarm service uses the containment relationship of the Alarm MSOs to implement hierarchical masking. The alarm-masking algorithm masks all Alarms of the same category within a subtree in the hierarchy.
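
The generation and suppression rules above can be sketched as a small data structure plus an evaluation step. This is a minimal illustration under stated assumptions, not the ASP rule engine: CorrelationRule, Relation, and passes_correlation are hypothetical names, alarms are identified by their probable cause strings, and "active" is taken to mean the set of dependent Alarms that are currently asserted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Set


class Relation(Enum):
    AND = "and"
    OR = "or"


@dataclass(frozen=True)
class CorrelationRule:
    """Hypothetical rule: a target alarm, its dependent alarms, and their logical relation."""
    target: str
    dependents: FrozenSet[str]
    relation: Relation

    def is_satisfied(self, active: Set[str]) -> bool:
        hits = [dep in active for dep in self.dependents]
        return all(hits) if self.relation is Relation.AND else any(hits)


def passes_correlation(active: Set[str],
                       generation: Optional[CorrelationRule] = None,
                       suppression: Optional[CorrelationRule] = None) -> bool:
    """Evaluate both rules: the generation rule must hold and the suppression rule must not."""
    if generation is not None and not generation.is_satisfied(active):
        return False
    if suppression is not None and suppression.is_satisfied(active):
        return False
    return True


# Illustrative rules built from probable causes in the table; the pairings are made up.
gen = CorrelationRule("Performance degraded",
                      frozenset({"Queue size exceeded", "Congestion"}), Relation.AND)
sup = CorrelationRule("Loss of frame",
                      frozenset({"Loss of signal", "Power problem"}), Relation.OR)

generated = passes_correlation({"Queue size exceeded", "Congestion"}, generation=gen)
held_back = passes_correlation({"Congestion"}, generation=gen)
suppressed = not passes_correlation({"Power problem"}, suppression=sup)
```

Representing both rule types with the same structure reflects the text: each names an Alarm, a set of dependent Alarms, and a logical condition over them. Only the consequence of the condition differs.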

Event Generation

An event is published for each reported Alarm. Interested services can obtain these events by using the alarm client event subscribe API. The payload of the event consists of the Alarm information and an Alarm handle. The Alarm information comprises probable cause, category, severity, specific problem, resource on which the Alarm was reported, Alarm State – raised or cleared, time stamp, and additional information of the Alarm. The Alarm handle is unique for every node and is also known as the Notification Identifier. The mapping of the Alarm handle to the reported Alarm is maintained in the Alarm Manager. The Alarm Service depends on the Event Service for the delivery of the event.
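
To make the event payload and the subscription model concrete, here is a minimal in-process sketch. It does not use the ASP alarm client or Event Service APIs: AlarmEvent, AlarmChannel, the field types, and the resource path are assumptions chosen for illustration, with the field list taken from the paragraph above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class AlarmEvent:
    """Hypothetical payload: the Alarm handle plus the Alarm information listed above."""
    handle: int            # notification identifier
    probable_cause: str
    category: str
    severity: str
    specific_problem: str  # type is an assumption
    resource: str          # resource on which the Alarm was reported
    asserted: bool         # True = raised, False = cleared
    timestamp: float
    additional_info: bytes = b""


class AlarmChannel:
    """Stand-in for the Alarm event channel: delivers each event to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[AlarmEvent], None]] = []

    def subscribe(self, callback: Callable[[AlarmEvent], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, event: AlarmEvent) -> None:
        for callback in self._subscribers:
            callback(event)


# A management application subscribes, then receives every Alarm published afterwards.
channel = AlarmChannel()
received: List[AlarmEvent] = []
channel.subscribe(received.append)
channel.publish(AlarmEvent(handle=1, probable_cause="Loss of signal",
                           category="Communication", severity="major",
                           specific_problem="", resource="/chassis/0/port/3",
                           asserted=True, timestamp=0.0))
```

The publisher never refers to a particular subscriber, which mirrors the point that application components and management applications are unaware of each other.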

Alarm Life Cycle

The life cycle describes the stages an Alarm passes through from the time it is raised until it is reported:

The Alarm enters the raised state when an application detects an erroneous condition and raises the Alarm. A raised Alarm can have an assert or clear status.
After the Alarm is soaked, it enters the soaked state.
When the Alarm satisfies the generation rule, it enters the generated state. This leads to the generation of other Alarms that depend on this Alarm. If the Alarm does not satisfy the generation rule, it stays in the soaked state.
If the status of the Alarm is clear after it satisfies the generation rule, the Alarm is reported.
If the status of the Alarm is assert after the generated state, the Alarm is checked against the hierarchical masking rule. If the parent MO has an Alarm of the same category, the Alarm enters the masked state. Otherwise, the Alarm enters the reported state.
Alarms in the masked state move into a reported state after the Alarm on the parent MO of the same category is cleared.
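
The life cycle above can be read as a small state machine. The sketch below is one possible encoding, assuming each transition depends only on four boolean inputs. The names State, next_state, and settle are hypothetical, and the suppression rule is left out to keep the example short.

```python
from enum import Enum


class State(Enum):
    RAISED = "raised"
    SOAKED = "soaked"
    GENERATED = "generated"
    MASKED = "masked"
    REPORTED = "reported"


def next_state(state: State, *, asserted: bool, soaked: bool,
               rule_ok: bool, parent_alarm_same_category: bool) -> State:
    """Return the state that follows `state` under the given conditions."""
    if state is State.RAISED:
        return State.SOAKED if soaked else State.RAISED
    if state is State.SOAKED:
        return State.GENERATED if rule_ok else State.SOAKED
    if state is State.GENERATED:
        if not asserted:
            return State.REPORTED  # a clear status is reported without a masking check
        return State.MASKED if parent_alarm_same_category else State.REPORTED
    if state is State.MASKED:
        # Leaves the masked state once the parent's Alarm of the same category is cleared.
        return State.MASKED if parent_alarm_same_category else State.REPORTED
    return state  # REPORTED is final in this sketch


def settle(**conditions: bool) -> State:
    """Apply transitions from RAISED until the state stops changing."""
    state = State.RAISED
    while True:
        following = next_state(state, **conditions)
        if following is state:
            return state
        state = following


final = settle(asserted=True, soaked=True, rule_ok=True, parent_alarm_same_category=True)
```

Here final is State.MASKED. Calling settle again with parent_alarm_same_category=False yields State.REPORTED, which corresponds to the last step of the life cycle.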

Alarm Flow

This figure illustrates the Alarm flow from the detection of the erroneous condition to the reporting of the Alarm.

The application detects the erroneous condition and raises the Alarm.
The Alarm client soaks the Alarm and, if the erroneous condition persists, moves the Alarm from the raised state to the soaked state.
When the Alarm satisfies the generation rule, the Alarm client moves the Alarm from the soaked state to the generated state.
After the Alarm is generated, it is passed to the Alarm server. If the Alarm is not masked, or if the status of the Alarm is clear, the Alarm server performs the following:
1. Generates a unique Alarm handle.
2. Updates the Alarm status in COR.
3. Publishes an event as described in the Event Generation section.
4. Reports the Alarm to the Fault Manager if it is a non-service-impacting Alarm.
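
The four server-side steps can be sketched as follows. This is an illustrative stand-in rather than the Alarm server itself: AlarmServerSketch and its callbacks are hypothetical, a dictionary takes the place of COR, a simple counter takes the place of the real handle allocation, and the resource paths are made up.

```python
import itertools
from typing import Callable, Dict, Optional, Tuple

AlarmKey = Tuple[str, str]  # (resource, probable cause)


class AlarmServerSketch:
    def __init__(self, publish: Callable[[int, AlarmKey, bool], None],
                 notify_fault_manager: Callable[[int], None]) -> None:
        self._next_handle = itertools.count(1)
        self.status: Dict[AlarmKey, bool] = {}   # stands in for the Alarm status kept in COR
        self.handles: Dict[int, AlarmKey] = {}   # handle-to-Alarm mapping
        self._publish = publish
        self._notify_fault_manager = notify_fault_manager

    def process(self, key: AlarmKey, *, asserted: bool, masked: bool,
                service_impacting: bool) -> Optional[int]:
        """Handle a generated Alarm; return its handle, or None if it stays masked."""
        if masked and asserted:
            return None                           # proceed only if unmasked or status is clear
        handle = next(self._next_handle)          # 1. generate a unique handle
        self.handles[handle] = key
        self.status[key] = asserted               # 2. update the Alarm status
        self._publish(handle, key, asserted)      # 3. publish an event
        if not service_impacting:
            self._notify_fault_manager(handle)    # 4. report non-service-impacting Alarms
        return handle


published, faults = [], []
server = AlarmServerSketch(lambda h, k, a: published.append((h, k, a)), faults.append)
first = server.process(("/chassis/0", "Power problem"),
                       asserted=True, masked=False, service_impacting=False)
second = server.process(("/chassis/0/port/3", "Loss of signal"),
                        asserted=True, masked=True, service_impacting=False)
```

In this run the first Alarm receives a handle, is published, and is passed to the fault manager callback. The second is asserted and masked, so none of the four steps run for it.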

Conclusion

The Alarm Service provides a structured and efficient approach to alarm management, ensuring that critical system events are properly detected, categorized, and reported. Implementing filtering mechanisms such as soaking, masking, generation, and suppression rules helps maintain system stability and high availability. Please reach out to us at support@openclovis.org to discover and apply the alarm service to your application.
