ALARM SERVICE IN THE OPENCLOVIS SAFPLUS PLATFORM


Introduction

An alarm is a notification of a specific event, which may or may not indicate an error. The Alarm Service provides a mechanism for defining managed resources, detecting alarms, reporting their status, and correlating alarms. This article covers the fundamental alarm concepts in the OpenClovis SAFplus Platform.

What is the Alarm Service?

An alarm is a notification of a specific event and may or may not represent an error. The application that observes the event reports it as an alarm. In the platform (referred to below as the ASP), the Alarm Service covers the definition of the managed resources and of the alarms that can occur on them, as well as alarm reporting, clearing, soaking, and correlation. Applications use the ASP Alarm service to notify the north-bound entity about erroneous conditions on managed resources. The service complies with the ITU-T X.733 specification and models probable cause, severity, and category accordingly. It also provides filters that can be applied to an alarm: a soaking time, a generation rule, and a suppression rule.

  1. A burst of events may need to be reported as an alarm. To handle this, a soaking time is defined for the alarm: the period for which the condition is observed before the alarm is actually reported.
  2. The generation rule and suppression rule are part of the alarm correlation feature. Whether an alarm is reported depends on the status of the alarms named in these rules. Both rules are evaluated when the alarm is reported, and within each rule the alarms are combined with an AND or OR relation.

The usage model of the Alarm service is a producer-consumer model. Applications that raise alarms are the producers, and management applications are the consumers. Any application can raise an alarm when it detects an erroneous condition on a managed resource. The model resembles a publisher-subscriber model because the application components and the management applications are unaware of each other, and each management application receives alarms only after subscribing to the alarm event channel.
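
The soaking time from item 1 can be illustrated with a small model. This is only a sketch in Python, not the SAFplus API: the class name `SoakingTimer`, its fields, and the explicit `now` argument are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SoakingTimer:
    """Hypothetical helper: reports a condition only after it has
    persisted for `soak_seconds`."""
    soak_seconds: float
    first_seen: Optional[float] = None  # time the condition was first observed

    def observe(self, condition_present: bool, now: float) -> bool:
        """Return True when the alarm should be reported at time `now`."""
        if not condition_present:
            # The condition disappeared while soaking: reset, report nothing.
            self.first_seen = None
            return False
        if self.first_seen is None:
            self.first_seen = now
        return now - self.first_seen >= self.soak_seconds


timer = SoakingTimer(soak_seconds=5.0)
# A transient condition that vanishes after 2 seconds is never reported.
transient = [timer.observe(True, 0.0), timer.observe(False, 2.0)]
# A condition that persists for the full soaking time is reported.
persistent = [timer.observe(True, 10.0), timer.observe(True, 12.0),
              timer.observe(True, 15.0)]
```

Passing the time in explicitly keeps the sketch deterministic; a real implementation would presumably rely on a timer facility instead.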

Alarm Characteristics

The mandatory parameters of the X.733 specification are incorporated as attributes of an Alarm: category, probable cause, and perceived severity. There are five Alarm categories, and the probable cause parameter defines the probable cause of the Alarm. The severity parameter defines six severity levels: cleared, indeterminate, warning, minor, major, and critical. The mapping between each category and its list of probable causes is as follows.

  • Communication: Loss of signal, Loss of frame, Framing error, Local node transmission error, Remote node transmission error, Call establishment error, Degraded signal, Communications subsystem failure, Communications protocol error, LAN error, DTE-DCE interface error
  • Quality of service: Response time excessive, Queue size exceeded, Bandwidth reduced, Retransmission rate excessive, Threshold crossed, Performance degraded, Congestion, Resource at or nearing capacity
  • Processing error: Storage capacity problem, Version mismatch, Corrupt data, CPU cycles limit exceeded, Software error, Software program error, Software program abnormally terminated, File error, Out of memory, Underlying resource unavailable, Application subsystem failure, Configuration or customization error
  • Equipment: Power problem, Timing problem, Processor problem, Dataset or modem error, Multiplexer problem, Receiver failure, Transmitter failure, Receive failure, Transmit failure, Output device error, Input device error, I/O device error, Equipment malfunction, Adapter error
  • Environmental: Temperature unacceptable, Humidity unacceptable, Heating/ventilation/cooling system problem, Fire detected, Flood detected, Toxic leak detected, Leak detected, Pressure unacceptable, Excessive vibration, Material supply exhausted, Pump failure, Enclosure door open
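
The three mandatory attributes could be represented as follows. This is a hedged sketch in Python rather than the platform's own data types: `Category`, `Severity`, `PROBABLE_CAUSES`, and `AlarmAttributes` are names invented for the example, and the mapping holds only a few of the probable causes listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    COMMUNICATION = "Communication"
    QUALITY_OF_SERVICE = "Quality of service"
    PROCESSING_ERROR = "Processing error"
    EQUIPMENT = "Equipment"
    ENVIRONMENTAL = "Environmental"


class Severity(Enum):
    CLEARED = "cleared"
    INDETERMINATE = "indeterminate"
    WARNING = "warning"
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


# Excerpt only: two probable causes per category, taken from the mapping.
PROBABLE_CAUSES = {
    Category.COMMUNICATION: {"Loss of signal", "Loss of frame"},
    Category.QUALITY_OF_SERVICE: {"Response time excessive", "Queue size exceeded"},
    Category.PROCESSING_ERROR: {"Storage capacity problem", "Version mismatch"},
    Category.EQUIPMENT: {"Power problem", "Timing problem"},
    Category.ENVIRONMENTAL: {"Temperature unacceptable", "Humidity unacceptable"},
}


@dataclass(frozen=True)
class AlarmAttributes:
    category: Category
    probable_cause: str
    severity: Severity

    def __post_init__(self) -> None:
        # Reject a probable cause that does not belong to the category.
        if self.probable_cause not in PROBABLE_CAUSES[self.category]:
            raise ValueError(
                f"{self.probable_cause!r} is not a {self.category.value} cause")


alarm = AlarmAttributes(Category.EQUIPMENT, "Power problem", Severity.MAJOR)
```

Validating the probable cause against its category at construction time keeps mismatched combinations, such as an Equipment alarm with "Loss of signal", out of the model.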

 

Alarm State Definition

An alarm is triggered after an erroneous condition is detected. From the time the Alarm is raised until it is reported, it passes through the following states:

  • Alarm Raised State: The Alarm enters the raised state when the application detects the erroneous condition and raises the Alarm.
  • Alarm Soaked State: After the Alarm has been soaked for the configured period, it enters the soaked state.
  • Alarm Generated State: When the Alarm satisfies the generation rule, it enters the generated state. If it does not satisfy the rule because of a dependent Alarm, it remains in the soaked state until the dependent Alarms are generated.
  • Alarm Masked State: If the masking logic masks the Alarm, it enters the masked state. This happens in either of two cases:
    • An alarm is already raised on a managed resource higher in the hierarchy, so the alarm raised on this resource is masked.
    • The suppression rule is satisfied for the alarm.
  • Alarm Reported State: If the Alarm is not masked, it enters the reported state.
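
The first masking condition, an alarm already raised higher in the hierarchy, can be sketched as an ancestor lookup. The example below is an illustration under assumptions, not platform code: resources are modelled as path tuples, active alarms as (resource, category) pairs, and the category match follows the same-category masking described in this article. The function name `is_masked` and the resource names are hypothetical.

```python
from typing import Set, Tuple

ResourcePath = Tuple[str, ...]


def is_masked(resource: ResourcePath, category: str,
              active: Set[Tuple[ResourcePath, str]]) -> bool:
    """Return True if any ancestor of `resource` has an active alarm
    of the same category."""
    # Proper prefixes of the path are the ancestors in the hierarchy.
    for depth in range(1, len(resource)):
        if (resource[:depth], category) in active:
            return True
    return False


# An equipment alarm is active on blade1.
active = {(("chassis", "blade1"), "equipment")}
port = ("chassis", "blade1", "port3")

masked_same_category = is_masked(port, "equipment", active)
masked_other_category = is_masked(port, "environmental", active)
masked_other_subtree = is_masked(("chassis", "blade2", "port1"), "equipment", active)
masked_self = is_masked(("chassis", "blade1"), "equipment", active)
```

Only proper ancestors are checked, so an alarm does not mask itself.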

Raising an Alarm on a Managed Resource

The managed resources that raise alarms are modeled as Alarm Managed Objects in the ASP. A managed resource can be a hardware or a software resource. The Alarms on a managed resource correspond to the probable cause attribute of its alarm managed object. An application can raise an Alarm on a managed resource through the ASP Alarm service when it detects a deviation from normal operation.

The Alarm service also allows the application to pass additional context information when raising the Alarm. After the Alarm is successfully reported, the Alarm Manager returns a unique Alarm handle for the raised Alarm. For service-impacting alarms, this handle is used when reporting the failure to the Availability Management Framework (AMF) service.

Alarm Filters

The filtering mechanism of the Alarm service prevents spurious Alarms from flooding the Alarm Manager. It also applies the constraints that a condition must satisfy before it is reported as an Alarm. The filters fall into three groups: soaking, generation and suppression rules, and masking.

  • Soaking: Soaking is the time duration for which the erroneous condition must persist before it is reported as an Alarm. Soaking allows an Alarm to be monitored for a specific time period.
  • Generation and Suppression Rules: An application can specify generation and suppression rules to model dependencies between Alarms.
    • A generation rule specifies the following:
      • The Alarm that needs to be generated.
      • Set of dependent Alarms.
      • The condition that has to be satisfied for the Alarm to be generated. This condition is specified as a logical relationship between the dependent Alarms.
    • A suppression rule specifies the following:
      • The Alarm that needs to be suppressed.
      • Set of dependent Alarms.
      • The condition that needs to be met for the Alarm to be suppressed. This condition is specified as a logical relationship between the dependent Alarms.
  • Alarm Masking: The Alarm service uses the containment relationship of the Alarm MSOs to implement hierarchical masking. The alarm-masking algorithm masks all Alarms of the same category within a subtree of the hierarchy.
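
A generation or suppression rule, as described above, names a target Alarm, a set of dependent Alarms, and a logical relationship between them. The sketch below models that in Python. It rests on assumptions: a dependent Alarm counts as satisfied when it is currently active, the names `Rule`, `Relation`, and `should_report` are invented, and the alarm identifiers are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Set


class Relation(Enum):
    AND = "and"  # every dependent alarm must be active
    OR = "or"    # at least one dependent alarm must be active


@dataclass(frozen=True)
class Rule:
    target: str                  # alarm to be generated or suppressed
    dependents: FrozenSet[str]   # expected to hold at least one alarm
    relation: Relation

    def is_satisfied(self, active: Set[str]) -> bool:
        hits = (dep in active for dep in self.dependents)
        return all(hits) if self.relation is Relation.AND else any(hits)


def should_report(generation: Optional[Rule], suppression: Optional[Rule],
                  active: Set[str]) -> bool:
    """Report when the generation rule holds and the suppression rule does not."""
    generated = generation is None or generation.is_satisfied(active)
    suppressed = suppression is not None and suppression.is_satisfied(active)
    return generated and not suppressed


gen = Rule("LINK_DOWN", frozenset({"LOSS_OF_SIGNAL", "LOSS_OF_FRAME"}), Relation.OR)
sup = Rule("LINK_DOWN", frozenset({"CARD_REMOVED"}), Relation.AND)

report_a = should_report(gen, sup, {"LOSS_OF_SIGNAL"})
report_b = should_report(gen, sup, {"LOSS_OF_SIGNAL", "CARD_REMOVED"})
report_c = should_report(gen, sup, set())
```

In this model an Alarm without a generation rule is always generated, and one without a suppression rule is never suppressed; whether the platform behaves the same way is not stated here.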

Event Generation

An event is published for each reported Alarm. Interested services can obtain these events by using the alarm client event subscribe API. The payload of the event consists of the Alarm information and an Alarm handle. The Alarm information comprises the probable cause, category, severity, specific problem, resource on which the Alarm was reported, Alarm state (raised or cleared), time stamp, and any additional information. The Alarm handle is unique for every node and is also known as the Notification Identifier. The Alarm Manager maintains the mapping from the Alarm handle to the reported Alarm. The Alarm Service depends on the Event Service to deliver the event.
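
To make the payload and the subscription model concrete, here is a minimal in-process sketch in Python. It does not use the alarm client event subscribe API or the Event Service: `AlarmEvent` and `AlarmEventChannel` are hypothetical stand-ins, and the field types are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class AlarmEvent:
    handle: int            # Alarm handle (Notification Identifier)
    probable_cause: str
    category: str
    severity: str
    specific_problem: int
    resource: str          # resource on which the Alarm was reported
    state: str             # "raised" or "cleared"
    timestamp: float
    additional_info: str = ""


class AlarmEventChannel:
    """In-process stand-in for the alarm event channel."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[AlarmEvent], None]] = []

    def subscribe(self, callback: Callable[[AlarmEvent], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, event: AlarmEvent) -> None:
        # Publishers never see who is subscribed; they only publish.
        for callback in self._subscribers:
            callback(event)


channel = AlarmEventChannel()
received: List[AlarmEvent] = []
channel.subscribe(received.append)  # a management application subscribes

channel.publish(AlarmEvent(
    handle=1, probable_cause="Power problem", category="Equipment",
    severity="major", specific_problem=0, resource="chassis/blade1",
    state="raised", timestamp=0.0))
```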

Alarm Life Cycle

The life cycle of an Alarm, from the time it is raised until it is reported, is as follows:

  1. The Alarm enters the raised state when an application detects an erroneous condition and raises the Alarm. A raised Alarm can have an assert or clear status.
  2. After the Alarm is soaked, it enters the soaked state.
  3. When the Alarm satisfies the generation rule, it enters the generated state. This leads to the generation of other Alarms that depend on this Alarm. If the Alarm does not satisfy the generation rule, it stays in the soaked state.
  4. If the status of the Alarm is clear after it satisfies the generation rule, the Alarm is reported.
  5. If the status of the Alarm is asserted after the generated state, the Alarm is checked against the hierarchical masking rule. If the parent MO has an Alarm of the same category, the Alarm enters the masked state. Otherwise, the Alarm enters the reported state.
  6. Alarms in the masked state move to the reported state after the Alarm of the same category on the parent MO is cleared.
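
The six steps above amount to a small state machine. The following Python sketch is one possible reading of them, with the inputs reduced to four booleans; the names `State`, `next_state`, `final_state`, and the parameter names are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    RAISED = auto()
    SOAKED = auto()
    GENERATED = auto()
    MASKED = auto()
    REPORTED = auto()


def next_state(state: State, *, asserted: bool, soaking_done: bool,
               rule_satisfied: bool, parent_alarm_active: bool) -> State:
    """Advance one step through the life cycle."""
    if state is State.RAISED:
        return State.SOAKED if soaking_done else State.RAISED
    if state is State.SOAKED:
        return State.GENERATED if rule_satisfied else State.SOAKED
    if state is State.GENERATED:
        if not asserted:
            return State.REPORTED  # step 4: a clear is reported directly
        # step 5: an assert is checked against hierarchical masking
        return State.MASKED if parent_alarm_active else State.REPORTED
    if state is State.MASKED:
        # step 6: leave the masked state once the parent alarm is cleared
        return State.MASKED if parent_alarm_active else State.REPORTED
    return State.REPORTED


def final_state(**conditions: bool) -> State:
    """Apply next_state until the state stops changing."""
    state = State.RAISED
    while True:
        new = next_state(state, **conditions)
        if new is state:
            return state
        state = new


reported = final_state(asserted=True, soaking_done=True,
                       rule_satisfied=True, parent_alarm_active=False)
masked = final_state(asserted=True, soaking_done=True,
                     rule_satisfied=True, parent_alarm_active=True)
cleared = final_state(asserted=False, soaking_done=True,
                      rule_satisfied=True, parent_alarm_active=True)
waiting = final_state(asserted=True, soaking_done=True,
                      rule_satisfied=False, parent_alarm_active=False)
```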

Alarm Flow

The Alarm flow from the detection of the erroneous condition to the reporting of the Alarm is as follows:

  1. The application detects the erroneous condition and raises the Alarm.
  2. The Alarm client soaks the Alarm and, if the erroneous condition persists, moves the Alarm from the raised state to the soaked state.
  3. When the Alarm qualifies the generation rule, the Alarm client moves the Alarm from the soaked state to the generated state.
  4. After the generation of the Alarm, it is passed to the Alarm server. If the Alarm is not being masked or if the status of the Alarm is clear, then the Alarm server performs the following:
    1. Generates a unique Alarm handle.
    2. Updates the Alarm status in COR.
    3. Publishes an event as described in the Event Generation section.
    4. Reports the Alarm to the Fault Manager if it is a non-service-impacting Alarm.
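
The server-side part of step 4 can be sketched as below. This is a simplified Python model, not the Alarm server implementation: the dictionary stands in for COR, the lists stand in for the event channel and the Fault Manager, and `AlarmServer.process` with its parameters is a hypothetical interface.

```python
import itertools
from typing import Dict, List, Optional, Tuple


class AlarmServer:
    def __init__(self) -> None:
        self._handles = itertools.count(1)                     # unique handle source
        self.status_store: Dict[Tuple[str, str], str] = {}    # stand-in for COR
        self.published: List[dict] = []                        # stand-in for events
        self.fault_reports: List[int] = []                     # stand-in for Fault Manager

    def process(self, resource: str, probable_cause: str, *, asserted: bool,
                masked: bool, service_impacting: bool) -> Optional[int]:
        """Handle a generated alarm; return its handle, or None if masked."""
        # Proceed only if the alarm is not masked or its status is clear.
        if masked and asserted:
            return None
        handle = next(self._handles)                           # 1. unique handle
        state = "raised" if asserted else "cleared"
        self.status_store[(resource, probable_cause)] = state  # 2. update status
        self.published.append({"handle": handle, "resource": resource,
                               "probable_cause": probable_cause,
                               "state": state})                # 3. publish event
        if not service_impacting:
            self.fault_reports.append(handle)                  # 4. report fault
        return handle


server = AlarmServer()
h1 = server.process("blade1", "Power problem", asserted=True,
                    masked=False, service_impacting=False)
h2 = server.process("blade1/port3", "Loss of signal", asserted=True,
                    masked=True, service_impacting=False)
h3 = server.process("blade1", "Power problem", asserted=False,
                    masked=True, service_impacting=True)
```

The second call returns no handle because the alarm is asserted and masked, while the third is processed despite the mask because its status is clear.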

Conclusion

The Alarm Service provides a structured approach to alarm management, ensuring that critical system events are properly detected, categorized, and reported. Filtering mechanisms such as soaking, masking, generation rules, and suppression rules help maintain system stability and high availability. Please reach out to us at support@openclovis.org to learn more about applying the Alarm Service to your application.