OpenClovis Availability Management Framework (AMF)

Introduction

The Availability Management Framework (AMF) manages almost all activities of components and nodes: starting and stopping components, health-checking components, shutting down nodes, applying policies to entities (instantiating Nodes, SUs, and SGs), computing the recovery action when a component fails, and performing CSI set and CSI removal. The following sections describe these functions.

Overview

The OpenClovis AMF functions are collectively implemented in the Availability Management Service (AMS) and the Component Manager (CPM) software components.

AMS is the primary decision making entity in the system with respect to service availability.

The CPM is the operational manager entity and is responsible for managing the various components through their life cycle, based on the configured policy as well as directions from AMS.

AMS and CPM together maintain a view of the system model that describes the various software and hardware components in the system, their attributes, their dependencies on each other, the rules for grouping them together to provide service availability, their current state, and the policies to be applied in the event of failure.

Availability Management Service (AMS)

Availability Management Service (AMS) comes with a default set of policies on how to recover a service. These include restart or switchover of services to a standby. The following are the functions of AMS:

Defining System Model
Service Unit and Component Lifecycle Management
Service Group Management
Best Possible Assignment at Boot-time
CSI Assignment
Node Management
Administrative Control
Fault Recovery
Faults at Node
AMF Client API

Component Manager (CPM)

The CPM is an operational manager entity, responsible for managing the various Service Units and components through their lifecycle, based on configured policy and instructions from the AMS. The following are the functions of CPM:

Hierarchical Component Management
Boot Management
Basic Component Lifecycle Management
Component Health Monitoring
Failure Event Generation
Automatic Logging of Health State Changes

CPM-AMS interaction

In the SAFplus Platform, the Availability Management Framework is implemented as two entities: the Availability Management Service (AMS) and the Component Manager (CPM). Even though both entities run in the same process, there is a very clear separation of functionality between AMS and CPM.

AMS is a ‘server library’ that maintains the configuration and status information about the whole cluster. It knows the policies that have been configured for different entities and the recovery (or repair) actions to be taken in case of an application failure, based on both the configured policies and the current cluster state. AMS deals only with application components.
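
As a rough illustration of a decision that depends on both the configured policy and the current cluster state, here is a minimal Python sketch. The names RecoveryPolicy, ClusterState, and choose_recovery are hypothetical and are not part of the SAFplus API. Falling back to a restart when no standby is available is an assumption made only for this example.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryPolicy(Enum):
    RESTART = "restart"
    SWITCHOVER = "switchover"


@dataclass
class ClusterState:
    # Hypothetical snapshot: service group name -> whether a standby is available.
    standby_available: dict


def choose_recovery(policy, service_group, state):
    """Pick an action from the configured policy and the current cluster state."""
    if policy is RecoveryPolicy.SWITCHOVER:
        if state.standby_available.get(service_group, False):
            return "switchover"
        # Assumption for this sketch: with no standby, fall back to a restart.
        return "restart"
    return "restart"


state = ClusterState(standby_available={"sg-web": True, "sg-db": False})
print(choose_recovery(RecoveryPolicy.SWITCHOVER, "sg-web", state))
print(choose_recovery(RecoveryPolicy.SWITCHOVER, "sg-db", state))
```

The same configured policy yields different actions for the two service groups because the cluster state differs, which is the point of keeping both inputs in one place.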

CPM is an operational entity that simply carries out the commands given by AMS. However, CPM does manage the complete life cycle of the SAFplus Platform components. Unlike the AMS policies, the recovery policy for SAFplus Platform components is very primitive: if a SAFplus Platform component fails, it is simply restarted. There is a limit to these restart attempts, however, after which the node is shut down.

Interactions between AMS and CPM in brief:

1. AMS to CPM:

Calls CPM to manage the life cycle of a component.
Calls CPM to assign work to, or remove work from, a component.
Informs CPM whether a node for which a shutdown request has arrived can be shut down.
Informs CPM when taking node-level recovery actions (node failfast and node failover).

2. CPM to AMS:

Reports to AMS the success or failure of a component life cycle request or a work assignment/removal request.
Reports a fault on a component whenever a failure of that component is detected.
Queries AMS about the HA state of a component.
Informs AMS that a node is joining the cluster and is ready to provide service.
Informs AMS that a node wants to leave the cluster, so that AMS, if it permits the shutdown, switches over all the components and proceeds with the shutdown.
Informs AMS that a node has exited ungracefully, so that AMS fails over the components without trying to contact the failed node.
Informs AMS about the availability or unavailability of SAFplus Platform components.
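
The two directions of interaction listed above can be summarized as two sets of message kinds. The following Python sketch is only a way to visualize the list: the enum names and the ams_reaction helper are hypothetical, and the real AMS and CPM communicate through internal calls that are not shown here.

```python
from enum import Enum, auto


class AmsToCpm(Enum):
    COMPONENT_LIFECYCLE = auto()    # manage the life cycle of a component
    WORK_ASSIGN_OR_REMOVE = auto()  # assign or remove work
    NODE_SHUTDOWN_VERDICT = auto()  # node can or cannot be shut down
    NODE_LEVEL_RECOVERY = auto()    # node failfast or node failover


class CpmToAms(Enum):
    REQUEST_RESULT = auto()                   # success/failure of a request
    COMPONENT_FAULT = auto()                  # fault detected on a component
    HA_STATE_QUERY = auto()                   # query the HA state of a component
    NODE_JOINING = auto()                     # node is ready to provide service
    NODE_LEAVE_REQUEST = auto()               # node wants to leave the cluster
    NODE_UNGRACEFUL_EXIT = auto()             # node was killed or crashed
    PLATFORM_COMPONENT_AVAILABILITY = auto()  # platform component up or down


def ams_reaction(message):
    """Hypothetical lookup of the AMS reaction for the two node-exit cases."""
    reactions = {
        CpmToAms.NODE_LEAVE_REQUEST:
            "switch over components, then allow shutdown if permitted",
        CpmToAms.NODE_UNGRACEFUL_EXIT:
            "fail over components without contacting the failed node",
    }
    # Other message kinds are not modelled in this sketch.
    return reactions.get(message, "not modelled")


print(ams_reaction(CpmToAms.NODE_UNGRACEFUL_EXIT))
```

The sketch makes the difference between the two node-exit cases explicit: a graceful leave gives AMS a chance to switch over, while an ungraceful exit goes straight to failover.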

CPM architecture

In a SAFplus Platform cluster, a CPM runs on every node and manages the components on that node. The AMS, in contrast, is brought to life by the CPM only if that node is a system controller.

Based on whether or not AMS is initialized by the CPM, there are two types of CPM:

Local Component Manager. This is also called CPM/L (L standing for local). The node (or blade) where CPM/L is running is called a Worker Blade. The term Worker Node means the same thing.

Global Component Manager. This is also called CPM/G (G standing for global). The node (or blade) where CPM/G is running is called a System Controller Blade or simply a System Controller.

The terms CPM/G and System Controller, CPM/L and Worker Blade (Worker Node) are used interchangeably, sometimes referring to the CPM running on the node as an entity and sometimes referring to the node itself where the particular type of CPM is running.

The differences between CPM/G and CPM/L are:

The AMS runs only on System Controller nodes. The system controller node where AMS is running in active mode is said to have an Active HA state. Similarly, the system controller where AMS is running in standby mode is said to have a Standby HA state. No AMS runs on a Worker Node, so a worker node has no HA state associated with it.

The CPM/G (both active and standby) maintains information about all the nodes in the cluster (but not the components on those nodes) and heartbeats all the nodes that are up and running. The CPM/L does not maintain any such information.

The active CPM/G, or active system controller blade, is where all HA-related activities take place. All triggers about events happening in the cluster (e.g. node failure, user component failure, a node becoming ready to provide service) arrive there, directly or indirectly, and the corresponding actions are taken there.

On ungraceful termination of any node, CPM/G will publish an event providing information about the failed node.

The active CPM/G acts as an intermediary between AMS and the real world. It brings AMS to life and keeps it informed about events in the cluster, such as:
1. A node is joining because it is ready to provide service.
2. A node is leaving because a shutdown was issued on the node.
3. A node has left the cluster because it was killed or crashed for some reason (e.g. communication failure, kernel panic).

The active CPM/G is the one that actually carries out the operations specified by AMS in reaction to these events.
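
The relationship between node role, CPM type, and HA state described above can be sketched in a few lines of Python. The Node class and the cpm_type and ha_state helpers are hypothetical names used only for illustration. The sketch assumes that a system controller's AMS mode is either "active" or "standby".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    is_system_controller: bool
    # "active" or "standby"; meaningful only on a system controller.
    ams_mode: Optional[str] = None


def cpm_type(node):
    """System controllers run CPM/G; worker nodes run CPM/L."""
    return "CPM/G" if node.is_system_controller else "CPM/L"


def ha_state(node):
    """Return the node's HA state, or None for a worker node (no AMS runs there)."""
    if not node.is_system_controller:
        return None
    return {"active": "Active", "standby": "Standby"}.get(node.ams_mode)


cluster = [
    Node("sc-1", True, "active"),
    Node("sc-2", True, "standby"),
    Node("worker-1", False),
]
for node in cluster:
    print(node.name, cpm_type(node), ha_state(node))
```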

The commonalities between CPM/G and CPM/L are:

1. Both types of CPM manage the life cycle of the SAFplus Platform components on the local node.

2. Heartbeating of the components. Both types of CPM heartbeat the local components and take the following actions:

If the heartbeat loss was for a SAFplus Platform component, the CPM restarts it. If this failure happens more than a certain number of times within a particular time window, however, the node is shut down gracefully.
If the heartbeat loss was for a user-defined component, the CPM reports the failure to AMS via the clCpmComponentFailureReport() API.

3. On the death of any component, whether SAFplus Platform or user-defined, the CPM publishes an event using the event management API, providing information about the failed component.
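
The heartbeat-loss handling in item 2 can be sketched as a sliding-window counter. This is a Python illustration, not the CPM implementation. The class name, the default limit of 3 restarts, and the 60-second window are all assumptions, since the text only says "a certain number of times within a particular time window". The sketch returns a string where the real CPM would call clCpmComponentFailureReport().

```python
import time
from collections import deque


class HeartbeatLossHandler:
    def __init__(self, max_restarts=3, window_seconds=60.0, clock=time.monotonic):
        self.max_restarts = max_restarts      # assumed limit, not a SAFplus default
        self.window_seconds = window_seconds  # assumed window, not a SAFplus default
        self.clock = clock
        self.failures = {}  # component name -> deque of failure timestamps

    def on_heartbeat_loss(self, component, is_platform_component):
        if not is_platform_component:
            # User-defined component: the CPM reports the failure to AMS
            # (via clCpmComponentFailureReport() in SAFplus).
            return "report_to_ams"
        now = self.clock()
        times = self.failures.setdefault(component, deque())
        times.append(now)
        # Drop failures that have fallen out of the time window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        if len(times) > self.max_restarts:
            return "shutdown_node"
        return "restart"


# Usage with a fake clock so the result does not depend on real time.
fake_time = [0.0]
handler = HeartbeatLossHandler(max_restarts=2, window_seconds=10.0,
                               clock=lambda: fake_time[0])
actions = []
for t in (0.0, 1.0, 2.0):
    fake_time[0] = t
    actions.append(handler.on_heartbeat_loss("platform-comp", True))
print(actions)
```

Injecting the clock keeps the sketch deterministic. With the values above, the third failure inside the window exceeds the limit and triggers the node shutdown.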

Functions of AMF

AMF is built on two closely associated OpenClovis SAFplus Platform components, the Component Manager (CPM) and the Availability Management Service (AMS). These two components work together as defined by the Service Availability Forum. AMS is the primary decision-making entity for service availability in the system. CPM is the operational manager and is responsible for actually managing the various components through their life cycle, depending on the configured policies.

The main functions provided by AMF are:

It maintains the view of one logical cluster that comprises several cluster nodes. These nodes host various resources in a distributed computing environment that describes the software and hardware components of the system.
It stores information about attributes of the resources, their dependencies, rules for grouping them together for providing services, their current state, and the set of policies to apply on failures.

Using OpenClovis IDE, you can configure the desired availability policies with AMF on how to recover in case of failure of a service.

AMF handles fault detection, fault isolation, and redundancy, performing constant health monitoring of the components. The Fault Manager handles fault recovery and repair actions.

Classification of Components:

Classification Based on Integration Mechanism with AMF

SA-Aware Components: SA-aware components contain processes linked to the AMF library. The process that registers the component with AMF is called the Registered Process. AMF uses this process to handle callbacks.

Non-SA-Aware Components: There is no registered process for non-SA-aware components. Non-SA-aware components use proxy components to register with AMF. All of their processes are application processes. Examples of non-SA-aware components are scripts for applications, hardware-specific resources, and so on.

Proxy component
Proxied Component
Proxy-Proxied Relationship

Classification Based on Run-time Characteristics

Pre-Instantiable Components
Non-Pre-Instantiable Components
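
The two classifications above can be pictured as attributes on a component record. The Python sketch below uses hypothetical names (Component, registration_path) and follows the description in this section: an SA-aware component registers itself, while a non-SA-aware component relies on an SA-aware proxy. The pre_instantiable field is only a flag here, since its run-time meaning is not detailed in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    sa_aware: bool
    pre_instantiable: bool
    proxy: Optional["Component"] = None  # set only for proxied components


def registration_path(component):
    """Return the name of the component whose registered process talks to AMF."""
    if component.sa_aware:
        return component.name
    if component.proxy is None or not component.proxy.sa_aware:
        raise ValueError(f"{component.name} needs an SA-aware proxy")
    return component.proxy.name


proxy = Component("hw-proxy", sa_aware=True, pre_instantiable=True)
card = Component("line-card", sa_aware=False, pre_instantiable=False, proxy=proxy)
print(registration_path(proxy))
print(registration_path(card))
```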

Conclusion

AMF is the heart of the High Availability service. It manages almost all activities of entities, such as starting, stopping, instantiating, and booting nodes and components; performs component health checks; computes the recovery action when a component fails; and carries out that action. It comprises two parts, CPM and AMS, each with its own role. They do not work alone but cooperate with each other to manage the life cycle of all components.

For further support, please send email to support@openclovis.org.
