Application Aware High Availability with OpenClovis

Ad-Hoc High Availability = Restarting Software

It is possible to provide a limited form of high availability simply by restarting software or machines (nodes) that fail. However, this primitive approach has serious drawbacks, such as slow recovery and loss of state during failure. In practice, such solutions also tend to grow in complexity as the application’s true requirements become apparent.

For example, either a manual restart is required on failure, or process and node monitoring software (which itself can fail and so must be monitored) must be written. Because minimizing loss of state during a failure is important, many ad-hoc solutions write essential information to disk. Depending on the amount of state, this can dramatically impact runtime performance, since disks are slow compared with memory. (It is very difficult to estimate at the beginning of a project how much state information will be needed, and estimates tend to be too low.) Finally, an ad-hoc solution rarely takes advantage of the synergies between high availability and scalability, and does not consider the complexities of managing a single logical entity (a network element) whose constituent elements (the individual machines, VMs, and processes) are ephemeral.

Application Aware High Availability = SAF-aware

By making high availability and scalability an intrinsic part of your design, the problems above can be addressed directly. The first step is to make your applications minimally aware of high availability roles. This process is called making your application “SAF-aware”, after the Service Availability Forum’s standard API. The process is quite simple. In your “main” function, you register with your availability management framework/service provider (variously called AMS or AMF) and then wait. Rather than starting your application’s “job” right away, let the availability management software tell your application to become “active” or “standby”:

int main(int argc, char *argv[])
{
    SaAisErrorT rc = SA_AIS_OK;

    /* Connect to the SAF cluster */
    initializeAmf();

    /* Do the application specific initialization here. */
    
    /* Block on AMF dispatch file descriptor for callbacks.
       When this function returns, it is time to quit. */
    dispatchLoop();
    
    /* Do the application specific finalization here. */

    /* Now finalize my connection with the SAF cluster */
    if((rc = saAmfFinalize(amfHandle)) != SA_AIS_OK)
      clprintf (CL_LOG_SEV_ERROR, "AMF finalization error[0x%X]", rc);
    else
      clprintf (CL_LOG_SEV_INFO, "AMF Finalized");   

    return 0;
}

More information on specific functions can be found at our Availability Management Framework page.
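
The dispatchLoop() function called in main() is not listed in this article. As a rough sketch of what such a loop typically looks like: in a SAF application the file descriptor would normally come from saAmfSelectionObjectGet(), and the callback would call saAmfDispatch(amfHandle, SA_DISPATCH_ALL). So that the example compiles without the SAF headers, both are replaced here by stand-ins: dispatchLoopSketch, DispatchFn and demoDispatch are hypothetical names, and a plain file descriptor plays the role of the AMF selection object.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical callback type: handle pending events on fd.
   Return nonzero when the loop should stop (time to quit). */
typedef int (*DispatchFn)(int fd, void *ctx);

/* Block until fd is readable, then let the callback process the event.
   Returns the number of dispatch calls made, or -1 if select() fails. */
int dispatchLoopSketch(int fd, DispatchFn dispatch, void *ctx)
{
    int calls = 0;

    for (;;)
    {
        fd_set readSet;
        FD_ZERO(&readSet);
        FD_SET(fd, &readSet);

        if (select(fd + 1, &readSet, NULL, NULL, NULL) < 0)
        {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;
        }

        if (FD_ISSET(fd, &readSet))
        {
            calls++;
            if (dispatch(fd, ctx) != 0)
                break;             /* callback asked us to quit */
        }
    }
    return calls;
}

/* Stand-in dispatcher for demonstration only: consumes one byte per
   event, counts it in *ctx, and stops when it reads 'q' (or on EOF). */
int demoDispatch(int fd, void *ctx)
{
    char c = 0;
    int *seen = (int *)ctx;

    if (read(fd, &c, 1) != 1)
        return 1;
    (*seen)++;
    return c == 'q';
}
```

In the real loop, the callback would invoke the AMF dispatch call, which in turn runs the callbacks registered in initializeAmf(); the terminate callback would then arrange for the loop to exit.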

In the initializeAmf() function, you specify callbacks so that the AMF can tell you what role to assume. You also deal with “housekeeping” such as verifying that your application speaks the same AMF protocol version as the server:

void initializeAmf(void)
{
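    SaAmfCallbacksT     callbacks = {0}; /* zero every callback, including any not set explicitly below */
    SaVersionT          version;
    SaAisErrorT         rc = SA_AIS_OK;

    /* Request SAF AMF API version B.01.01 */
    version.releaseCode  = 'B';
    version.majorVersion = 1;
    version.minorVersion = 1;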
    
    callbacks.saAmfHealthcheckCallback          = NULL; /* rarely necessary because SAFplus automatically monitors the process */
    callbacks.saAmfComponentTerminateCallback   = safTerminate;
    callbacks.saAmfCSISetCallback               = safAssignWork;
    callbacks.saAmfCSIRemoveCallback            = safRemoveWork;
    callbacks.saAmfProtectionGroupTrackCallback = NULL;
        
    /* Initialize AMF client library. */
    if ( (rc = saAmfInitialize(&amfHandle, &callbacks, &version)) != SA_AIS_OK)
      {
        /* Not running in a high availability framework.  You could exit here, or just call your "active" function if you want
           your application to work even if there is no framework running (useful for debugging) */
        errorExit(rc);
      }

    /*
     * Now register the component with AMF. At this point it is
     * ready to provide service, i.e. take work assignments.
     */

    if ( (rc = saAmfComponentNameGet(amfHandle, &appName)) != SA_AIS_OK) 
        errorExit(rc);
    if ( (rc = saAmfComponentRegister(amfHandle, &appName, NULL)) != SA_AIS_OK) 
        errorExit(rc);
}
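
The comment in the listing above mentions that, when no framework is running, you could either exit or go straight to your “active” function. A minimal sketch of that decision follows, written so that it compiles without the SAF headers. All names here (startWithFallback, InitFn, BecomeActiveFn, StartMode, and the demo stubs) are hypothetical; InitFn stands in for a wrapper around saAmfInitialize() that returns 0 on success.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real calls. */
typedef int  (*InitFn)(void);          /* 0 = connected to the framework */
typedef void (*BecomeActiveFn)(void);  /* starts the application's job   */

typedef enum { MODE_MANAGED, MODE_STANDALONE, MODE_FAILED } StartMode;

/* Try to connect to the availability framework. If that fails and
   standalone operation is allowed (for example while debugging),
   become active immediately instead of exiting. */
StartMode startWithFallback(InitFn initFramework,
                            BecomeActiveFn becomeActive,
                            int allowStandalone)
{
    if (initFramework() == 0)
        return MODE_MANAGED;      /* now wait for a role assignment */

    if (!allowStandalone)
        return MODE_FAILED;       /* caller should report the error and exit */

    fprintf(stderr, "No availability framework found; running standalone\n");
    becomeActive();
    return MODE_STANDALONE;
}

/* Demo stubs, for illustration only. */
static int demoActiveCalls = 0;
int  demoInitOk(void)       { return 0; }
int  demoInitFails(void)    { return 1; }
void demoBecomeActive(void) { demoActiveCalls++; }
```

Note that a standalone instance is not monitored or restarted by anything, so this mode is probably best limited to development and debugging.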

Your application should then essentially enter an “idle” mode. When the availability framework wants your application to assume a role, it calls the function you specified (in the case above, “safAssignWork”). The API looks complex at first glance because it lets an application developer implement more advanced high availability behavior, such as an application that is simultaneously active for some tasks and standby for others. However, to implement simple active/standby high availability, most of these fields can be ignored. Making an application “SAF-aware” can be as simple as renaming “main” to something else (say “originalMainFunction”) and then spawning it in a thread when the callback makes the application “active”:

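void safAssignWork(SaInvocationT invocation, const SaNameT *compName, SaAmfHAStateT haState, SaAmfCSIDescriptorT csiDescriptor)
{
    /* Each case responds to AMF exactly once for this invocation. */
    switch ( haState )
    {
        case SA_AMF_HA_ACTIVE:  /* This process should become active, so start a "main" thread that actually implements the application */
        {
            pthread_t thr;
            /* originalMainFunction must have the thread signature: void *originalMainFunction(void *) */
            if (pthread_create(&thr, NULL, originalMainFunction, NULL) != 0)
            {
                /* Could not start the work: tell AMF the assignment failed */
                saAmfResponse(amfHandle, invocation, SA_AIS_ERR_FAILED_OPERATION);
                break;
            }
            saAmfResponse(amfHandle, invocation, SA_AIS_OK);
            break;
        }
        case SA_AMF_HA_STANDBY:
        {
            /* If your standby has ongoing maintenance, you would spawn a thread
               here to do it. */
            saAmfResponse(amfHandle, invocation, SA_AIS_OK);
            break;
        }
        default:
        {
            /* States not handled explicitly are simply acknowledged */
            saAmfResponse(amfHandle, invocation, SA_AIS_OK);
            break;
        }
    }
}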

You can also add “case” statements to handle graceful and abrupt shutdown of your application:

        case SA_AMF_HA_QUIESCED:
        {
            /*
             * AMF has requested application to quiesce the CSI currently
             * assigned the active or quiescing HA state. The application 
             * must stop work associated with the CSI immediately.
             */
            running = 0;
            saAmfResponse(amfHandle, invocation, SA_AIS_OK);
            break;
        }

        case SA_AMF_HA_QUIESCING:
        {
            /*
             * AMF has requested application to quiesce the CSI currently
             * assigned the active HA state. The application must stop work
             * associated with the CSI gracefully and not accept any new
             * workloads while the work is being terminated.
             */

            /* There are two typical cases for quiescing.  Choose one!
               CASE 1: It's possible to quiesce rapidly within this thread context */
            if (1)
              {
              /* App code here: Now finish your work and cleanly stop the work*/
                
              clprintf(CL_LOG_SEV_INFO,"csa102: Signaling completion of QUIESCING");
              running = 0;
              /* Call saAmfCSIQuiescingComplete when stopping the work is done */
              saAmfCSIQuiescingComplete(amfHandle, invocation, SA_AIS_OK);
              }
            else
              {
              /* CASE 2: You can't quiesce within this thread context or quiesce
               rapidly. */

              /* Respond immediately to the quiescing request */
              saAmfResponse(amfHandle, invocation, SA_AIS_OK);

              /* App code here: Signal or spawn a thread to cleanly stop the work*/
              /* When that thread is done, it should call:
                 saAmfCSIQuiescingComplete(amfHandle, invocation, SA_AIS_OK);
              */
              }

            break;
        }
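
The listings above set running = 0 when work must stop, but do not show how the worker thread uses that flag. One possible arrangement is sketched below, assuming a C11 compiler and POSIX threads. The names running and originalMainFunction come from the listings above; startWork, stopWork, unitsDone and workerStarted are hypothetical helpers added for illustration, and the loop body is a placeholder for the application's real job.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int  running   = 0;   /* cleared by the QUIESCED/QUIESCING cases */
static atomic_long unitsDone = 0;   /* hypothetical progress counter */
static pthread_t   workerThread;
static int         workerStarted = 0;

/* The application's former main(), reshaped to the pthread signature.
   It polls the flag between units of work so it can stop promptly. */
void *originalMainFunction(void *arg)
{
    (void)arg;
    while (atomic_load(&running))
    {
        atomic_fetch_add(&unitsDone, 1);  /* placeholder for one unit of work */
        sched_yield();
    }
    return NULL;
}

/* Would be called from the SA_AMF_HA_ACTIVE case. Returns 0 on success. */
int startWork(void)
{
    atomic_store(&running, 1);
    if (pthread_create(&workerThread, NULL, originalMainFunction, NULL) != 0)
    {
        atomic_store(&running, 0);
        return -1;
    }
    workerStarted = 1;
    return 0;
}

/* Would be called when work must stop. Clears the flag and waits
   until the worker has actually exited. */
void stopWork(void)
{
    atomic_store(&running, 0);
    if (workerStarted)
    {
        pthread_join(workerThread, NULL);
        workerStarted = 0;
    }
}
```

An atomic flag is used so that the store in the callback thread is reliably seen by the worker thread. Joining the worker before reporting completion to the framework is one way to make sure the work has really stopped; if the join could take long, it belongs in a separate thread, as in CASE 2 above.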

That’s it! Your application is now “SAF-aware” and can be automatically monitored, restarted, and assigned work by an appropriate framework. Of course, your application does not yet identify its critical state, so that state will be lost when a failure happens. Preserving it requires a service called “Checkpointing”…

For other support, please send email to support@openclovis.org.

OpenClovis Software: High Availability Platform – Management Platform
