Resolving Semaphore Race Conditions in SAFplus


Introduction

In large, complex distributed systems made up of many nodes or clusters, bugs are hard to reproduce and their root causes are hard to trace. OpenClovis is no exception. The development team has already handled most common bugs, but harder ones still surface occasionally, and race conditions are among the hardest. In these cases users, customers, and the development team need to work together to improve the system. To support this, OpenClovis provides a bug tracker at https://cloviszilla.openclovis.com/. There, users and customers can ask questions, make requests, and report bugs, and the development team can pick them up and respond quickly.
This article walks through a typical case of that collaboration between a customer and the development team: bug 197, a race condition.

Background

The customer opened a request for help with a bug, attaching the relevant files and some initial observations made when the problem occurred.

Upgrade Sequence:

  • SYS_CTRLI0 was upgraded and restarted.
  • asp start was triggered, but the system did not come up.
  • amf_watchdog, safplus_amf, and safplus_logd were running.

The only logs we were seeing were:

  • One line about the “Welcome to OpenClovis …” log in SYS_CTRLI0.log.
  • One line in sys.latest saying “OpenClovis_Logfile”. We did not even see the expected “Welcome to OpenClovis …” in the second line.

Troubleshooting Attempts:

  • Restarting asp multiple times had no effect.
  • Clearing files under /dev/shm/CL_*, var/run/*, and var/lib/* did not resolve the issue.

The system was recovered only by rebooting the card.

Initial Analysis

The development team responded promptly with an initial analysis, a quick look at what had happened based on the report:

“AMF (safplus_amf) process logs “Welcome to OpenClovis …” into a global array, and it is pushed to the sys.latest file by the LOG (safplus_log) process.
One possible point might be the call to msync() with MS_SYNC, which flushes the changes made to the in-core copy of a file that was mapped into memory using mmap() back to disk. MS_SYNC asks for an update and waits for it to complete.”

  • Suspected blocking during initialization in safplus_logd.
  • Focus shifted to msync() calls with MS_SYNC, which could stall while flushing memory-mapped changes to disk.
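
To make the suspected blocking point concrete, here is a minimal sketch of a synchronous flush of a memory-mapped file. It is not SAFplus code: the helper name write_mapped_sync and the file it writes are hypothetical, and it only illustrates how a caller waits inside msync() when MS_SYNC is used.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of SAFplus): writes msg into a file through
 * a shared memory mapping and flushes it synchronously.
 * Returns 0 on success, -1 on failure. */
int write_mapped_sync(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    if (len == 0)
        return -1;                      /* mmap() rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* The file must be at least as large as the region we map. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, msg, len);              /* changes only the in-core copy */

    /* MS_SYNC requests the write-back and blocks until it completes.
     * If the storage is slow or stuck, the caller waits here. */
    int rc = msync(map, len, MS_SYNC);

    munmap(map, len);
    close(fd);
    return rc;
}
```

A logging process that calls something like this on every flush cannot make progress until the write-back finishes, which is why msync() was a reasonable first suspect.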

This response is both an initial analysis and a notification to the customer that we are aware of the issue and that further analysis and troubleshooting will follow.

Deeper Investigation

The image above details where the problem occurs and explains it in a way that customers can understand.

Specifically, the investigation identified a race condition in semaphore creation and initialization. If two processes attempted to create or initialize the same semaphore simultaneously, one process could reset the semaphore value after the other had already incremented it. The semaphore was then left permanently “taken,” and processes waiting on it hung in semop().
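
The interleaving described above can be replayed deterministically in a single process. The sketch below is an illustration under stated assumptions, not the SAFplus code: the helper names racy_open and replay_race, the union name sem_arg, and the initial value of 0 are all hypothetical. It uses IPC_NOWAIT so that the final semop() reports EAGAIN instead of hanging.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* The caller supplies the fourth semctl() argument type (see semctl(2)).
 * The name sem_arg is our own choice. */
union sem_arg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Racy pattern: get-or-create, then ALWAYS set the value, even when the
 * semaphore already existed. The initial value 0 is an assumption. */
static int racy_open(key_t key)
{
    int id = semget(key, 1, IPC_CREAT | 0600);
    if (id < 0)
        return -1;
    union sem_arg arg;
    arg.val = 0;
    if (semctl(id, 0, SETVAL, arg) < 0)
        return -1;
    return id;
}

/* Replays the harmful ordering. Returns 1 if a waiter would now block
 * forever, 0 if it would not, -1 on a setup error. */
int replay_race(key_t key)
{
    int a = racy_open(key);             /* "process A" creates and initializes */
    if (a < 0)
        return -1;

    struct sembuf up = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
    if (semop(a, &up, 1) != 0) {        /* A increments: value is now 1 */
        semctl(a, 0, IPC_RMID);
        return -1;
    }

    int b = racy_open(key);             /* "process B" resets the value to 0 */
    if (b < 0) {
        semctl(a, 0, IPC_RMID);
        return -1;
    }

    /* A blocking decrement would hang here; IPC_NOWAIT turns the hang
     * into an EAGAIN error so the demo can finish. */
    struct sembuf down = { .sem_num = 0, .sem_op = -1, .sem_flg = IPC_NOWAIT };
    int rc = semop(b, &down, 1);
    int stuck = (rc == -1 && errno == EAGAIN);

    semctl(a, 0, IPC_RMID);             /* clean up the demo semaphore */
    return stuck;
}
```

With a real blocking semop() and no process left to increment the semaphore again, the waiter never wakes up, which matches the hang the customer observed.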

Proposed Fix

To eliminate the race condition, the engineers made two changes:

  • Semaphores are created with the IPC_EXCL | IPC_CREAT flags, ensuring that only one process successfully creates a given semaphore.
  • Other processes retry without reinitializing the semaphore.

This change prevents multiple processes from interfering with semaphore initialization.
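
A minimal sketch of this create-exclusively-or-open pattern follows. It is not the merged patch: the function name open_sem_once, the union name sem_arg, the permission bits, and the retry limit of three attempts are assumptions made for illustration.

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* The caller supplies the fourth semctl() argument type (see semctl(2)).
 * The name sem_arg is our own choice. */
union sem_arg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Hypothetical helper: returns the semaphore id, or -1 on error.
 * *created is set to 1 only in the process that created (and therefore
 * initialized) the semaphore, and to 0 in every other process. */
int open_sem_once(key_t key, int initial, int *created)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        /* IPC_EXCL makes creation fail with EEXIST if the key is in use,
         * so at most one process gets past this call successfully. */
        int id = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);
        if (id >= 0) {
            union sem_arg arg;
            arg.val = initial;
            if (semctl(id, 0, SETVAL, arg) < 0) {
                semctl(id, 0, IPC_RMID);
                return -1;
            }
            *created = 1;
            return id;
        }
        if (errno != EEXIST)
            return -1;

        /* Another process created it: open only, never reinitialize. */
        id = semget(key, 1, 0);
        if (id >= 0) {
            *created = 0;
            return id;
        }
        if (errno != ENOENT)
            return -1;
        /* The semaphore was removed in between: retry the creation. */
    }
    return -1;
}
```

This sketch does not close the short window between creation and SETVAL, during which a second process could open a semaphore that has not been initialized yet. A commonly described way to handle that window is for non-creators to wait until the creator has performed its first semop() (visible as a non-zero sem_otime from IPC_STAT); whether the SAFplus patch does this is not stated here.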

The patch was merged into SAFplus to address this condition.

Impact

  • Severity: High. The system would hang until a reboot.
  • Root Cause: Race condition in semaphore creation and initialization.
  • Resolution: Patch ensuring safe semaphore creation.

Conclusion

This case study highlights how subtle synchronization bugs can cripple complex HA (High Availability) systems like SAFplus:

  • Even core services can appear to run, while hidden deadlocks stall recovery.
  • Proper use of IPC_EXCL and cautious handling of semaphores are crucial in multi-process environments.
  • A disciplined debugging approach, moving from observation to deeper investigation, was key to isolating the issue.

Thanks to cloviszilla, users, customers, and the development team can interact easily, which makes problem analysis, handling, and resolution smoother and faster. If you run into an issue, report it at https://cloviszilla.openclovis.com/, and if you have any other questions, please do not hesitate to contact us at support@openclovis.org.