Introduction
A Linux cluster is a group of Linux computers, or nodes, and storage devices that work together and are managed as a single system. A traditional clustering configuration has two nodes connected to shared storage (typically a SAN). With Linux clustering, an application runs on one node, and clustering software monitors its operation.
A Linux cluster provides faster processing speed, larger storage capacity, better data integrity, greater reliability and wider availability of resources.
Failover
Failover is the process by which a standby system takes over when a primary system, network, or database fails or is abnormally terminated, allowing operations to resume.
Failover Cluster
A failover cluster is a set of servers that work together to provide high availability (HA) or continuous availability (CA). As mentioned earlier, if one of the servers goes down, another node in the cluster can take over its workload with minimal or no downtime. Some failover clusters use physical servers, while others use virtual machines (VMs).
CA clusters let users continue accessing and working on services and applications without any timeouts (100% availability) when a server fails. HA clusters, on the other hand, may cause a short interruption in service, but the system recovers automatically with minimal downtime and no data loss.
A cluster is a set of two or more nodes (servers) that exchange data for processing over cables or a dedicated secure network. Other clustering technologies also enable load balancing, storage, and concurrent/parallel processing.

If you look at the above image, Node 1 and Node 2 share common storage. Whenever one node goes down, the other picks up the workload. The two nodes share one virtual IP that all clients connect to.
Let us take a look at the two failover clusters, namely High Availability Failover Clusters and Continuous Availability Failover Clusters.
High Availability Failover Clusters
In a high availability failover cluster, a set of servers shares data and resources in the system, and all nodes have access to the shared storage.
High availability clusters also include a monitoring connection that servers use to check the “heartbeat” or health of the other servers. At any time, at least one of the nodes in a cluster is active, while at least one is passive.
Continuous Availability Failover Clusters
This system consists of multiple systems that share a single copy of the operating system; software commands issued on one system are also executed on the other systems. In the event of a failover, the user can verify critical data in a transaction.
There are several failover cluster types, such as Windows Server Failover Clusters (WSFC), VMware failover clusters, SQL Server failover clusters, and Red Hat Linux failover clusters.
Pre-Requisites
- OpsRamp Classic Gateway 11.0 and above, or OpsRamp Cluster Gateway.
- Ensure that the "Adapter Integrations" add-on is enabled in the client configuration. Once enabled, the Linux Fail-over Cluster integration appears under Setup > Integrations > Adapter.
- Credentials: root, or a non-root user that is a member of the "haclient" group.
- Cluster management: Pacemaker.
- Accessibility: All nodes within a cluster should be accessible by a single credential set.
- For non-root users: update the "~/.bashrc" file with the "pcs" command path on all cluster nodes.
Ex: add export PATH=$PATH:/usr/sbin as a new line in the ~/.bashrc file.
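The PATH update for a non-root user can be applied with a small sketch like the one below. It assumes pcs is installed in /usr/sbin, its usual location; adjust the path for your distribution if needed.

```shell
# Append the pcs binary directory to a non-root user's PATH.
# Assumption: pcs lives in /usr/sbin (the usual location).
BASHRC="${HOME}/.bashrc"
LINE='export PATH=$PATH:/usr/sbin'
# Append only if the exact line is not already present (idempotent).
grep -qxF "$LINE" "$BASHRC" 2>/dev/null || echo "$LINE" >> "$BASHRC"
```

Run this once per cluster node for the monitoring user, then open a new shell (or source ~/.bashrc) so the change takes effect.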
RGManager(Non-Pacemaker)
Pre-Requisites
- Credentials: both root and non-root users are supported.
- Cluster management: RGManager
- Accessibility: All the nodes within a cluster should be accessible by a single credential set.
- For non-root users: add the following commands to the "/etc/sudoers" file to allow non-root users to execute them:
/usr/sbin/cman_tool nodes, /usr/sbin/cman_tool status, /usr/sbin/clustat -l, /sbin/service cman status, /sbin/service rgmanager status, /sbin/service corosync status, /usr/sbin/dmidecode -s system-uuid, /bin/cat /sys/class/dmi/id/product_serial
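A sudoers fragment granting these commands might look like the sketch below. The user name "opsuser" is hypothetical; the fragment is written to /tmp so it can be reviewed and installed via visudo rather than applied directly.

```shell
# Hypothetical sudoers fragment for a non-root monitoring user
# ("opsuser" is a placeholder). Review it, then install it with
# visudo (e.g. into /etc/sudoers.d/) rather than editing /etc/sudoers
# by hand.
cat > /tmp/rgmanager-monitoring <<'EOF'
opsuser ALL=(root) NOPASSWD: /usr/sbin/cman_tool nodes, /usr/sbin/cman_tool status, /usr/sbin/clustat -l, /sbin/service cman status, /sbin/service rgmanager status, /sbin/service corosync status, /usr/sbin/dmidecode -s system-uuid, /bin/cat /sys/class/dmi/id/product_serial
EOF
```

Validating the fragment with `visudo -c -f /tmp/rgmanager-monitoring` before installing it is a good habit, since a syntax error in sudoers can lock out sudo entirely.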
Note: A Linux cluster is usually configured with a virtual IP, commonly called the cluster virtual IP. Use this IP when adding configurations during installation of the integration.
If no cluster virtual IP is configured, provide the IP address of a reachable node associated with the cluster.
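On Pacemaker clusters, the cluster virtual IP usually appears as an IPaddr2 resource in `pcs status` output. The sketch below parses sample output to find the node currently hosting it; the resource and node names are illustrative, and on a real cluster you would pipe the actual `pcs status resources` output instead.

```shell
# Illustrative: find the node an IPaddr2 (virtual IP) resource is
# running on. The here-doc stands in for real `pcs status resources`
# output; the names cluster_vip/node1/node2 are hypothetical.
sample_output=$(cat <<'EOF'
 cluster_vip    (ocf::heartbeat:IPaddr2):       Started node1
 webserver      (ocf::heartbeat:apache):        Started node2
EOF
)
# Print the last field (the hosting node) of the IPaddr2 line.
echo "$sample_output" | awk '/IPaddr2/ {print $NF}'   # prints: node1
```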
Risks, Limitations & Assumptions
- Use Pacemaker software to build a local 3-node Linux failover cluster.
- The app uses cluster-specific bash commands (for example, the pcs status command) to fetch cluster details.
- Each node's /etc/hosts file should contain the DNS details of all other nodes in the cluster.
- The credential should have read-only permission sufficient to fetch data from pcs and other Pacemaker-specific commands.
- All nodes of a cluster should share the same credential set.
- The app can handle the following two types of alerts:
- UnknownHost Exception
- JSch Exception
- The metric name is used as the error type; here, the two metrics correspond to the exceptions above.
- The app sends a critical alert on the first poll in which it encounters one of the above issues, and a recovery alert on the first subsequent poll in which the issue is no longer seen.
- The app cannot control monitoring pause/resume actions based on these alerts.
- The platform supports enabling/disabling the configuration, so when a particular notification is generated, the customer can take action through the UI or commands.
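The credential requirement above (root, or a non-root member of the haclient group for Pacemaker) can be sanity-checked on a node with a sketch like this:

```shell
# Check whether the current user satisfies the documented credential
# requirement: root, or a non-root member of the "haclient" group.
if [ "$(id -u)" -eq 0 ]; then
  echo "ok: root"
elif id -nG | tr ' ' '\n' | grep -qx haclient; then
  echo "ok: haclient member"
else
  echo "warning: not root and not in haclient"
fi
```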
Install the integration
- From All Clients, select a client
- Go to Setup > Integrations > Integrations.
- From Available Integrations, select Adapter > Linux Fail-over Cluster. The Install Linux Fail-over Cluster Integration pop-up appears.
Note: Ensure that the Adapter add-on is enabled at both the client and partner levels.
- Enter the following information:
Object Name | Description |
---|---|
Name | Name of the Integration. |
Upload Logo | Optional logo for the integration. |
Gateway Profiles | Select a gateway management profile to associate with the client. To learn how to create a gateway profile, refer to the Register a Gateway article. |

Configure the integration
- In the CONFIGURATION section, click +Add.
- On Create Adapter Configuration, enter:
- Name: Configuration name.
- IP Address / Host Name: IP address or host name of the target.
- Notification Alerts: Select TRUE or FALSE.
Notes:
- By default, FALSE is selected.
- If you select TRUE, the app generates critical/recovery failure alert notifications for connectivity and authentication exceptions.
- From the Credentials section, select Custom, and enter the Username and Password.
Note: These credentials are required to communicate with the target (cluster).
- From the Resource Types & Metrics section, select the metrics and configure availability and alert conditions for Cluster & Server.
- In the Discovery Schedule section, configure how frequently the discovery action should trigger. Select Recurrence Pattern to add one of the following patterns:
- Minutes
- Hourly
- Daily
- Weekly
- Monthly
- In the Monitoring Schedule section, configure how frequently the monitoring action should trigger.
- Click Save.
After the configuration is saved, resources are discovered and monitored as specified in the configuration profile.
The configuration is saved and displayed on the page.
You can also trigger actions manually, such as Discovery or Monitoring, or disable the configuration.
The discovered resource(s) are displayed on the Infrastructure page under "Cluster", with the Native Resource Type "Linux Cluster".
Note: The cluster nodes are displayed under Components.

Note: You can select Pacemaker or RGManager from the Cluster Type drop-down.
Supported Metrics
Resource Type: Cluster
Pacemaker
Metric Names | Description | Display Name | Unit |
---|---|---|---|
linux_cluster_nodes_status | Status of each node in the Linux cluster: 0 - offline, 1 - online, 2 - standby. | Cluster Node Status | |
linux_cluster_system_OS_Uptime | Time elapsed since last reboot in minutes. | System Uptime | m |
linux_cluster_system_cpu_Load | Monitors the system's last 1-, 5-, and 15-minute load averages, reported per CPU core. | System CPU Load | |
linux_cluster_system_cpu_Utilization | The percentage of elapsed time that the processor spends executing a non-idle thread (this does not include CPU steal time). | System CPU Utilization | % |
linux_cluster_system_memory_Usedspace | Physical and virtual memory usage in GB | System Memory Used Space | GB |
linux_cluster_system_memory_Utilization | Physical and virtual memory usage in percentage. | System Memory Utilization | % |
linux_cluster_system_cpu_Usage_Stats | Monitors CPU time, in percent, spent in various modes. User - time spent running user-space processes. System - time spent running the kernel. IOWait - time the CPU is idle while waiting for an I/O operation to complete. Idle - time the processor spends idle. Steal - time the virtual CPU waits for the hypervisor to service another virtual CPU on a different virtual machine. Also reports Kernel Time and Total Time. | System CPU Usage Statistics | % |
linux_cluster_system_disk_Usedspace | Monitors disk used space in GB | System Disk UsedSpace | GB |
linux_cluster_system_disk_Utilization | Monitors disk utilization in percentage | System Disk Utilization | % |
linux_cluster_system_disk_Inode_Utilization | This monitor is to collect DISK Inode metrics for all physical disks in a server. | System Disk Inode Utilization | % |
linux_cluster_system_disk_freespace | Monitors the Free Space usage in GB | System FreeDisk Usage | GB |
linux_cluster_system_network_interface_Traffic_In | Monitors In traffic of each interface for Linux Devices | System Network In Traffic | Kbps |
linux_cluster_system_network_interface_Traffic_Out | Monitors Out traffic of each interface for Linux Devices. | System Network Out Traffic | Kbps |
linux_cluster_system_network_interface_Packets_In | Monitors In Packets of each interface for Linux Devices. | System Network In packets | packets/sec |
linux_cluster_system_network_interface_Packets_Out | Monitors Out packets of each interface for Linux Devices. | System Network out packets | packets/sec |
linux_cluster_system_network_interface_Errors_In | Monitors network in errors of each interface for Linux Devices. | System Network In Errors | Errors per Sec |
linux_cluster_system_network_interface_Errors_Out | Monitors Network Out traffic of each interface for Linux Devices | System Network Out Errors | Errors per Sec |
linux_cluster_system_network_interface_discards_In | Monitors network In discards of each interface for Linux devices. | System Network In discards | packets/sec |
linux_cluster_system_network_interface_discards_Out | Monitors network Out discards of each interface for Linux devices. | System Network Out discards | packets/sec |
linux_node_system_OS_Uptime | Time elapsed since last reboot in minutes. | System Uptime | m |
linux_node_system_cpu_Load | Monitors the system's last 1 min, 5 min and 15min load. It sends per cpu core load average. | System CPU Load | |
linux_node_system_cpu_Utilization | The percentage of elapsed time that the processor spends to execute a non-Idle thread(This doesn't includes CPU steal time). | System CPU Utilization | % |
linux_node_system_memory_Usedspace | Physical and virtual memory usage in GB. | System Memory Used Space | GB |
linux_node_system_memory_Utilization | Physical and virtual memory usage in percentage. | System Memory Utilization | % |
linux_node_system_cpu_Usage_Stats | Monitors CPU time, in percent, spent in various modes. User - time spent running user-space processes. System - time spent running the kernel. IOWait - time the CPU is idle while waiting for an I/O operation to complete. Idle - time the processor spends idle. Steal - time the virtual CPU waits for the hypervisor to service another virtual CPU on a different virtual machine. Also reports Kernel Time and Total Time. | System CPU Usage Statistics | % |
linux_node_system_disk_Usedspace | Monitors disk used space in GB. | System Disk UsedSpace | GB |
linux_node_system_disk_Utilization | Monitors disk utilization in percentage. | System Disk Utilization | % |
linux_node_system_disk_Inode_Utilization | Collects disk inode metrics for all physical disks in a server. | System Disk Inode Utilization | % |
linux_node_system_disk_freespace | Monitors free space usage in GB. | System FreeDisk Usage | GB |
linux_node_system_network_interface_Traffic_In | Monitors In traffic of each interface for Linux Devices. | System Network In Traffic. | Kbps |
linux_node_system_network_interface_Traffic_Out | Monitors Out traffic of each interface for Linux Devices. | System Network Out Traffic | Kbps |
linux_node_system_network_interface_Packets_In | Monitors in Packets of each interface for Linux Devices | System Network In packets | packets/sec |
linux_node_system_network_interface_Packets_Out | Monitors Out packets of each interface for Linux Devices | System Network out packets | packets/sec |
linux_node_system_network_interface_Errors_In | Monitors network in errors of each interface for Linux Devices | System Network In Errors | Errors per Sec |
linux_node_system_network_interface_Errors_Out | Monitors Network Out traffic of each interface for Linux Devices | System Network Out Errors | Errors per Sec |
linux_node_system_network_interface_discards_In | Monitors network In discards of each interface for Linux devices. | System Network In discards | packets/sec |
linux_node_system_network_interface_discards_Out | Monitors network Out discards of each interface for Linux devices. | System Network Out discards | packets/sec |
linux_cluster_service_status_Pacemaker | Pacemaker high-availability cluster manager. Status representation: 0 - failed, 1 - active, 2 - unknown. | Pacemaker Service Status | |
linux_cluster_service_status_Corosync | The Corosync Cluster Engine, a group communication system. Status representation: 0 - failed, 1 - active, 2 - unknown. | Corosync Service Status | |
linux_cluster_service_status_PCSD | PCS GUI and remote configuration interface. Status representation: 0 - failed, 1 - active, 2 - unknown. | PCSD Service Status | |
linux_cluster_Online_Nodes_Count | Online cluster nodes count | Online Nodes Count | count |
linux_cluster_Failover_Status | Provides details about the cluster failover status: 0 - the cluster is running on the same node, 1 - a failover has occurred. | Cluster Failover Status | |
linux_cluster_node_Health | The percentage of online Linux nodes within the cluster. | Cluster Node Health Percentage | % |
linux_cluster_service_Status | Cluster services status: 0 - disabled, 1 - blocked, 2 - failed, 3 - stopped, 4 - recovering, 5 - stopping, 6 - starting, 7 - started, 8 - unknown. | Linux Cluster Service Status | |
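As a rough illustration of how the node-status codes in the table above line up with Pacemaker node states, here is a minimal mapping sketch (states other than offline/online/standby are intentionally left unmapped, since the metric does not define codes for them):

```shell
# Sketch: map a Pacemaker node state string to the integer codes
# used by linux_cluster_nodes_status (0 - offline, 1 - online,
# 2 - standby). Other states are reported as errors, because the
# metric defines no code for them.
node_state_code() {
  case "$1" in
    offline) echo 0 ;;
    online)  echo 1 ;;
    standby) echo 2 ;;
    *)       echo "unmapped state: $1" >&2; return 1 ;;
  esac
}
node_state_code online   # prints 1
```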
RG Manager (Non-Pacemaker)
Metric Names | Description | Display Name | Unit |
---|---|---|---|
linux_cluster_service_Status | Cluster Services Status. The status representation as follows: 0 - disabled, 1-blocked, 2 - failed, 3 - stopped, 4 - recovering, 5 - stopping, 6 - starting, 7 - started, 8 - unknown | Linux Cluster Service Status | |
linux_cluster_service_status_rgmanager | RGManager service status. Status representation: 0 - failed, 1 - active, 2 - unknown. | RGManager Service Status | |
linux_cluster_service_status_CMAN | CMAN service status. Status representation: 0 - failed, 1 - active, 2 - unknown. | CMAN Service Status | |
linux_cluster_system_OS_Uptime | Time elapsed since last reboot in minutes. | System Uptime | m |
linux_cluster_system_cpu_Load | Monitors the system's last 1-, 5-, and 15-minute load averages, reported per CPU core. | System CPU Load | |
linux_cluster_system_cpu_Utilization | The percentage of elapsed time that the processor spends executing a non-idle thread (this does not include CPU steal time). | System CPU Utilization | % |
linux_cluster_system_memory_Usedspace | Physical and virtual memory usage in GB | System Memory Used Space | GB |
linux_cluster_system_memory_Utilization | Physical and virtual memory usage in percentage. | System Memory Utilization | % |
linux_cluster_system_cpu_Usage_Stats | Monitors CPU time, in percent, spent in various modes. User - time spent running user-space processes. System - time spent running the kernel. IOWait - time the CPU is idle while waiting for an I/O operation to complete. Idle - time the processor spends idle. Steal - time the virtual CPU waits for the hypervisor to service another virtual CPU on a different virtual machine. Also reports Kernel Time and Total Time. | System CPU Usage Statistics | % |
linux_cluster_system_disk_Usedspace | Monitors disk used space in GB | System Disk UsedSpace | GB |
linux_cluster_system_disk_Utilization | Monitors disk utilization in percentage | System Disk Utilization | % |
linux_cluster_system_disk_Inode_Utilization | This monitor is to collect DISK Inode metrics for all physical disks in a server. | System Disk Inode Utilization | % |
linux_cluster_system_disk_freespace | Monitors the Free Space usage in GB | System FreeDisk Usage | GB |
linux_cluster_system_network_interface_Traffic_In | Monitors In traffic of each interface for Linux Devices | System Network In Traffic | Kbps |
linux_cluster_system_network_interface_Traffic_Out | Monitors Out traffic of each interface for Linux Devices. | System Network Out Traffic | Kbps |
linux_cluster_system_network_interface_Packets_In | Monitors In Packets of each interface for Linux Devices. | System Network In packets | packets/sec |
linux_cluster_system_network_interface_Packets_Out | Monitors Out packets of each interface for Linux Devices. | System Network out packets | packets/sec |
linux_cluster_system_network_interface_Errors_In | Monitors network in errors of each interface for Linux Devices. | System Network In Errors | Errors per Sec |
linux_cluster_system_network_interface_Errors_Out | Monitors Network Out traffic of each interface for Linux Devices | System Network Out Errors | Errors per Sec |
linux_cluster_system_network_interface_discards_In | Monitors network In discards of each interface for Linux devices. | System Network In discards | packets/sec |
linux_cluster_system_network_interface_discards_Out | Monitors network Out discards of each interface for Linux devices. | System Network Out discards | packets/sec |
linux_node_system_OS_Uptime | Time elapsed since last reboot in minutes. | System Uptime | m |
linux_node_system_cpu_Load | Monitors the system's last 1 min, 5 min and 15min load. It sends per cpu core load average. | System CPU Load | |
linux_node_system_cpu_Utilization | The percentage of elapsed time that the processor spends to execute a non-Idle thread(This doesn't includes CPU steal time). | System CPU Utilization | % |
linux_node_system_memory_Usedspace | Physical and virtual memory usage in GB. | System Memory Used Space | GB |
linux_node_system_memory_Utilization | Physical and virtual memory usage in percentage. | System Memory Utilization | % |
linux_node_system_cpu_Usage_Stats | Monitors CPU time, in percent, spent in various modes. User - time spent running user-space processes. System - time spent running the kernel. IOWait - time the CPU is idle while waiting for an I/O operation to complete. Idle - time the processor spends idle. Steal - time the virtual CPU waits for the hypervisor to service another virtual CPU on a different virtual machine. Also reports Kernel Time and Total Time. | System CPU Usage Statistics | % |
linux_node_system_disk_Usedspace | Monitors disk used space in GB. | System Disk UsedSpace | GB |
linux_node_system_disk_Utilization | Monitors disk utilization in percentage. | System Disk Utilization | % |
linux_node_system_disk_Inode_Utilization | Collects disk inode metrics for all physical disks in a server. | System Disk Inode Utilization | % |
linux_node_system_disk_freespace | Monitors free space usage in GB. | System FreeDisk Usage | GB |
linux_node_system_network_interface_Traffic_In | Monitors In traffic of each interface for Linux Devices. | System Network In Traffic. | Kbps |
linux_node_system_network_interface_Traffic_Out | Monitors Out traffic of each interface for Linux Devices. | System Network Out Traffic | Kbps |
linux_node_system_network_interface_Packets_In | Monitors in Packets of each interface for Linux Devices | System Network In packets | packets/sec |
linux_node_system_network_interface_Packets_Out | Monitors Out packets of each interface for Linux Devices | System Network out packets | packets/sec |
linux_node_system_network_interface_Errors_In | Monitors network in errors of each interface for Linux Devices | System Network In Errors | Errors per Sec |
linux_node_system_network_interface_Errors_Out | Monitors Network Out traffic of each interface for Linux Devices | System Network Out Errors | Errors per Sec |
linux_node_system_network_interface_discards_In | Monitors network In discards of each interface for Linux devices. | System Network In discards | packets/sec |
linux_node_system_network_interface_discards_Out | Monitors network Out discards of each interface for Linux devices. | System Network Out discards | packets/sec |
linux_cluster_Online_Nodes_Count | Online cluster nodes count | Online Nodes Count | count |
linux_cluster_Failover_Status | Provides details about the cluster failover status: 0 - the cluster is running on the same node, 1 - a failover has occurred. | Cluster Failover Status | |
linux_cluster_node_Health | The percentage of online Linux nodes within the cluster. | Cluster Node Health Percentage | % |
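As a rough illustration of the service-status codes above, the sketch below derives the RGManager code (0 - failed, 1 - active, 2 - unknown) from a sample `service rgmanager status` line; the sample string stands in for real command output.

```shell
# Sketch: map "service rgmanager status" output to the status codes
# in the table above. The sample line is illustrative; on a real
# RGManager node you would capture the actual command output.
status_line="rgmanager (pid 1234) is running..."
case "$status_line" in
  *"is running"*)            echo 1 ;;  # active
  *stopped*|*dead*|*failed*) echo 0 ;;  # failed
  *)                         echo 2 ;;  # unknown
esac
```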
Risks, Limitations & Assumptions
- Component-level threshold configuration is not possible.
- Resource-level metric threshold customization and frequency settings are not possible.
- There are usability issues on the app configuration page when adding or editing configurations.
- Optional configuration parameters cannot be defined.
- App upgrade is a manual process without a version change.