High availability enables your IT infrastructure to keep functioning even when some of its components fail. It plays a vital role when a severe service disruption could lead to severe business impact.
It is a concept that entails eliminating single points of failure, so that even if one component, such as a server, fails, the service remains available.
Failover is the process by which a standby takes over when a primary system, network, or database fails or terminates abnormally, so that operations can resume.
A failover cluster is a set of servers that work together to provide High Availability (HA) or Continuous Availability (CA). As mentioned earlier, if one of the servers goes down, another node in the cluster can take over its workload with minimal or no downtime. Some failover clusters use physical servers, whereas others use virtual machines (VMs).
CA clusters allow users to keep accessing and working with services and applications without timeouts (100% availability) when a server fails. HA clusters, on the other hand, may cause a brief interruption in service, but the system recovers automatically with minimal downtime and no data loss.
A cluster is a set of two or more nodes (servers) that transmit data for processing through cables or a dedicated secure network. Other clustering technologies also provide load balancing, storage, or concurrent/parallel processing.
The above image shows an application that runs on a primary or master server. A dedicated redundant server is present to take over on any failure. The redundant server is not configured to perform any other functions. The redundant server is on stand-by with full performance capability.
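The active/standby arrangement described above can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular cluster product implements failover; the `Node` class and the `select_active` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    name: str
    healthy: bool


def select_active(nodes: Sequence[Node]) -> Optional[Node]:
    """Return the first healthy node in priority order, or None if all are down.

    The first entry is treated as the primary; later entries are standbys.
    """
    for node in nodes:
        if node.healthy:
            return node
    return None


cluster = [Node("primary", healthy=True), Node("standby", healthy=True)]
print(select_active(cluster).name)   # primary

cluster[0].healthy = False           # simulate a failure of the primary
print(select_active(cluster).name)   # standby
```

The standby does no other work in this model, which mirrors the dedicated redundant server in the description.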
Veritas Cluster Server (VCS)
Veritas Cluster Server is high-availability cluster software for Unix, Linux, and Microsoft Windows computer systems.
Veritas Cluster Server connects multiple, independent systems into a management framework for increased availability. Each system or node runs its own operating system and cooperates at the software level to form a cluster. VCS links commodity hardware with intelligent software to provide application failover and control. So when a node or a monitored application fails, other nodes take over and bring up services elsewhere in the cluster.
How VCS detects failure
VCS detects failure by issuing specific commands or scripts to monitor the overall health of an application. VCS also determines the health of underlying resources supporting the application, such as network interfaces or file systems.
VCS uses a redundant network heartbeat to differentiate between the loss of a system and the loss of communication between systems.
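Why redundant heartbeat links help can be illustrated with a small sketch. It is a simplified model and not the actual VCS algorithm; the function name `classify_peer` and the three outcome labels are assumptions chosen for illustration.

```python
from typing import Sequence


def classify_peer(links_up: Sequence[bool]) -> str:
    """Classify a peer from the state of each heartbeat link to it.

    links_up holds one boolean per heartbeat link (True = heartbeats arriving).
    """
    if not links_up:
        raise ValueError("at least one heartbeat link is required")
    if all(links_up):
        return "healthy"
    if any(links_up):
        # The peer still answers on some link, so it is running;
        # only part of the interconnect has failed.
        return "link-failure"
    # No heartbeat on any link: with independent links, a failed peer is
    # more likely than every link breaking at the same moment.
    return "peer-down"


print(classify_peer([True, True]))    # healthy
print(classify_peer([True, False]))   # link-failure
print(classify_peer([False, False]))  # peer-down
```

With a single link, the last two cases cannot be told apart, which is the ambiguity that a second link removes.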
How VCS ensures application availability
When VCS detects a node or application failure, it brings application services up on a different node in the cluster. VCS virtualizes IP addresses and system names, so client systems continue to access the application without interruption.
The integration communicates with the Veritas cluster using SSH and shell scripts.
OpsRamp Classic Gateway 10.0 and above, or OpsRamp Cluster Gateway
Ensure that the “adapter integrations” add-on is enabled in the client configuration. Once it is enabled, the Veritas Cluster integration appears under Setup -> Integrations -> Adapter section
Administrator-level or operator-level VCS credentials must be provided in the input configuration. VCS credentials are required to fetch the veritas_cluster_group_State, veritas_cluster_group_Status, veritas_cluster_node_State, veritas_cluster_resource_State, veritas_cluster_resource_Status, and veritas_cluster_group_failover_Status metrics.
To collect the additional metrics (veritas_cluster_lltLinks_State, veritas_cluster_lltInterface_Status), the non-root SSH user needs permission to run the lltstat commands. Add an entry such as the following to “/etc/sudoers”.
cat /etc/sudoers
## Allow root to run any commands anywhere
root ALL=(ALL) ALL
content ALL=NOPASSWD:/usr/sbin/lltstat -n,/usr/sbin/lltstat -nvv configured
If root SSH credentials are provided, the second prerequisite does not need to be configured.
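Before configuring the integration, it can help to confirm that the sudoers entry covers both `lltstat` invocations. The following is a minimal sketch that scans sudoers text for such a rule. The function name `has_lltstat_rule` is hypothetical, and the parsing handles only the single-line form shown above; aliases, included files, and line continuations are not considered.

```python
def has_lltstat_rule(sudoers_text: str, user: str) -> bool:
    """Return True if `user` has a NOPASSWD rule listing both lltstat commands."""
    required = {"/usr/sbin/lltstat -n", "/usr/sbin/lltstat -nvv configured"}
    for raw in sudoers_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 1)
        if len(fields) != 2 or fields[0] != user:
            continue  # rule belongs to a different user
        spec = fields[1]
        if "NOPASSWD:" not in spec:
            continue
        # Everything after NOPASSWD: is a comma-separated command list.
        commands = {c.strip() for c in spec.split("NOPASSWD:", 1)[1].split(",")}
        if required <= commands:
            return True
    return False


sample = """\
## Allow root to run any commands anywhere
root    ALL=(ALL)       ALL
content ALL=NOPASSWD:/usr/sbin/lltstat -n,/usr/sbin/lltstat -nvv configured
"""
print(has_lltstat_rule(sample, "content"))  # True
print(has_lltstat_rule(sample, "other"))    # False
```

The check looks only for an explicit NOPASSWD rule, so it reports False for users such as root whose access comes from a broader rule.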
The IP addresses of all nodes must be publicly reachable, or they must be configured in “/etc/hosts” on every existing node in the Veritas cluster.
[root@centos-node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.26.1.25 centos-node1
172.26.1.26 centos-node2
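One way to verify this prerequisite is to check that every cluster node name appears in the hosts file. The sketch below parses hosts-file text and reports the names that are missing. The function name `missing_hosts` is hypothetical, and `centos-node3` is an invented name used only to show a failing case.

```python
from typing import Iterable, List


def missing_hosts(hosts_text: str, node_names: Iterable[str]) -> List[str]:
    """Return the node names that have no entry in the given hosts-file text."""
    known = set()
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        # Line format: <ip> <canonical name> [aliases...]
        known.update(line.split()[1:])
    return [name for name in node_names if name not in known]


hosts = """\
127.0.0.1   localhost localhost.localdomain
172.26.1.25 centos-node1
172.26.1.26 centos-node2
"""
print(missing_hosts(hosts, ["centos-node1", "centos-node2", "centos-node3"]))
# ['centos-node3']
```

To check a real node, the text could be read from `/etc/hosts` on that node and the function run once per node in the cluster.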
A non-root user running VCS commands must have a home directory on the system where the commands are run.
[content@centos-node1 ~]$ pwd
/home/content
Install the integration
- From All Clients, select a client
- Go to Setup > Integrations > Integrations
- From Available Integrations, select Adapter > Veritas Cluster. The Install Veritas Cluster Integration popup appears.
Note: Ensure that the Adapter add-on is enabled at the client and partner levels.
- Enter the following information:
a. Name: Name of the integration
b. Upload Logo: Optional logo for the integration.
c. GateWay Profiles: Select a gateway management profile to associate with the client.
- Click Install. The Integration page displays the installed integration.
Configure the integration
- In the CONFIGURATION section, click +Add.
- On Create Adapter Configuration, enter:
- Name: Configuration name.
- IP Address/Host Name: IP address or host name of the target.
- Notification Alerts: Select TRUE or FALSE.
- FALSE is selected by default.
- If you select TRUE, the app handles Critical/Recovery alert notifications for connectivity and authentication exceptions.
- From the SSH Credentials section, select Custom and enter Username and Password. These credentials are required to communicate with the target (cluster).
- From the VCS Credentials section, select Custom and enter Username and Password. These credentials are required to fetch cluster-related information.
- From the Resource Types & Metrics section, select the metrics and configure availability and alert conditions for Cluster & Server.
- In the Discovery Schedule section, configure how frequently the discovery action should trigger. Select Recurrence Pattern to add one of the following patterns:
- In the Monitoring Schedule section, configure how frequently the monitoring action should trigger.
- Click Save.
After saving the configuration, the resources are discovered and monitoring is done as specified in the configuration profile.
The configuration is saved and displayed on the page.
You can also trigger actions such as Discovery and Monitoring manually, or disable the configuration.
The discovered resource(s) are displayed in the Infrastructure page under “Cluster”, with Native Resource Type as Veritas Cluster.
The cluster nodes are displayed under Components:
Resource Type: Cluster
|Metric Names||Metric Unit||Metric Description|
|veritas_cluster_group_State||Veritas cluster service group status on each node. Possible values 0-OFFLINE, 1-ONLINE, 2-FAULTED, 3-PARTIAL, 4-STARTING, 5-STOPPING, 6-MIGRATING, 7-OFFLINE|FAULTED, 8-OFFLINE|STARTING, 9-PARTIAL|FAULTED, 10-PARTIAL|STARTING, 11-PARTIAL|STOPPING, 12-ONLINE|STOPPING|
|veritas_cluster_group_Status||Veritas cluster service group status. Possible values 0 - Service group not online on any cluster node, 1 - Service group online on cluster node.|
|veritas_cluster_node_State||Veritas cluster node's status. Possible values 0-RUNNING, 1-ADMIN_WAIT, 2-CURRENT_DISCOVER_WAIT, 3-CURRENT_PEER_WAIT, 4-EXITING, 5-EXITED, 6-EXITING_FORCIBLY, 7-FAULTED, 8-INITING, 9-LEAVING, 10-LOCAL_BUILD, 11-REMOTE_BUILD, 12-STALE_ADMIN_WAIT, 13-STALE_DISCOVER_WAIT, 14-STALE_PEER_WAIT, 15-UNKNOWN|
|veritas_cluster_resource_State||Veritas cluster resource status on each node. Possible values 0-OFFLINE, 1-ONLINE, 2-FAULTED, 3-PARTIAL, 4-STARTING, 5-STOPPING, 6-MIGRATING, 7-OFFLINE|FAULTED, 8-OFFLINE|STARTING, 9-PARTIAL|FAULTED, 10-PARTIAL|STARTING, 11-PARTIAL|STOPPING, 12-ONLINE|STOPPING|
|veritas_cluster_resource_Status||Veritas cluster resource status. Possible values 0 - Resource state is not online on any cluster node, 1 - Resource state is online on any cluster node.|
|veritas_cluster_group_failover_Status||Veritas cluster service group failover status. Possible values 0 - No change. 1 - Cluster group change from one node to another due to failover. 2 - The specific cluster group is not online on any cluster node.|
|veritas_cluster_service_status_LLT||Low latency transport status, used for communication between nodes in the cluster. Possible values are 1-Active, 0-Inactive|
|veritas_cluster_service_status_GAB||Group membership and Atomic Broadcast service status, used for creating membership between all the nodes. Possible values 1-Active, 0-Inactive.|
|veritas_cluster_service_status_Fencing||Fencing service status. Possible values 1-Active, 0-Inactive|
|veritas_cluster_highAvailability_daemon_Status||High availability daemon status, main VCS engine which manages the agents and service groups. Possible values 1-Active, 0-Inactive|
|veritas_cluster_highAvailabilityCompanion_daemon_Status||High availability companion daemon ( hashadow) status. Possible values 1-Active, 0-Inactive|
|veritas_cluster_resourceAgent_daemon_Status||Resource agent daemon status. Possible values 1-Active, 0-Inactive|
|veritas_cluster_clusterMgmt_daemon_Status||Web console cluster management daemon status. Possible values 1-Active, 0-Inactive|
|veritas_cluster_volumeManager_daemon_Status||Volume manager daemon status, manages disk configurations at veritas level. Possible values 1-Active, 0-Inactive|
|veritas_cluster_RunningMode||Veritas cluster running mode of the configuration (/etc/VRTSvcs/conf/config/main.cf). Possible values 1-ReadOnly, 0-Writemode|
|veritas_cluster_running_NodeCount||count||Count of the running cluster nodes at that instance|
|veritas_cluster_node_Health||%||Cluster health - percentage of running nodes|
|veritas_cluster_system_os_Uptime||m||Time elapsed since last reboot in minutes|
|veritas_cluster_system_cpu_Load||Monitors the system's last 1min, 5min and 15min load. It sends per cpu core load average.|
|veritas_cluster_system_cpu_UsageStats||%||Monitors cpu time in percentage spent in various program spaces. User - The processor time spent running user space processes System - The amount of time that the CPU spent running the kernel. IOWait - The time the CPU spends idle while waiting for an I/O operation to complete Idle - The time the processor spends idle Steal - The time virtual CPU has spent waiting for the hypervisor to service another virtual CPU running on a different virtual machine. Kernel Time Total Time|
|veritas_cluster_system_disk_inode_Utilization||%||This monitor is to collect DISK Inode metrics for all physical disks in a server.|
|veritas_cluster_system_disk_FreeSpace||GB||Monitors disk free space in GB|
|veritas_cluster_system_disk_UsedSpace||GB||Monitors disk used space in GB|
|veritas_cluster_system_disk_Utilization||%||Monitors disk utilization in percentage|
|veritas_cluster_system_cpu_Utilization||%||The percentage of elapsed time that the processor spends executing a non-idle thread (this does not include CPU steal time)|
|veritas_cluster_system_memory_UsedSpace||GB||Physical and virtual memory usage in GB|
|veritas_cluster_system_memory_Utilization||%||Physical and virtual memory usage in percentage.|
|veritas_cluster_system_network_interface_OutTraffic||Kbps||Monitors Out traffic of each interface for linux Devices|
|veritas_cluster_system_network_interface_InDiscards||psec||Monitors Network in discards of each interface for linux Devices|
|veritas_cluster_system_network_interface_OutPackets||packets/sec||Monitors Out packets of each interface for linux Devices|
|veritas_cluster_system_network_interface_OutErrors||Errors per Sec||Monitors network out errors of each interface for linux Devices|
|veritas_cluster_system_network_interface_OutDiscards||psec||Monitors network Out Discards of each interface for linux Devices|
|veritas_cluster_system_network_interface_InPackets||packets/sec||Monitors in Packets of each interface for linux Devices|
|veritas_cluster_system_network_interface_InErrors||Errors per Sec||Monitors network in errors of each interface for linux Devices|
|veritas_cluster_system_network_interface_InTraffic||Kbps||Monitors In traffic of each interface for linux Devices|
|veritas_cluster_lltLinks_State||Low latency transport link status on each node. Possible values are 0-CONNWAIT, 1-OPEN|
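Several of the metrics above report numeric codes, so a small lookup helps when reading raw values. The sketch below copies the value tables from the metric descriptions (the resource state metric uses the same table as the group state metric). How the integration itself derives veritas_cluster_node_Health and veritas_cluster_group_failover_Status is not documented here, so `node_health` and `failover_status` are assumptions that simply follow the wording of the descriptions; all function names are hypothetical.

```python
from typing import List, Optional, Sequence

# Value-to-name tables copied from the metric descriptions above;
# the list index is the numeric metric value.
GROUP_STATES: List[str] = [
    "OFFLINE", "ONLINE", "FAULTED", "PARTIAL", "STARTING", "STOPPING",
    "MIGRATING", "OFFLINE|FAULTED", "OFFLINE|STARTING", "PARTIAL|FAULTED",
    "PARTIAL|STARTING", "PARTIAL|STOPPING", "ONLINE|STOPPING",
]
NODE_STATES: List[str] = [
    "RUNNING", "ADMIN_WAIT", "CURRENT_DISCOVER_WAIT", "CURRENT_PEER_WAIT",
    "EXITING", "EXITED", "EXITING_FORCIBLY", "FAULTED", "INITING", "LEAVING",
    "LOCAL_BUILD", "REMOTE_BUILD", "STALE_ADMIN_WAIT", "STALE_DISCOVER_WAIT",
    "STALE_PEER_WAIT", "UNKNOWN",
]


def decode(value: int, names: Sequence[str]) -> str:
    """Translate a numeric metric value into its state name."""
    if 0 <= value < len(names):
        return names[value]
    return f"UNRECOGNIZED({value})"


def node_health(node_state_values: Sequence[int]) -> float:
    """Assumed formula: percentage of nodes whose state value is 0 (RUNNING)."""
    if not node_state_values:
        return 0.0
    running = sum(1 for value in node_state_values if value == 0)
    return 100.0 * running / len(node_state_values)


def failover_status(previous_node: Optional[str], current_node: Optional[str]) -> int:
    """Assumed logic: 0 = no change, 1 = moved to another node, 2 = not online anywhere."""
    if current_node is None:
        return 2
    if previous_node is not None and previous_node != current_node:
        return 1
    return 0


print(decode(9, GROUP_STATES))            # PARTIAL|FAULTED
print(decode(7, NODE_STATES))             # FAULTED
print(node_health([0, 0, 7, 0]))          # 75.0
print(failover_status("node1", "node2"))  # 1
```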
Resource Type: Cluster Nodes
|Metric Names||Metric Unit||Metric Description|
|veritas_cluster_node_lltInterface_Status||Low latency transport interface status on each node. Possible values are 0-DOWN,1-UP|
|veritas_cluster_node_system_os_Uptime||m||Time elapsed since last reboot in minutes|
|veritas_cluster_node_system_cpu_Load||Monitors the system's last 1min, 5min and 15min load. It sends per cpu core load average.|
|veritas_cluster_node_system_cpu_UsageStats||%||Monitors cpu time in percentage spent in various program spaces. User - The processor time spent running user space processes System - The amount of time that the CPU spent running the kernel. IOWait - The time the CPU spends idle while waiting for an I/O operation to complete Idle - The time the processor spends idle Steal - The time virtual CPU has spent waiting for the hypervisor to service another virtual CPU running on a different virtual machine. Kernel Time Total Time|
|veritas_cluster_node_system_disk_inode_Utilization||%||This monitor is to collect DISK Inode metrics for all physical disks in a server.|
|veritas_cluster_node_system_disk_FreeSpace||GB||Monitors disk free space in GB|
|veritas_cluster_node_system_disk_UsedSpace||GB||Monitors disk used space in GB|
|veritas_cluster_node_system_disk_Utilization||%||Monitors disk utilization in percentage|
|veritas_cluster_node_system_cpu_Utilization||%||The percentage of elapsed time that the processor spends executing a non-idle thread (this does not include CPU steal time)|
|veritas_cluster_node_system_memory_UsedSpace||GB||Physical and virtual memory usage in GB|
|veritas_cluster_node_system_memory_Utilization||%||Physical and virtual memory usage in percentage.|
|veritas_cluster_node_system_network_interface_OutTraffic||Kbps||Monitors Out traffic of each interface for linux Devices|
|veritas_cluster_node_system_network_interface_InDiscards||psec||Monitors Network in discards of each interface for linux Devices|
|veritas_cluster_node_system_network_interface_OutPackets||packets/sec||Monitors Out packets of each interface for linux Devices|
|veritas_cluster_node_system_network_interface_OutErrors||Errors per Sec||Monitors network out errors of each interface for linux Devices|
|veritas_cluster_node_system_network_interface_OutDiscards||psec||Monitors network Out Discards of each interface for linux Devices|
|veritas_cluster_node_system_network_interface_InPackets||packets/sec||Monitors in Packets of each interface for linux Devices|
|veritas_cluster_node_system_network_interface_InErrors||Errors per Sec||Monitors network in errors of each interface for linux Devices|
|veritas_cluster_node_system_network_interface_InTraffic||Kbps||Monitors In traffic of each interface for linux Devices.|
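The cpu_Load metrics are described as a per-core load average. One plausible reading, assumed here rather than confirmed by the source, is that each of the 1-, 5-, and 15-minute load averages is divided by the number of CPU cores. The function name `per_core_load` is hypothetical.

```python
import os
from typing import Tuple


def per_core_load(load: Tuple[float, float, float], cores: int) -> Tuple[float, float, float]:
    """Divide the 1, 5 and 15 minute load averages by the core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    one, five, fifteen = load
    return (round(one / cores, 2), round(five / cores, 2), round(fifteen / cores, 2))


print(per_core_load((2.0, 1.0, 0.4), 4))  # (0.5, 0.25, 0.1)

# On Unix-like systems the live values can be read from the OS.
if hasattr(os, "getloadavg"):
    print(per_core_load(os.getloadavg(), os.cpu_count() or 1))
```

A per-core value near or above 1.0 suggests that runnable work is queuing for the available cores.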
Supported product versions: Veritas InfoScale 7.4.2
Risks, Limitations & Assumptions
- Currently, only Linux-based Veritas failover clusters are supported.
- If two configurations are added with the same end-device details (such as IP address and credentials), gaps may appear in the graphs because of parallel internal VCS logins and logouts on the same device.
- Component-level threshold configuration is not possible.
- Resource-level metric threshold customization and frequency settings are not possible.
- The app configuration page has usability issues when adding or editing configurations.
- Optional configuration parameters cannot be defined.
- App upgrade is a manual process, without a version change.
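Because two configurations that point at the same device can cause gaps in the graphs, it may be worth checking a list of planned configurations for repeated targets before adding them. The sketch below is an illustration only: the function name `duplicate_targets`, the (name, target) pair format, and the sample configuration names are hypothetical and not part of the product.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def duplicate_targets(configs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each target used by more than one configuration to those configuration names.

    configs holds (configuration name, target IP address or host name) pairs.
    """
    by_target: Dict[str, List[str]] = defaultdict(list)
    for name, target in configs:
        # Normalize so that "Centos-Node1 " and "centos-node1" compare equal.
        by_target[target.strip().lower()].append(name)
    return {target: names for target, names in by_target.items() if len(names) > 1}


planned = [
    ("vcs-prod", "172.26.1.25"),
    ("vcs-test", "172.26.1.26"),
    ("vcs-prod-copy", "172.26.1.25"),
]
print(duplicate_targets(planned))
# {'172.26.1.25': ['vcs-prod', 'vcs-prod-copy']}
```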