NVIDIA Bright Cluster Manager

Introduction

NVIDIA Bright Cluster Manager offers fast deployment and end-to-end management for heterogeneous high-performance computing (HPC) and AI server clusters at the edge, in the data center, and in multi/hybrid-cloud environments. It automates provisioning and administration for clusters ranging in size from a couple of nodes to hundreds of thousands. It supports both CPU-based and NVIDIA GPU-accelerated systems and enables orchestration with Kubernetes.

Monitoring Use Cases

Device monitoring collects metric values over time according to the configured settings. When a threshold is breached or a metric behaves unexpectedly, it sends alerts to the responsible customer team so that they can act immediately.
This helps keep the customer's business running smoothly, with minimal or zero downtime when infrastructure-related issues occur.
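
The threshold behavior described above can be illustrated with a minimal Python sketch. It is not the product's implementation: the `Threshold` class, the `evaluate` function, the sample readings, and the warning and critical values are all hypothetical. Only the metric name `nvidia_bcm_cluster_smartHdaTemp` comes from this page.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Threshold:
    """Hypothetical threshold rule for one metric (not the product's schema)."""
    metric: str
    warning: float
    critical: float


def evaluate(samples: Iterable[Tuple[int, float]],
             rule: Threshold) -> List[Tuple[int, str, float]]:
    """Return (timestamp, severity, value) for every sample that breaches the rule."""
    alerts = []
    for timestamp, value in samples:
        if value >= rule.critical:
            alerts.append((timestamp, "critical", value))
        elif value >= rule.warning:
            alerts.append((timestamp, "warning", value))
    return alerts


# Assumed example data: (seconds since start, temperature reading).
samples = [(0, 38.0), (60, 46.5), (120, 57.0)]
rule = Threshold("nvidia_bcm_cluster_smartHdaTemp", warning=45.0, critical=55.0)

alerts = evaluate(samples, rule)
for timestamp, severity, value in alerts:
    print(f"{rule.metric} t={timestamp}s {severity}: {value}")
```

In this sketch, each sample is checked against the critical level first, so a reading that crosses both levels raises a single critical alert instead of two.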

Resource Hierarchy

NVIDIA BCM Head Node
NVIDIA BCM Virtual Node
NVIDIA BCM Physical Node
NVIDIA BCM Linux Server

Version History


Application Version	Bug fixes / Enhancements
5.0.6	Added support for the Root Resource UUID as a custom attribute for the NVIDIA Bright Cluster Manager app.
5.0.5	Activity Log, Get Latest Metrics, and debugging changes for NVIDIA Bright Cluster Manager.
5.0.4	Fixed an issue related to component-level threshold alerting.
5.0.3	Fixed enabling and disabling of component-level threshold alerts.
5.0.2	Resource Display Order changes and Sub-Category changes.
5.0.1	Curated dashboard and cache flush changes.
5.0.0	Added the new native type NVIDIA BCM Linux Server and metrics.
4.0.0	Added support for nfs-server metrics on NVIDIA BCM Head Node.
3.0.0	Added support for nfs-mount metrics on NVIDIA BCM Physical Node.
2.0.0	Added support for the new metric "nvidia_bcm_cluster_smartHdaTemp".
1.0.0	Initial support.