Introduction

Trace proxy metrics typically provide insights into the performance, behavior, and health of a system’s distributed tracing component. This document provides information about the various trace metrics monitored and reported by our tracing system.

Trace Proxy Metrics

| Metric | Description |
| --- | --- |
| trace_operations_latency | Span latency in microseconds (µs), categorized by service, operation, and app. Exposed as a Prometheus histogram with predefined buckets {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}, which allows the latency distribution to be analyzed per service, operation, and app. |
| trace_root_operation_latency | Latency of root spans in microseconds (µs), categorized by service, operation, and app. Exposed as a Prometheus histogram with predefined buckets {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}, which allows root span latency to be analyzed across quantiles and reflects overall system performance at the root span level. |
| trace_accepted | Indicates that a new trace has been successfully added to the collector's cache. Useful for tracking trace ingestion. |
| trace_operations_latency_ms | Time difference in milliseconds (ms) between the start and end times of a span, for each trace operation. |
| trace_operations_failed | Number of error events occurring in spans for each trace operation within a trace. |
| trace_operations_succeeded | Number of succeeded events in spans for each trace operation within a trace, that is, operations that completed without errors. |
| trace_spans_count_total | Total number of spans that make up a trace. |
| trace_root_operation_latency_ms | Time difference in milliseconds (ms) between the start and end times of a root span, for each trace operation. Because it covers the root span, it reflects the duration of the entire trace operation. |
| trace_root_span | Number of root spans within a particular operation. Each root span often corresponds to an independent unit of work or transaction, so this metric indicates how many distinct operations or workflows were initiated. |
| trace_spans_count | Total number of spans within a trace, reported for each operation. |
| trace_root_operations_failed | Number of error events occurring in root spans for each trace operation within a trace. Reflects the health and reliability of the root (parent) spans of traced operations. |
| trace_operation_error | Ratio of the number of failed trace operations (trace_operations_failed) to the total count of spans within a trace (trace_spans_count). |
| trace_response_http_status | Total count of requests categorized by HTTP status code, for monitoring and analyzing the distribution of responses. |
| trace_response_grpc_status | Total count of requests categorized by gRPC status code. In gRPC, status codes indicate the success or failure of a remote procedure call (RPC). |
| trace_apdex_latency | Establishes buckets according to the configured Apdex threshold for latency, in milliseconds. As traces are ingested, each one is assigned to a bucket based on its latency, and the counter for that bucket is incremented. The counter for a specific bucket is available through the auto-generated metric trace_apdex_latency_bucket{le="<latency>"}. |
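
To make the histogram and ratio metrics above more concrete, the following sketch shows how latency observations would fall into the listed bucket bounds, and how the trace_operation_error ratio is computed. It assumes standard Prometheus histogram semantics (cumulative buckets, each counting observations less than or equal to its bound, plus an implicit +Inf bucket). The function names are hypothetical and this is not the trace proxy's own code.

```python
from bisect import bisect_left

# Bucket upper bounds (in microseconds) listed for trace_operations_latency.
BUCKETS_US = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]


def cumulative_bucket_counts(latencies_us):
    """Return {upper_bound: count} in the cumulative form Prometheus exposes.

    Each bucket counts observations less than or equal to its bound; the
    "+Inf" bucket counts every observation.
    """
    per_bucket = [0] * (len(BUCKETS_US) + 1)  # last slot is the +Inf overflow
    for value in latencies_us:
        # bisect_left returns the index of the first bound that is >= value.
        per_bucket[bisect_left(BUCKETS_US, value)] += 1

    counts = {}
    running = 0
    for bound, n in zip(BUCKETS_US + ["+Inf"], per_bucket):
        running += n
        counts[bound] = running
    return counts


def error_ratio(failed, spans_count):
    """Ratio described for trace_operation_error: failed operations / spans."""
    return failed / spans_count if spans_count else 0.0
```

Note that any span slower than 1000 µs (1 ms) lands only in the +Inf bucket, so these bounds resolve sub-millisecond latencies; the _ms metrics cover longer durations.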

Trace Metrics

| Metric | Description |
| --- | --- |
| trace_duration_ms | Processing time spent by a span in the trace proxy, in milliseconds. |
| trace_send_dropped | Number of traces dropped by the sampler. When dry run mode is enabled, trace_send_kept increments for each trace that is sent and trace_send_dropped remains 0, because all traces are sent to OpsRamp and none are dropped by the sampler. |
| trace_send_kept | Number of traces sent after the sampling rule is applied. When dry run mode is enabled, this metric increments for each trace, because all traces are sent to OpsRamp, and trace_send_dropped remains 0. |
| trace_send_ejected_full | Number of traces sent because the number of traces exceeded the cache capacity. |
| trace_send_ejected_memsize | Number of traces that could not be retained in the existing cache because of memory size constraints. These traces are moved to a new cache and subsequently sent. |
| trace_send_expired | Number of traces sent because the trace timeout elapsed. |
| trace_send_got_root | Number of traces sent because they have a root span. |
| trace_send_has_root | Count of spans identified as root spans within a trace. |
| trace_send_no_root | Count of spans within a trace that are not identified as root spans. |
| trace_sent_cache_hit | Indicates that the trace proxy received a span belonging to a trace that had already been sent. The trace proxy checks the sampling decision for that trace and either forwards the span immediately to OpsRamp or drops it, depending on the implemented sampling strategy. |
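
The dry run behavior described for trace_send_kept and trace_send_dropped can be sketched as follows. This is only an illustration of the counter semantics: the function name record_sampler_decision and the Counter-based store are hypothetical and do not reflect how the trace proxy is implemented.

```python
from collections import Counter


def record_sampler_decision(counters, keep, dry_run=False):
    """Update trace_send_kept / trace_send_dropped for one trace.

    Returns True when the trace is sent. `keep` is the sampler's decision.
    """
    # In dry run mode the sampler's decision is not enforced, so every trace
    # is sent and counted as kept.
    sent = keep or dry_run
    counters["trace_send_kept" if sent else "trace_send_dropped"] += 1
    return sent


# Usage: three traces, of which the sampler would keep only the first.
decisions = [True, False, False]

normal = Counter()
for keep in decisions:
    record_sampler_decision(normal, keep)

dry = Counter()
for keep in decisions:
    record_sampler_decision(dry, keep, dry_run=True)
```

With dry run enabled, a trace_send_dropped value that stays at 0 is therefore expected and does not mean the sampler is misconfigured.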

Collector Metrics

| Metric | Description |
| --- | --- |
| collector_cache_buffer_overrun | Ideally remains zero. An increase might indicate a problem and could suggest increasing the size of the collector's circular buffer, which is typically configured with the "CacheCapacity" field. An increasing value does not necessarily mean that the cache is full: it can occur while collector_cache_entries (the number of entries in the cache) remains low compared to the configured collector_cache_capacity. |
| collector_cache_capacity | Configured capacity of the collector's cache, that is, the total size of the circular buffer that temporarily stores traces before they are processed or sent. Use it together with collector_cache_entries to assess how full the cache gets over time. |
| collector_cache_entries | Statistical measures (average, maximum, minimum, and percentile values such as p50, p95, and p99) of the number of traces present in the cache, indicating how full the collector's cache is over time. |
| collector_cache_size | Length of the circular buffer that currently stores traces. The buffer holds traces temporarily before they are processed, analyzed, or sent to a destination. |
| collector_incoming_queue | Statistical measures (average, maximum, minimum, and percentile values such as p50, p95, and p99) of the fill level of the queue holding spans that were received from outside the trace proxy and are waiting to be processed by the collector. |
| collector_peer_queue | Statistical measures (average, maximum, minimum, and percentile values such as p50, p95, and p99) of the fill level of the queue holding spans that were received from other trace proxy peers and are waiting to be processed by the collector. |
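
As a rough illustration of reading these metrics together, the sketch below derives a cache fill ratio from collector_cache_entries and collector_cache_capacity and uses it to interpret an increase in collector_cache_buffer_overrun. The function names, the returned labels, and the 0.9 "nearly full" threshold are assumptions made for this example, not values defined by the trace proxy.

```python
def cache_fill_ratio(cache_entries, cache_capacity):
    """Fraction of the collector cache in use (entries / capacity)."""
    if cache_capacity <= 0:
        raise ValueError("cache_capacity must be positive")
    return cache_entries / cache_capacity


def describe_overrun(overrun_increase, cache_entries, cache_capacity,
                     full_threshold=0.9):
    """Classify an overrun observation.

    `overrun_increase` is the growth of collector_cache_buffer_overrun over
    the observation window. `full_threshold` is an assumed cut-off.
    """
    if overrun_increase <= 0:
        return "no_overrun"
    if cache_fill_ratio(cache_entries, cache_capacity) >= full_threshold:
        return "overrun_cache_full"
    # Overruns can occur even while the entry count stays low.
    return "overrun_cache_not_full"
```

For example, `describe_overrun(5, 100, 10000)` returns `"overrun_cache_not_full"`, the case the table calls out where overruns increase while the entry count stays low relative to capacity.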

Routing Metrics

| Metric | Description |
| --- | --- |
| incoming_router_batch | Number of times the trace proxy's batch event processing endpoint is hit by the master instance of the trace proxy. |
| peer_router_batch | Number of times the trace proxy's batch event processing endpoint is hit by a peer instance of the trace proxy. |
| incoming_router_dropped | Number of times the trace proxy fails to add new spans to a receive buffer when processing new events sent from the application to the master instance of the trace proxy. |
| peer_router_dropped | Number of times the trace proxy fails to add new spans to a receive buffer when processing new events from the master instance of the trace proxy. |
| incoming_router_event | Number of times the trace proxy's single event processing endpoint is hit by the master instance of the trace proxy. |
| peer_router_event | Number of times the trace proxy's single event processing endpoint is hit by a peer instance of the trace proxy. |
| incoming_router_nonspan | Number of times the trace proxy's router accepts non-span events (events that are not part of a trace) sent from the application to the master instance of the trace proxy. |
| peer_router_nonspan | Number of times the trace proxy's router accepts non-span events (events that are not part of a trace) from the master instance of the trace proxy. |
| incoming_router_peer | Count of traces routed from the application into the master instance of the trace proxy. |
| peer_router_peer | Count of traces routed from the master instance of the trace proxy into another instance of the trace proxy. |
| incoming_router_proxied | Count of traces routed from the application into the master instance of the trace proxy that have successfully reached the proxy. |
| peer_router_proxied | Count of traces routed from the master instance of the trace proxy into another instance of the trace proxy that have successfully reached the proxy. |
| incoming_router_span | Count of events that the trace proxy accepts from applications and identifies as part of a trace, commonly referred to as spans. |
| peer_router_span | Count of events that the trace proxy accepts from the master instance of the trace proxy and identifies as part of a trace, commonly referred to as spans. |

Transmission Metrics

| Metric | Description |
| --- | --- |
| upstream_enqueue_errors | Count of spans that encountered errors while being dispatched to the OpsRamp environment. |
| peer_enqueue_errors | Count of spans that encountered errors while being dispatched from the master instance of the trace proxy to another instance of the trace proxy. |
| upstream_response_errors | Count of spans that received an error response or a StatusCode greater than 202 when hitting upstream addresses. |
| peer_response_errors | Count of spans that received an error response or a StatusCode greater than 202 when hitting peer addresses from the master instance of the trace proxy. |
| upstream_response_20x | Count of spans that received a successful response (2xx status code) and did not encounter any errors when hitting upstream addresses. |
| peer_response_20x | Count of spans that received a successful response (2xx status code) and did not encounter any errors when hitting peer addresses from the master instance of the trace proxy. |
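
The split between the *_response_errors and *_response_20x counters can be sketched as below. The sketch follows the "StatusCode greater than 202" rule exactly as stated for the error counters; how the trace proxy treats other 2xx codes such as 204 is not clear from the descriptions, so treat that boundary as an assumption. The function name response_metric and the transport_error flag are hypothetical.

```python
def response_metric(prefix, status_code, transport_error=False):
    """Return the name of the counter that a response would increment.

    `prefix` is "upstream" or "peer". `transport_error` marks a request that
    failed without producing a usable response.
    """
    if prefix not in ("upstream", "peer"):
        raise ValueError("prefix must be 'upstream' or 'peer'")
    # Assumed rule, taken from the error counter descriptions: an error
    # response or any status code above 202 counts as an error.
    if transport_error or status_code > 202:
        return f"{prefix}_response_errors"
    return f"{prefix}_response_20x"
```

For example, `response_metric("upstream", 200)` yields `"upstream_response_20x"`, while `response_metric("peer", 500)` yields `"peer_response_errors"`.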

Sampling Metrics

| Metric | Description |
| --- | --- |
| dynsampler_num_dropped | Count of traces dropped by the dynamic sampler. |
| rulessampler_num_dropped | Count of traces dropped by the rules-based sampler. |
| dynsampler_num_kept | Count of traces kept (not dropped) by the dynamic sampler. |
| rulessampler_num_kept | Count of traces kept (not dropped) by the rules-based sampler. |
| dynsampler_sample_rate | Statistical measures (average, maximum, minimum, and percentile values such as p50, p95, and p99) of the sample rate reported by the configured dynamic sampler. |
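
One way to read the kept and dropped counters together is as a kept fraction per sampler. The sketch below assumes the counter values have already been scraped into a plain dictionary keyed by metric name; the helper name kept_fraction and that dictionary layout are hypothetical.

```python
def kept_fraction(samples, sampler):
    """Fraction of traces kept by one sampler, or None if it saw no traces.

    `sampler` is the metric prefix, for example "dynsampler" or
    "rulessampler". Missing counters are treated as 0.
    """
    kept = samples.get(f"{sampler}_num_kept", 0)
    dropped = samples.get(f"{sampler}_num_dropped", 0)
    total = kept + dropped
    return kept / total if total else None


# Usage with made-up counter values.
scraped = {
    "dynsampler_num_kept": 25,
    "dynsampler_num_dropped": 75,
    "rulessampler_num_kept": 10,
}
dyn_fraction = kept_fraction(scraped, "dynsampler")
```

Returning None when a sampler has seen no traces keeps an unused sampler distinguishable from one that dropped everything.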

Cuckoo Cache Metrics

| Metric | Description |
| --- | --- |
| cuckoo_current_capacity | Represents the dropped size of the cuckoo cache as configured in the configuration section. |
| cuckoo_future_load_factor | Represents the fraction of slots that are occupied in the future filter of the cuckoo cache. |
| cuckoo_current_load_factor | Represents the fraction of slots that are occupied in the current filter of the cuckoo cache. |

Note: Additional Process and Go metrics prefixed with process_ and go_ are employed to assess the health of the trace proxy.