Alert policy templates
Alert policies use the following templates to define how the alert is triggered. The alert templates have been created using Prometheus expressions.
Alert templates
Alert channel failed
Last attempt to send alert notifications to channel '{{ $labels.source_name }}'
has failed. You need to try sending a test alert to obtain details.
last_over_time(ybp_alert_manager_channel_status{customer_uuid="$uuid"}[1d]) < 1
Alert notification failed
Last attempt to send alert notifications for customer 'customer name'
failed. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_alert_manager_status{customer_uuid="$uuid"}[1d]) < 1
Alert query failed
Last alert query for customer 'customer name'
failed. YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_alert_query_status[1d]) < 1
Alert rules sync failed
Last alert rules synchronization for customer 'customer name'
has failed. YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_alert_config_writer_status[1d]) < 1
Backup templates
Backup deletion failure
Failed to delete $value
backups for customer 'customer name'
in last GC run. Check logs for more details.
last_over_time(ybp_delete_backup_failure{customer_uuid = "__customerUuid__"}[1d]) {{ query_condition }} 0
Backup failure
Last backup task for universe '$universe_name'
failed. You need to check the backup task result for details.
last_over_time(ybp_create_backup_status{universe_uuid = "$uuid"}[1d]) < 1
Backup schedule failure
Last attempt to run a scheduled backup for universe '$universe_name'
failed due to other backup or universe operation in progress.
last_over_time(ybp_schedule_backup_status{universe_uuid = "$uuid"}[1d]) < 1
PITR config failure
Last snapshot task for universe '$universe_name'
failed. To retry, check PITR configuration task result for more details.
min(ybp_pitr_config_status{universe_uuid = "__universeUuid__"}) {{ query_condition }} 1
DB templates
DB compaction overload
Database compaction rejections detected for universe '$universe_name'
.
sum by (node_prefix) (increase(majority_sst_files_rejections{node_prefix="$node_prefix"}[10m])) > 0
DB core files
Core files detected for universe '$universe_name'
on $value
T-Server instances.
ybp_health_check_tserver_core_files{universe_uuid="$uuid"} > 0
DB drive failure
TServer detected $value
drive failure for universe '$universe_name'
.
count by (universe_uuid) (drive_fault{universe_uuid="__universeUuid__",
export_type="tserver_export"}) {{ query_condition }} {{ query_threshold }}createForNewCustomer: true
DB error logs
Error logs detected for universe '$universe_name'
on $value
Master/TServer instance(s).
sum by (universe_uuid) ((ybp_health_check_node_master_error_logs{universe_uuid="__universeUuid__"} < bool 1) * ignoring (saved_name) (ybp_health_check_node_master_fatal_logs{universe_uuid="__universeUuid__"} == bool 1)) + sum by (universe_uuid) ((ybp_health_check_node_tserver_error_logs{universe_uuid="__universeUuid__"} < bool 1) * ignoring (saved_name) (ybp_health_check_node_tserver_fatal_logs{universe_uuid="__universeUuid__"} == bool 1)) {{ query_condition }} {{ query_threshold }}
DB fatal logs
Fatal logs have been detected for universe '$universe_name'
on $value
Master or T-Server instances.
sum by (universe_uuid) (ybp_health_check_node_master_fatal_logs{universe_uuid="$uuid"} < bool 1) + sum by (universe_uuid) (ybp_health_check_node_tserver_fatal_logs{universe_uuid="$uuid"} < bool 1) > 0
DB instance down
$value
database Master or T-Server instances are down for more than 15 minutes for universe '$universe_name'
.
count by (node_prefix) (
label_replace(
max_over_time(up{export_type=~"master_export|tserver_export",node_prefix="$node_prefix"}[15m]),
"exported_instance",
"$1",
"instance",
"(.*)"
)
<
1
and on (node_prefix, export_type, exported_instance)
(min_over_time(ybp_universe_node_function{node_prefix="$node_prefix"}[15m]) == 1)
)
>
0
DB instance restart
Universe '$universe_name'
Master or T-Server has restarted $value
times during last 30 minutes.
max by (node_prefix) (
changes(yb_node_boot_time{node_prefix="$node_prefix"}[30m])
and on (node_prefix)
(max_over_time(ybp_universe_update_in_progress{node_prefix="$node_prefix"}[31m]) == 0)
)
>
0
DB queues overflow
Database queues overflow has been detected for universe '$universe_name'
.
sum by (node_prefix) (increase(rpcs_queue_overflow{node_prefix="$node_prefix"}[10m]))
+
sum by (node_prefix) (increase(rpcs_timed_out_in_queue{node_prefix="$node_prefix"}[10m]))
>
1
DB memory overload
Database memory rejections have been detected for universe '$universe_name'
.
sum by (node_prefix) (increase(leader_memory_pressure_rejections{node_prefix="$node_prefix"}[10m]))
+
sum by (node_prefix) (
increase(follower_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])
)
+
sum by (node_prefix) (
increase(operation_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])
)
>
0
DB version mismatch
Version mismatch has been detected for universe '$universe_name'
for $value
Master or T-Server instances.
ybp_health_check_tserver_version_mismatch{universe_uuid="$uuid"}
+
ybp_health_check_master_version_mismatch{universe_uuid="$uuid"}
>
0
DB write/read test error
Test YSQL write/read operation failed on $value
nodes for universe '$universe_name'
.
count by (node_prefix) (yb_node_ysql_write_read{node_prefix="$node_prefix"} < 1)
DocDB cache miss percentage is high
DocDB cache miss percentage is high for universe '$universe_name'
. The current value is $value
%.
avg by (universe_uuid) (
sum by (exported_instance, universe_uuid) (
rate(rocksdb_block_cache_miss{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m])
)
/
(
sum by (exported_instance, universe_uuid) (
rate(rocksdb_block_cache_miss{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m])
)
+
sum by (exported_instance, universe_uuid) (
rate(rocksdb_block_cache_hit{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m])
)
)
)
*
100
{{ query_condition }} {{ query_threshold }}
DB node templates
DB node CPU usage
Average node CPU usage for universe '$universe_name'
is more than 90% on $value
nodes.
count by (node_prefix) (
(
100
-
(
avg by (node_prefix, instance) (
avg_over_time(
irate(node_cpu_seconds_total{job="node",mode="idle",node_prefix="$node_prefix"}[1m])[30m:]
)
)
*
100
)
)
>
90
)
DB node down
$value
database nodes are down for more than 15 minutes for universe '$universe_name'
.
count by (node_prefix) (
max_over_time(up{export_type="node_export",node_prefix="$node_prefix"}[15m]) < 1
)
>
0
DB node data disk usage
Node data disk usage for universe '$universe_name'
is above $threshold%
on $value
node(s).
count by (universe_uuid) (count by (universe_uuid, node_name) (100 - (sum without (saved_name) (node_filesystem_free_bytes{mountpoint=~"__mountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) / sum without (saved_name) (node_filesystem_size_bytes{mountpoint=~"__mountPoints__",
universe_uuid="__universeUuid__", fstype!="rootfs"}) * 100) {{ query_condition }} {{ query_threshold }}))
DB node file descriptors usage
Node file descriptors usage for universe '$universe_name'
is above 70% on $value
nodes.
count by (universe_uuid) (ybp_health_check_used_fd_pct{universe_uuid="$uuid"} > 70)
DB node OOM
More than one out of memory (OOM) kills have been detected for universe '$universe_name'
on $value
nodes.
count by (node_prefix) (yb_node_oom_kills_10min{node_prefix="$node_prefix"} > 1) > 0
DB node restart
Universe '$universe_name'
database node has restarted $value
times during last 30 minutes.
max by (node_prefix) (changes(node_boot_time{node_prefix="$node_prefix"}[30m])) > 0
DB node system disk usage
Node system disk usage for universe '$universe_name'
is above $threshold%
on $value
node(s).
count by (universe_uuid) (count by (universe_uuid, node_name) (100 - (sum without (saved_name) (node_filesystem_free_bytes{mountpoint=~"__systemMountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) / sum without (saved_name) (node_filesystem_size_bytes{mountpoint=~"__systemMountPoints__",universe_uuid="__universeUuid__", fstype!="rootfs"}) * 100) {{ query_condition }} {{ query_threshold }}))
Master and Tablets
Leaderless tablets
The tablet leader is missing for more than 5 minutes for $value
tablets in universe '$universe_name'
.
max by (node_prefix) (
count by (node_prefix, exported_instance) (
max_over_time(yb_node_leaderless_tablet{node_prefix="$node_prefix"}[5m])
)
>
0
)
Master leader missing
Master leader is missing for universe '$universe_name'
.
max by (node_prefix) (yb_node_is_master_leader{node_prefix="$node_prefix"}) < 1
Under-replicated master
Master is missing from Raft group or has follower lag higher than $threshold
seconds for universe '$universe_name'
.
(min_over_time((ybp_universe_replication_factor{universe_uuid='{{ $labels.universe_uuid }}'} - on(universe_uuid) count by(universe_uuid) (count by (universe_uuid, exported_instance) (follower_lag_ms{export_type="master_export", universe_uuid='{{ $labels.universe_uuid }}'})))[{{query_threshold }}s:]) > 0 or (max by(universe_uuid) (follower_lag_ms{export_type="master_export", universe_uuid='{{ $labels.universe_uuid }}'}) {{ query_condition }} ({{ query_threshold }} * 1000)))
Under-replicated tablets
$value
tablets remain under-replicated for more than 5 minutes in universe '$universe_name'
.
max by (node_prefix) (
count by (node_prefix, exported_instance) (
max_over_time(yb_node_underreplicated_tablet{node_prefix="$node_prefix"}[5m])
)
>
0
)
Tablet server average read latency is high
Average read latency of tablet server for universe '$universe_name'
is above $threshold%
ms. The current value is $value
milliseconds.
(
avg by (universe_uuid) (
rate(
rpc_latency_sum{export_type="tserver_export",server_type="yb_tserver",service_method="Read",service_type="TabletServerService",universe_uuid="__universeUuid__"}[5m]
)
)
)
/
(
avg by (universe_uuid) (
rate(
rpc_latency_count{export_type="tserver_export",server_type="yb_tserver",service_method="Read",service_type="TabletServerService",universe_uuid="__universeUuid__"}[5m]
)
)
*
1000
)
{{ query_condition }} {{ query_threshold }}
Tablet server average write latency is high
Average write latency of tablet server for universe '$universe_name'
is above $threshold%
ms. The current value is $value
milliseconds.
(
avg by (universe_uuid) (
rate(
rpc_latency_sum{export_type="tserver_export",server_type="yb_tserver",service_method="Write",service_type="TabletServerService",universe_uuid="__universeUuid__"}[5m]
)
)
)
/
(
avg by (universe_uuid) (
rate(
rpc_latency_count{export_type="tserver_export",server_type="yb_tserver",service_method="Write",service_type="TabletServerService",universe_uuid="__universeUuid__"}[5m]
)
)
*
1000
)
{{ query_condition }} {{ query_threshold }}
Resource templates
Clock skew
Maximum clock skew for universe '$universe_name'
is more than 500 milliseconds. The current value is $value
milliseconds.
max by (node_prefix) (max_over_time(hybrid_clock_skew{node_prefix="$node_prefix"}[10m])) / 1000
>
500
Health check error
Failed to perform health check for universe '$universe_name'
. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_health_check_status{universe_uuid="$uuid"}[1d]) < 1
Health check notification error
Failed to issue health check notification for universe '$universe_name'
. You need to check Health notification settings and YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_health_check_notification_status{universe_uuid="$uuid"}[1d]) < 1
Inactive cronjob nodes
$value
nodes have inactive cronjob for universe '$universe_name'
.
ybp_universe_inactive_cron_nodes{universe_uuid = "$uuid"} > 0
Memory consumption
Average memory usage for universe '$universe_name'
nodes is above $threshold%
. Maximum value is $value
.
max by (universe_uuid) ((avg_over_time(node_memory_MemTotal_bytes{universe_uuid="__universeUuid__"}[10m])
- ignoring (saved_name) (avg_over_time(node_memory_Buffers_bytes{universe_uuid="__universeUuid__"}[10m]))
- ignoring (saved_name) (avg_over_time(node_memory_Cached_bytes{universe_uuid="__universeUuid__"}[10m]))
- ignoring (saved_name) (avg_over_time(node_memory_MemFree_bytes{universe_uuid="__universeUuid__"}[10m]))
- ignoring (saved_name) (avg_over_time(node_memory_Slab_bytes{universe_uuid="__universeUuid__"}[10m])))
/ ignoring (saved_name) (avg_over_time(node_memory_MemTotal_bytes{universe_uuid="__universeUuid__"}[10m])))
* 100 {{ query_condition }} {{ query_threshold }}
Metric collection failure
Failed to collect metrics for universe '$universe_name'
. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_universe_metric_collection_status{universe_uuid = "__universeUuid__"}[1d]) {{ query_condition }} 1
Replication lag
Average replication lag for universe '$universe_name'
is above $threshold
milliseconds. Current value is $value
milliseconds.
max by (universe_uuid) (avg_over_time(async_replication_committed_lag_micros{universe_uuid="__universeUuid__"}[10m]) or avg_over_time(async_replication_sent_lag_micros{universe_uuid="__universeUuid__"}[10m])) / 1000 {{ query_condition }} {{ query_threshold }}
Universe OS outdated
More recent OS version is recommended for this universe. Consider running VM image upgrade for the nodes to incorporate security patches and address vulnerabilities.
ybp_universe_os_update_required{universe_uuid="__universeUuid__"} {{ query_condition }} {{ query_threshold }}
Increase in remote bootstraps
Increase in remote bootstraps detected for universe '$universe_name'
.
sum by (universe_uuid) (
increase(
rpc_latency_count{export_type="tserver_export",server_type="yb_consensus",service_method="StartRemoteBootstrap",service_type="ConsensusService",universe_uuid="__universeUuid__"}[5m]
)
)
{{ query_condition }} {{ query_threshold }}
Reactor delays are high
Reactor delays for universe '$universe_name'
is above $threshold%
ms. The current value is $value
milliseconds.
max by (universe_uuid) (
avg by (universe_uuid, saved_name) (
label_replace(
rate(rpc_incoming_queue_time_sum{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m]),
"saved_name",
"rpc_incoming_queue_time_count",
"saved_name",
"(.*)"
)
)
/
(
avg by (universe_uuid, saved_name) (
rate(
rpc_incoming_queue_time_count{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m]
)
)
*
1000
)
{{ query_condition }} {{ query_threshold }}
or
(
avg by (universe_uuid, saved_name) (
label_replace(
rate(
handler_latency_outbound_call_queue_time_sum{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m]
),
"saved_name",
"handler_latency_outbound_call_queue_time_count",
"saved_name",
"(.*)"
)
)
)
/
(
avg by (universe_uuid, saved_name) (
rate(
handler_latency_outbound_call_queue_time_count{export_type="tserver_export",universe_uuid="__universeUuid__"}[5m]
)
)
*
1000
)
{{ query_condition }} {{ query_threshold }}
)
RPC queue size is high
RPC queue size is high for universe '$universe_name'
.
max by (universe_uuid) (
min_over_time(
{export_type="tserver_export",saved_name=~"rpcs_in_queue_.*",universe_uuid="__universeUuid__"}[5m]
)
{{ query_condition }} {{ query_threshold }}
)
WAL cache size is high
WAL cache size is high for nodes '$node_name'
in universe '$universe_name'
. The current value is $value
MB for one of the nodes.
max by (universe_uuid) (
(
sum by (universe_uuid, node_name) (
log_cache_size{export_type="tserver_export",universe_uuid="__universeUuid__"}
)
)
/
1024
)
{{ query_condition }} {{ query_threshold }}
Security templates
Client to node cert expiry
Client to node certificate for universe '$universe_name'
expires in $value
days.
min by (node_name) (ybp_health_check_c2n_cert_validity_days{universe_uuid="$uuid"} < 30)
Client to node CA cert expiry
Client to node CA certificate for universe '$universe_name'
expires in $value
days.
min by (node_name) (ybp_health_check_c2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
Encryption at rest config expiry
Encryption at rest configuration for universe '$universe_name'
expires in $value
days.
ybp_universe_encryption_key_expiry_days{universe_uuid="$uuid"} < 3
Node to node cert expiry
Node to node certificate for universe '$universe_name'
expires in $value
days.
min by (node_name) (ybp_health_check_n2n_cert_validity_days{universe_uuid="$uuid"} < 30)
Node to node CA cert expiry
Node to node CA certificate for universe '$universe_name'
expires in $value
days.
min by (node_name) (ybp_health_check_n2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
Private access key permission status
Invalid permissions of private access key file for universe '$universe_name'
. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_universe_private_access_key_status{universe_uuid = "__universeUuid__"}[1d]) {{ query_condition }} 1
SSH key expiry
SSH key for universe '$universe_name'
will expire in $value
days.
ybp_universe_ssh_key_expiry_day{universe_uuid="__universeUuid__"} {{ query_condition }} {{ query_threshold }}
SSH key rotation failure
Last SSH key rotation task for universe '$universe_name'
failed. To retry, check SSH key rotation task result.
last_over_time(ybp_ssh_key_rotation_status{universe_uuid = "__universeUuid__"}[1d]) {{ query_condition }} 1
YSQL ops and latency
DB YSQLSH connection
YSQLSH connection failure detected for universe '$universe_name'
on $value
TServer instance(s).
count by (universe_uuid) (yb_node_ysql_connect{universe_uuid="__universeUuid__"} < 1) {{ query_condition }} {{ query_threshold }}
New YSQL tables added
New YSQL tables are added to the source universe '$universe_name'
in the database with an existing xCluster configuration, but not added to the xCluster replication.
((count by (namespace_name, universe_uuid)(count by(namespace_name, table_id, universe_uuid)(rocksdb_current_version_sst_files_size{universe_uuid="__universeUuid__",table_type="PGSQL_TABLE_TYPE"}))) - count by(namespace_name, universe_uuid)(count by(namespace_name, universe_uuid, table_id)(async_replication_sent_lag_micros{universe_uuid="__universeUuid__",table_type="PGSQL_TABLE_TYPE"}))) {{ query_condition }} {{ query_threshold }}
Number of YSQL connections is high
Number of YSQL connections for universe '$universe_name'
is above $threshold
. Current value is $value
.
max by (universe_uuid) (max_over_time(yb_node_ysql_connections_count{universe_uuid="__universeUuid__"}[5m])) {{ query_condition }} {{ query_threshold }}
YSQL average latency is high
Average YSQL operations latency for universe '$universe_name'
is above $threshold
milliseconds. Current value is $value
milliseconds.
(sum by (universe_uuid, service_method)(rate(rpc_latency_sum{universe_uuid="__universeUuid__",export_type="ysql_export",server_type="yb_ysqlserver",service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transactions"}[5m])) / sum by (universe_uuid, service_method)(rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="ysql_export",server_type="yb_ysqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transactions"}[5m]))) {{ query_condition }} {{ query_threshold }}
YSQL P99 latency is high
YSQL P99 latency for universe '$universe_name'
is above $threshold
milliseconds. Current value is $value
milliseconds.
max by (universe_uuid) (rpc_latency{universe_uuid="__universeUuid__",server_type="yb_ysqlserver",service_type="SQLProcessor", service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|OtherStmts|Transactions",quantile="p99"}) {{ query_condition }} {{ query_threshold }}
YSQL throughput is high
Maximum throughput for YSQL operations for universe '$universe_name'
is above $threshold
milliseconds. Current value is $value
milliseconds.
sum by (service_method)(rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="ysql_export",server_type="yb_ysqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transactions"}[5m])) {{ query_condition }} {{ query_threshold }}
YCQL ops and latency
DB CQLSH connection
CQLSH connection failure has been detected for universe '$universe_name'
on $value
T-Server instances.
ybp_health_check_cqlsh_connectivity_error{universe_uuid="$uuid"} > 0
Number of YCQL connections is high
Number of YCQL connections for universe '$universe_name'
is above $threshold
. Current value is $value
.
max by (universe_uuid) (max_over_time(rpc_connections_alive{universe_uuid="__universeUuid__",export_type="cql_export"}[5m])) {{ query_condition }} {{ query_threshold }}
YCQL average latency is high
Average YSQL operations latency for universe '$universe_name'
is above $threshold
milliseconds. Current value is $value
milliseconds.
(sum by (service_method)(rate(rpc_latency_sum{universe_uuid="__universeUuid__",export_type="cql_export",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transaction"}[5m])) / sum by (service_method)(rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="cql_export",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transaction"}[5m]))) {{ query_condition }} {{ query_threshold }}
YCQL P99 latency is high
YCQL P99 latency for universe '$universe_name'
is above $threshold
milliseconds. Current value is $value
milliseconds.
max by (universe_uuid)(rpc_latency{universe_uuid="__universeUuid__",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|OtherStmts|Transaction",quantile="p99"}) {{ query_condition }} {{ query_threshold }}
YCQL throughput is high
Maximum throughput for YCQL operations for universe '$universe_name'
is above $threshold
milliseconds. Current value is $value
milliseconds.
sum by (universe_uuid, service_method) (rate(rpc_latency_count{universe_uuid="__universeUuid__",export_type="cql_export",server_type="yb_cqlserver", service_type="SQLProcessor",service_method=~"SelectStmt|InsertStmt|UpdateStmt|DeleteStmt|Transaction"}[5m])) {{ query_condition }} {{ query_threshold }}