Monitoring
Recommended dashboard and alert policies to ensure stable cluster performance.
Alerts
Indexer rollover alert
Too few indexer rollovers are occurring, meaning new chunks are not being created at the expected rate. See No/low indexer rollovers.
Indexer rollover failure alert
Rollovers are failing to complete. The most likely cause is a failed upload to S3, or possibly an inability to persist the metadata to Zookeeper.
Replica assignment capacity
This number indicates the available capacity of the cache nodes: a positive value is excess capacity, and a negative value is a capacity shortage. To resolve a shortage, add cache capacity per the instructions in Adding capacity, or reduce the configured retention. Until this is resolved, query results will be incomplete.
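As a rough illustration of how to read this value, the sketch below classifies a capacity reading by its sign. The function name and the returned labels are hypothetical and not part of Astra; only the sign convention (positive is excess, negative is shortage) comes from the description above, and treating zero as "at_capacity" is an assumption.

```python
from typing import Tuple


def describe_replica_capacity(available_capacity: int) -> Tuple[str, int]:
    """Classify a replica assignment capacity reading (hypothetical helper).

    Returns a (state, magnitude) pair, where magnitude is the absolute
    size of the excess or shortage in whatever unit the metric reports.
    """
    if available_capacity < 0:
        # Negative: cache nodes are short on capacity, so query
        # results will be incomplete until this is resolved.
        return ("shortage", -available_capacity)
    if available_capacity > 0:
        # Positive: cache nodes have spare capacity.
        return ("excess", available_capacity)
    # Zero: no headroom and no shortage (label is an assumption).
    return ("at_capacity", 0)
```

A reading of -3, for example, would be reported as a shortage of 3, which is the case that calls for adding cache capacity or reducing retention.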
Results visible to query
The number of results returned by Astra when querying the last five minutes is below the alerting threshold. The first step is to identify whether this affects only recent results or all results. If no results are available for query, regardless of timeframe, the query nodes or Zookeeper may be experiencing an issue.
If only recent results are missing, this indicates an issue with the preprocessors or indexers. One potential source is lag or trouble connecting to Kafka, which would prevent ingestion of new data.
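The triage described above can be sketched as a small decision function. This is only an illustration of the reasoning, not an Astra tool: the function name, its parameter, and the returned strings are hypothetical, and it assumes the alert has already fired (that is, recent results are below the threshold).

```python
from typing import List


def suspects_for_missing_results(older_results_visible: bool) -> List[str]:
    """Suggest where to look once the results-visible alert has fired.

    older_results_visible: whether a query over an older timeframe
    still returns results (hypothetical input gathered by the operator).
    """
    if not older_results_visible:
        # Nothing is queryable for any timeframe: suspect the query
        # path or the metadata store.
        return ["query nodes", "Zookeeper"]
    # Only recent data is missing: suspect the ingestion path.
    return ["preprocessors", "indexers", "Kafka lag or connectivity"]
```

For instance, if a query over the previous day still returns data, calling the function with `True` points at the ingestion side (preprocessors, indexers, or Kafka) rather than the query nodes.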
Cached recovery tasks size alert
This indicates that too many pending recovery tasks exist in the queue; see Large amount of pending recovery tasks. Scaling up the recovery node count is the quickest way to resolve this. Afterwards, investigate what caused the unexpected increase in tasks.