High Availability
Learn how to set up a highly available OpenTelemetry Collector deployment for production environments.
What Is High Availability (HA)?
High availability (HA) means your telemetry collection and processing infrastructure keeps working even if individual Collector instances fail.
Why High Availability for the Collector?
Avoid data loss from agent-mode collectors when the backend they export to is unavailable.
Ensure telemetry continuity during rolling updates or infrastructure failures.
Enable horizontal scalability for load-balancing traces, logs, and metrics.
Agent-Gateway Architecture
Agent Collectors run on every host, container, or node.
Gateway Collectors are centralized, scalable backend services receiving telemetry from agents.
Each layer can be scaled independently and horizontally.
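For example, an agent-mode collector typically forwards everything it collects to the gateway tier over OTLP. The sketch below assumes the gateway fleet sits behind a load balancer at otel-gateway.example.com (a placeholder hostname):

```yaml
# Agent-mode collector: receive local OTLP traffic and forward it
# to the gateway tier through the load balancer.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    # Placeholder address for the load balancer in front of the gateway collectors.
    endpoint: otel-gateway.example.com:4317
    # Add TLS settings here as appropriate for your network.

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlp]
    # Traces and metrics pipelines follow the same pattern.
```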
Architecture
A typical high-availability OpenTelemetry Collector deployment consists of:
Multiple Collector Instances
Deployed across different availability zones/regions
Each instance capable of handling the full workload
Redundant storage for temporary data buffering
Load Balancer
Distributes incoming telemetry data
Health checks to detect collector availability
Session affinity for consistent routing
Automatic Failover
Occurs if a collector becomes unavailable
Shared Storage Backend
Persistent storage for collector state
Shared configuration management
Metrics and traces storage
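A gateway collector is the mirror image of the agent: it receives OTLP from the agents, batches the data, and exports it to the backend. A minimal sketch, where the backend endpoint is a placeholder:

```yaml
# Gateway collector: receive OTLP from agents, batch, and export to the backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  otlp:
    # Placeholder backend address.
    endpoint: backend.example.com:4317

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    # Traces and metrics pipelines follow the same pattern.
```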
Sizing and Resource Requirements
View the Sizing and Scaling page for a more in-depth guide.
Gateway Collector Requirements
Minimum Configuration:
2 collectors behind a load balancer
2 CPU cores per collector
8 GB of memory per collector
60 GB of usable disk space for the persistent queue per collector
Throughput-Based Sizing
The following table shows the number of collectors needed for a given sustained throughput, assuming each collector has 4 CPU cores and 16 GB of memory:
Throughput | Logs per second | Collectors
5 GB/m | 250,000 | 2
10 GB/m | 500,000 | 3
20 GB/m | 1,000,000 | 5
100 GB/m | 5,000,000 | 25
It's important to over-provision your collector fleet to provide fault tolerance. If one or more collector systems fail or are brought offline for maintenance, the remaining collectors must have enough available capacity to handle the telemetry throughput.
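For example, if the table shows that 10 GB/m requires 3 collectors running at capacity, deploying 4 means the fleet can lose one instance to a failure or a maintenance window and the remaining collectors can still absorb the full load.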
Scaling
Monitor these metrics to determine when to scale:
CPU utilization
Memory usage
Network throughput
Queue length
Error rates
Configure auto-scaling based on:
CPU utilization > 70%
Memory usage > 80%
Request rate per collector
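On Kubernetes, these thresholds map naturally onto a HorizontalPodAutoscaler. The sketch below assumes the gateway collectors run as a Deployment named otel-gateway (a placeholder name); adjust the replica bounds to your sizing:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway        # placeholder Deployment name
  minReplicas: 2              # keep at least two collectors for redundancy
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # scale out above 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80    # scale out above 80% memory
```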
Load Balancer Configuration
Configure a load balancer in front of the gateway collectors with the following health check settings:
Health check endpoint:
/health
Health check interval: 30 seconds
Unhealthy threshold: 3 failures
Healthy threshold: 2 successes
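On the collector side, the endpoint the load balancer probes can be exposed with the health_check extension. A minimal sketch; the port and path are examples and should match whatever your load balancer is configured to check:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # example listen address for health probes
    path: /health             # example path; align with the load balancer health check

service:
  extensions: [health_check]
```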
See the Load Balancing Best Practices guide for more detail.
Resilience
View the Resilience page for a more in-depth guide.
Configure:
Batching - Aggregates telemetry into batches before exporting it (see the example below)
Retry - Retries sending batches when an export fails due to an error or a network outage
Persistent Queue - Stores batches awaiting retry in a sending queue on disk so they survive a collector crash
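As a sketch of the batching step, the batch processor groups telemetry before export; the values below are illustrative and should be tuned for your throughput:

```yaml
processors:
  batch:
    send_batch_size: 8192   # flush once this many items have accumulated
    timeout: 5s             # or after this much time, whichever comes first
```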
Retry
For workloads that cannot afford to have telemetry dropped, consider increasing max_elapsed_time significantly. Keep in mind that a large max_elapsed_time combined with a long backend outage will cause the collector to buffer a significant amount of telemetry to disk.
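Retry behavior is controlled by the exporter's retry_on_failure settings. A sketch with illustrative values; the backend endpoint is a placeholder:

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 1h   # raised well above the default for workloads that cannot drop data
```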
Persistent Queue
The sending queue has three important options:
Number of consumers: Determines how many batches will be retried in parallel
Queue size: Determines how many batches are stored in the queue
Persistent queuing: Allows the collector to buffer telemetry batches to disk
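These options live under the exporter's sending_queue settings, and persistence comes from pairing the queue with the file_storage extension. A minimal sketch; the directory and sizes are examples:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage   # example path; must be writable by the collector

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend
    sending_queue:
      enabled: true
      num_consumers: 10       # batches exported or retried in parallel
      queue_size: 5000        # maximum number of batches held in the queue
      storage: file_storage   # buffer the queue to disk instead of memory

service:
  extensions: [file_storage]
```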
Monitoring and Maintenance
View the Monitoring page for a more in-depth guide.
Health Monitoring
Set up monitoring for:
Collector instance health
Load balancer health
Data throughput
Error rates
Resource utilization
Configure alerts for:
Collector failures
High latency
Error rate thresholds
Resource exhaustion
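Much of this can be driven by the Collector's own internal metrics. As a sketch, the service telemetry section controls how they are exposed; note that the address field shown below is deprecated on newer Collector versions, which configure metric readers instead:

```yaml
service:
  telemetry:
    metrics:
      level: detailed         # emit the full set of internal collector metrics
      address: 0.0.0.0:8888   # Prometheus-style scrape endpoint (default port 8888)
```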
Monitoring the Collectors
To monitor collector logs, set up a Bindplane Collector source that forwards the Collector's own log files:
Add a "Bindplane Collector" source to your configuration
Configure the source with default settings
Push the configuration to your collectors
View the logs in your destination of choice
Best Practices
Resource Allocation
Size collectors for peak load
Include buffer for traffic spikes
Monitor resource usage
Network Configuration
Use dedicated networks
Configure appropriate timeouts
Enable TLS for security
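For the TLS point, a minimal sketch of enabling TLS on the gateway's OTLP receiver; the certificate paths are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/tls/server.crt   # placeholder certificate path
          key_file: /etc/otelcol/tls/server.key    # placeholder key path
```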
Data Management
Implement data buffering
Configure appropriate batch sizes
Set up retry policies
Security
Enable TLS encryption
Implement authentication
Use network policies
Regular security updates
Load Balancing
Configure health checks to ensure collectors are ready to receive traffic
Ensure even connection distribution among collectors
Support both TCP/UDP and HTTP/gRPC protocols
Troubleshooting
Common Issues
Load Balancer Issues
Check health check configuration
Verify network connectivity
Review security groups/firewall rules
Collector Failures
Check resource utilization
Review error logs
Verify configuration
Data Loss
Check buffer configuration
Verify exporter settings
Review retry policies