# High Availability

### What Is High Availability (HA)?

High availability (HA) ensures that your telemetry collection and processing infrastructure continues to work even if individual Collector instances fail.

### Why High Availability for the Collector?

* **Avoid data loss** from agent-mode collectors when the export backend is unavailable.
* **Ensure telemetry continuity** during rolling updates or infrastructure failures.
* **Enable horizontal scalability** for load-balancing traces, logs, and metrics.

{% hint style="info" %}
**NOTE**

Using **Agent-Gateway Architecture** is the recommended deployment pattern for high availability.
{% endhint %}

### Agent-Gateway Architecture

* Agent Collectors run on every host, container, or node.
* Gateway Collectors are centralized, scalable backend services receiving telemetry from agents.
* Each layer can be scaled independently and horizontally.
* For more on when to use agent-gateway architecture, see [Agents v. Gateways](https://docs.bindplane.com/~/revisions/computed_ZbjlSUvw6svhxVqnyLEy_e0db5f493e0ca00556b4d7b523ed53170f30d363/production-checklist/bindplane-otel-collector/agents-v.-gateways).
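As a minimal sketch of the agent side of this pattern, an agent-mode collector can forward all telemetry over OTLP to the gateway tier's load balancer. The hostname `otel-gateway.internal` and the `hostmetrics` scrapers below are placeholders — substitute your own load balancer address and receivers:

```yaml
# Agent-mode collector (sketch): scrape host metrics, forward to the gateway tier.
receivers:
  hostmetrics:
    scrapers:
      cpu:
      memory:

exporters:
  otlp:
    endpoint: otel-gateway.internal:4317  # placeholder: your gateway load balancer
    tls:
      insecure: true  # use real certificates in production

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlp]
```

Because agents address the gateway tier through a single load-balanced endpoint, gateway collectors can be added or replaced without reconfiguring any agents.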

### Architecture

A typical high-availability OpenTelemetry Collector deployment consists of:

1. **Multiple Collector Instances**
   * Deployed across different availability zones/regions
   * Each instance capable of handling the full workload
   * Redundant storage for temporary data buffering
2. **Load Balancer**
   * Distributes incoming telemetry data
   * Health checks to detect collector availability
   * Session affinity for consistent routing
3. **Automatic Failover**
   * Occurs if a collector becomes unavailable
4. **Shared Storage Backend**
   * Persistent storage for collector state
   * Shared configuration management
   * Metrics and traces storage

### Sizing and Resource Requirements

View the [Sizing and Scaling](https://docs.bindplane.com/production-checklist/bindplane-otel-collector/sizing-and-scaling) page for a more in-depth guide.

#### Gateway Collector Requirements

**Minimum Configuration**:

* 2 collectors behind a load balancer
* 2 CPU cores per collector
* 8GB memory per collector
* 60GB usable space for persistent queue per collector

#### Throughput-Based Sizing

The following table shows the number of collectors needed based on expected throughput. This assumes each collector has 4 CPU cores and 16GB of memory:

| Telemetry Throughput | Logs per Second | Collectors |
| -------------------- | --------------- | ---------- |
| 5 GB/min             | 250,000         | 2          |
| 10 GB/min            | 500,000         | 3          |
| 20 GB/min            | 1,000,000       | 5          |
| 100 GB/min           | 5,000,000       | 25         |

It's important to over-provision your collector fleet to provide fault tolerance. If one or more collector systems fail or are brought offline for maintenance, the remaining collectors must have enough available capacity to handle the telemetry throughput.

#### Scaling

1. Monitor these metrics to determine when to scale:
   * CPU utilization
   * Memory usage
   * Network throughput
   * Queue length
   * Error rates
2. Configure auto-scaling based on:
   * CPU utilization > 70%
   * Memory usage > 80%
   * Request rate per collector
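On Kubernetes, the CPU and memory thresholds above can be expressed as a HorizontalPodAutoscaler. This is a sketch, not a definitive configuration — the Deployment name `otel-gateway` and the replica bounds are placeholders for your environment:

```yaml
# HPA sketch: scale gateway collectors on CPU > 70% or memory > 80%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway   # placeholder: your gateway Deployment
  minReplicas: 2         # matches the minimum HA configuration above
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```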

#### Load Balancer Configuration

Configure the load balancer with the following health check settings:

1. Health check endpoint: `/health`
2. Health check interval: 30 seconds
3. Unhealthy threshold: 3 failures
4. Healthy threshold: 2 successes
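For the load balancer's health checks to work, each collector must expose a health endpoint. The collector's `health_check` extension serves one; the bind address and path below mirror the settings above but should be adjusted to your environment:

```yaml
# Expose a health endpoint for load balancer checks (sketch).
extensions:
  health_check:
    endpoint: 0.0.0.0:13133  # default health_check port
    path: /health

service:
  extensions: [health_check]
```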

See [Load Balancing Best Practices](https://docs.bindplane.com/production-checklist/resilience#load-balancing-best-practices) for more detail.

### Resilience

View the [Resilience](https://docs.bindplane.com/production-checklist/bindplane-otel-collector/resilience) page for a more in-depth guide.

Configure:

1. [Batching](https://docs.bindplane.com/production-checklist/resilience#batch) - Aggregates telemetry signals before exporting them.
2. [Retry](https://docs.bindplane.com/production-checklist/resilience#retry) - Retries sending telemetry batches after an error or network outage.
3. [Persistent Queue](https://docs.bindplane.com/production-checklist/resilience#persistent-queuing) - Stores batches awaiting retry in a sending queue on disk, so they survive a collector crash.
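Batching is configured with the `batch` processor, placed in the pipeline ahead of the exporters. The values below are illustrative, not recommendations for every workload:

```yaml
# Batch processor sketch: flush when 8192 items accumulate or 5s elapse.
processors:
  batch:
    send_batch_size: 8192
    timeout: 5s
```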

#### Retry

For workloads that cannot afford to have telemetry dropped, consider increasing `max_elapsed_time` significantly. Keep in mind that a large max elapsed time combined with a long backend outage will cause the collector to "buffer" a significant amount of telemetry to disk.
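Retry behavior is set per exporter via `retry_on_failure`. The endpoint and intervals below are placeholders for illustration:

```yaml
# Retry sketch: back off between attempts, give up after max_elapsed_time.
exporters:
  otlp:
    endpoint: backend.example.com:4317  # placeholder backend
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 1h  # batches still failing after this are dropped
```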

#### Persistent Queue

The sending queue has three important options:

* **Number of consumers**: Determines how many batches will be retried in parallel
* **Queue size**: Determines how many batches are stored in the queue
* **Persistent queuing**: Allows the collector to buffer telemetry batches to disk
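The three options map onto the exporter's `sending_queue` settings, with persistence provided by the `file_storage` extension. The storage directory and queue sizes below are placeholders — choose a directory that survives restarts and size the queue for your expected outage window:

```yaml
# Persistent sending queue sketch: batches are spooled to disk.
extensions:
  file_storage:
    directory: /var/lib/otelcol/storage  # placeholder; must be writable

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend
    sending_queue:
      enabled: true
      num_consumers: 10   # batches retried in parallel
      queue_size: 5000    # batches held in the queue
      storage: file_storage

service:
  extensions: [file_storage]
```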

### Monitoring and Maintenance

View the [Monitoring](https://docs.bindplane.com/production-checklist/bindplane-otel-collector/monitoring) page for a more in-depth guide.

#### Health Monitoring

1. Set up monitoring for:
   * Collector instance health
   * Load balancer health
   * Data throughput
   * Error rates
   * Resource utilization
2. Configure alerts for:
   * Collector failures
   * High latency
   * Error rate thresholds
   * Resource exhaustion

#### Monitoring the Collectors

To monitor collector logs, set up a Bindplane Collector source that collects and forwards the Collector's own log files:

1. Add a "Bindplane Collector" source to your configuration
2. Configure the source with default settings
3. Push the configuration to your collectors
4. View the logs in your destination of choice

### Best Practices

1. **Resource Allocation**
   * Size collectors for peak load
   * Include buffer for traffic spikes
   * Monitor resource usage
2. **Network Configuration**
   * Use dedicated networks
   * Configure appropriate timeouts
   * Enable TLS for security
3. **Data Management**
   * Implement data buffering
   * Configure appropriate batch sizes
   * Set up retry policies
4. **Security**
   * Enable TLS encryption
   * Implement authentication
   * Use network policies
   * Regular security updates
5. **Load Balancing**
   * Configure health checks to ensure collectors are ready to receive traffic
   * Ensure even connection distribution among collectors
   * Support both TCP/UDP and HTTP/gRPC protocols

### Troubleshooting

#### Common Issues

1. **Load Balancer Issues**
   * Check health check configuration
   * Verify network connectivity
   * Review security groups/firewall rules
2. **Collector Failures**
   * Check resource utilization
   * Review error logs
   * Verify configuration
3. **Data Loss**
   * Check buffer configuration
   * Verify exporter settings
   * Review retry policies
