# Pre-processing HTTP Logs for Monitoring in ClickStack

Using the **Process HTTP Logs for ClickHouse** blueprint, you can normalize HTTP fields, mask sensitive data, reduce high-cardinality identifiers, and deduplicate errors, preparing your logs for cost-effective analytics without sacrificing security.

{% embed url="https://www.youtube.com/watch?v=XT4GHDHZVUY" %}

### Overview

The HTTP logs blueprint implements a comprehensive pipeline that processes web server and application logs before they reach ClickHouse. It handles common challenges in HTTP log analytics:

* **High-Cardinality Reduction**: Request IDs, session tokens, and client IPs create cardinality explosions in column stores. The blueprint removes these identifiers while preserving queryable information like HTTP method, status code, and route.
* **Health Check Filtering**: Kubernetes readiness probes and application health endpoints generate massive log volume. The blueprint filters these routine checks to focus analytics on meaningful traffic.
* **Data Masking**: Authorization headers, API keys, and PII in query parameters are redacted before storage, ensuring compliance with security policies.
* **Log Volume Reduction**: Successful HTTP 2xx responses are sampled at 50%, while repeated errors are deduplicated within 30-second windows to prevent alert fatigue and reduce storage costs.

### Bindplane Configuration

To implement this blueprint in your Bindplane deployment:

1. Navigate to [Blueprints](https://app.bindplane.com/p/01J06XSD7F4KT3D0XDE2VQDAR5/blueprints) and choose the **Process HTTP Logs for ClickHouse** Blueprint. Save it to your Library.
2. Open the **Processors** section in any of your Bindplane configurations.
3. Modify the Blueprint to fit your specific dataset and requirements.
4. Verify that the data is formatted correctly by comparing the [Snapshot](https://docs.bindplane.com/feature-guides/snapshots) to the [Live Preview](https://docs.bindplane.com/feature-guides/live-preview), and confirm that the pre-processed output looks as expected.
5. Save your changes and roll out the configuration update to production.

#### Key Processing Steps

The blueprint applies the following transformations in order:

**Parsing**: JSON HTTP log bodies are parsed into structured attributes, enabling field-level processing downstream.
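From the ClickHouse side, parsing means these fields become individually queryable rather than being locked inside the raw body. A quick way to confirm it is working is to compare the body with the parsed attributes; this is a sketch that assumes the default OTel logs schema, where attributes land in the `LogAttributes` map:

```sql
-- Compare the raw body with the parsed attributes for a handful of recent records
SELECT
    Body,
    LogAttributes['http.method']      AS http_method,
    LogAttributes['http.status_code'] AS http_status_code
FROM otel_logs
ORDER BY Timestamp DESC
LIMIT 5
```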

**Filtering**: Health check endpoints (e.g., `/health`, `/healthz`, `/readiness`, `/ping`, `/metrics`) are automatically excluded. Debug and trace-level logs are dropped unless explicitly marked with `debug.keep="true"`.
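Once the configuration is rolled out, you can confirm the filter is catching these endpoints by checking whether any health-check routes still reach ClickHouse. The sketch below assumes the default OTel logs schema and an `http.route` attribute; adjust the key and paths to match your data:

```sql
-- Any rows returned here indicate health-check traffic slipping past the filter
SELECT LogAttributes['http.route'] AS route, count() AS hits
FROM otel_logs
WHERE Timestamp > now() - INTERVAL 1 HOUR
  AND LogAttributes['http.route'] IN ('/health', '/healthz', '/readiness', '/ping', '/metrics')
GROUP BY route
```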

**Data Masking**: Sensitive fields are redacted with asterisks:

* Authorization headers and API keys
* Passwords and tokens
* Cookies and session identifiers
* Client IP addresses
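To spot-check the redaction after rollout, look for values that do not contain the masking asterisks. The attribute key below is illustrative; substitute whichever keys carry sensitive data in your logs:

```sql
-- Count recent records whose authorization attribute is present but not masked
SELECT count() AS unmasked_rows
FROM otel_logs
WHERE Timestamp > now() - INTERVAL 1 HOUR
  AND LogAttributes['http.request.header.authorization'] != ''
  AND LogAttributes['http.request.header.authorization'] NOT LIKE '%***%'
```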

**Cardinality Reduction**: High-cardinality fields are removed to prevent storage explosion:

* Request IDs and correlation IDs
* Session IDs and user identifiers
* User agent strings and client addresses
* Trace IDs and span IDs (when trace correlation isn't required)
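You can measure the effect by comparing distinct-value counts for these attributes before and after enabling the blueprint. A sketch, assuming the default OTel logs schema; the attribute keys are examples and may differ in your data:

```sql
-- Distinct values per high-cardinality attribute over the last day;
-- these should drop to near zero once the blueprint removes the identifiers
SELECT
    uniqExact(LogAttributes['request.id'])     AS request_ids,
    uniqExact(LogAttributes['session.id'])     AS session_ids,
    uniqExact(LogAttributes['client.address']) AS client_addresses
FROM otel_logs
WHERE Timestamp > now() - INTERVAL 1 DAY
```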

**Normalization**: Missing HTTP fields receive sensible defaults:

* `http.method` defaults to `UNKNOWN`
* `http.status_code` defaults to `0`
* `service.name` defaults to `unknown`
* `severity_text` defaults to `INFO`
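These defaults also make gaps easy to find after the fact: counting how often a fallback value appears tells you which services emit logs with missing HTTP fields. A sketch assuming the default OTel logs schema:

```sql
-- Services emitting logs that needed the UNKNOWN default for http.method in the last day
SELECT ServiceName, count() AS defaulted_logs
FROM otel_logs
WHERE Timestamp > now() - INTERVAL 1 DAY
  AND LogAttributes['http.method'] = 'UNKNOWN'
GROUP BY ServiceName
ORDER BY defaulted_logs DESC
```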

**Sampling and Deduplication**: Success logs (2xx status) are sampled at 50% to reduce volume. Error logs (4xx/5xx) and high-severity events are deduplicated within 30-second windows, collapsing repeated errors into single entries with an `error_count` field.
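To see how much the deduplication is collapsing, compare the number of stored error entries with the original error volume implied by `error_count`. This sketch assumes `error_count` lands in the `LogAttributes` map as a string value:

```sql
-- Stored (deduplicated) error entries vs. the original number of errors they represent
SELECT
    ServiceName,
    count() AS stored_error_entries,
    sum(greatest(toUInt64OrZero(LogAttributes['error_count']), 1)) AS original_errors
FROM otel_logs
WHERE SeverityText IN ('ERROR', 'FATAL')
  AND Timestamp > now() - INTERVAL 1 HOUR
GROUP BY ServiceName
ORDER BY original_errors DESC
```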

**Batching**: Logs are batched in groups of up to 10,000, following the best-practice recommendations for [ClickHouse ingestion](https://clickhouse.com/docs/best-practices/selecting-an-insert-strategy#batch-inserts-if-synchronous).

### Customizing the Blueprint

To adjust the blueprint's behavior for your use case:

* **Disable Health Check Filtering**: Remove the "Filter Health Checks" processor if you need complete visibility into all endpoint traffic
* **Preserve Trace IDs**: Set the `keep.trace.context` attribute to `"true"` on logs that need trace correlation
* **Adjust Sampling Rate**: Modify the sampling processor's `drop_ratio` parameter (currently 0.50) to change the percentage of 2xx logs retained
* **Extend Masking Rules**: Add additional patterns to the redaction rules if your logs contain custom sensitive fields

### Integration with ClickHouse

The blueprint's output is optimized for ClickHouse's OTLP logs schema. Ensure your ClickHouse exporter is configured with:

* **Batch size**: 5000-10000 rows (matching the blueprint's batch processor)
* **Connection timeout**: At least 10s to accommodate network latency
* **Retry logic enabled** for transient failures

Logs flow through the blueprint with their semantic-convention attribute names preserved, allowing queries like:

```sql
-- Assumes the default OTel logs schema, where attributes live in the LogAttributes map
SELECT LogAttributes['http.method'] AS http_method, count() AS requests
FROM otel_logs
WHERE SeverityText != 'TRACE'
GROUP BY http_method
```

### Monitoring and Troubleshooting

Monitor the following metrics to ensure healthy operation:

* **Log Drop Rate**: The health check filter should exclude 10-30% of traffic in typical deployments
* **Error Deduplication**: Check that error logs are collapsing (watch `error_count` distribution)
* **Batch Sizes**: Monitor average batch sizes to confirm logs are batching efficiently
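For the batch-size check, ClickHouse's own query log can report how many rows each insert carried (one insert roughly corresponds to one exporter batch). The sketch below assumes the logs land in `default.otel_logs` and that `system.query_log` is enabled:

```sql
-- Average and maximum rows per insert into the logs table over the last hour
SELECT
    count()           AS inserts,
    avg(written_rows) AS avg_rows_per_insert,
    max(written_rows) AS max_rows_per_insert
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND has(tables, 'default.otel_logs')
  AND event_time > now() - INTERVAL 1 HOUR
```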

If you notice excessive log loss, verify that critical endpoints aren't matching the health check filter pattern. Use the `debug.keep="true"` attribute to preserve specific logs for troubleshooting.

{% hint style="info" %}
**NOTE**

This Blueprint has been tested against standard data patterns. You may need to adjust the configuration to match your specific data format.
{% endhint %}
