# Pre-processing HTTP Logs for Monitoring in ClickStack

Using the **Process HTTP Logs for ClickHouse** blueprint, you can normalize HTTP fields, mask sensitive data, reduce high-cardinality identifiers, and deduplicate errors, preparing your logs for cost-effective analytics without sacrificing security.

{% embed url="<https://www.youtube.com/watch?v=XT4GHDHZVUY>" %}

### Overview

The HTTP logs blueprint implements a comprehensive pipeline that processes web server and application logs before they reach ClickHouse. It handles common challenges in HTTP log analytics:

* **High-Cardinality Reduction**: Request IDs, session tokens, and client IPs create cardinality explosions in column stores. The blueprint removes these identifiers while preserving queryable information like HTTP method, status code, and route.
* **Health Check Filtering**: Kubernetes readiness probes and application health endpoints generate massive log volume. The blueprint filters these routine checks to focus analytics on meaningful traffic.
* **Data Masking**: Authorization headers, API keys, and PII in query parameters are redacted before storage, ensuring compliance with security policies.
* **Log Volume Reduction**: Successful HTTP 2xx responses are sampled at 50%, while repeated errors are deduplicated within 30-second windows to prevent alert fatigue and reduce storage costs.

### Bindplane Configuration

To implement this blueprint in your Bindplane deployment:

1. Navigate to [Blueprints](https://app.bindplane.com/p/01J06XSD7F4KT3D0XDE2VQDAR5/blueprints) and choose the **Process HTTP Logs for ClickHouse** Blueprint. Save it to your Library.
2. Open the **Processors** section in any of your Bindplane configurations.
3. Modify the Blueprint to fit your specific dataset and requirements.
4. Verify that the data is formatted correctly by comparing the [Snapshot](https://docs.bindplane.com/feature-guides/snapshots) with the [Live Preview](https://docs.bindplane.com/feature-guides/live-preview) and confirming that the pre-processed output looks as expected.
5. Save your changes and roll out the configuration update to production.

#### Key Processing Steps

The blueprint applies the following transformations in order:

**Parsing**: JSON HTTP log bodies are parsed into structured attributes, enabling field-level processing downstream.

**Filtering**: Health check endpoints (e.g., `/health`, `/healthz`, `/readiness`, `/ping`, `/metrics`) are automatically excluded. Debug and trace-level logs are dropped unless explicitly marked with `debug.keep="true"`.
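The filtering rules can be read as a single keep-or-drop predicate. The Python sketch below is illustrative only: the blueprint itself is built from Bindplane processors, and the record shape (a dict with `severity_text` and an `attributes` dict) and the `http.route` attribute name are assumptions made for this example.

```python
# Illustrative sketch only; not the blueprint's actual implementation.
# Assumption: a log record is a dict with "severity_text" and an "attributes" dict.
HEALTH_CHECK_PATHS = {"/health", "/healthz", "/readiness", "/ping", "/metrics"}
LOW_SEVERITIES = {"DEBUG", "TRACE"}


def should_keep(record: dict) -> bool:
    """Return True if the record survives the filtering step."""
    attrs = record.get("attributes", {})

    # Drop routine health check traffic (the "http.route" key is assumed).
    if attrs.get("http.route") in HEALTH_CHECK_PATHS:
        return False

    # Drop debug/trace logs unless explicitly marked with debug.keep="true".
    severity = str(record.get("severity_text", "")).upper()
    if severity in LOW_SEVERITIES and attrs.get("debug.keep") != "true":
        return False

    return True
```

A record such as `{"severity_text": "DEBUG", "attributes": {"debug.keep": "true"}}` passes this check, while the same record without the marker is dropped.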

**Data Masking**: Sensitive fields are redacted with asterisks:

* Authorization headers and API keys
* Passwords and tokens
* Cookies and session identifiers
* Client IP addresses
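As a rough illustration of key-based redaction, the following Python sketch replaces the values of sensitive attributes with asterisks. The key patterns and the mask string are assumptions chosen for this example; the blueprint's real redaction rules may match different fields or also inspect values.

```python
import re

# Assumed key patterns; the blueprint's real redaction rules may differ.
SENSITIVE_KEY = re.compile(
    r"authorization|api[_-]?key|password|token|cookie|session|client[._]ip",
    re.IGNORECASE,
)
MASK = "****"


def mask_sensitive(attributes: dict) -> dict:
    """Return a copy of attributes with sensitive values replaced by asterisks."""
    return {
        key: MASK if SENSITIVE_KEY.search(key) else value
        for key, value in attributes.items()
    }
```

Matching on key names keeps non-sensitive fields such as `http.method` untouched, so they remain queryable after masking.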

**Cardinality Reduction**: High-cardinality fields are removed to prevent storage explosion:

* Request IDs and correlation IDs
* Session IDs and user identifiers
* User agent strings and client addresses
* Trace IDs and span IDs (when trace correlation isn't required)
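The removal step amounts to dropping a fixed set of keys, with trace context kept only on request. This Python sketch is a simplified model: the attribute names in both sets are hypothetical, and it assumes the `keep.trace.context` flag described under Customizing the Blueprint is carried as a log attribute.

```python
# Assumed attribute names for illustration; check your own log schema.
HIGH_CARDINALITY_KEYS = {
    "request.id", "correlation.id", "session.id", "user.id",
    "http.user_agent", "client.address",
}
TRACE_KEYS = {"trace_id", "span_id"}


def reduce_cardinality(attributes: dict) -> dict:
    """Return a copy of attributes without high-cardinality identifiers."""
    drop = set(HIGH_CARDINALITY_KEYS)
    # Trace context is only dropped when correlation isn't requested.
    if attributes.get("keep.trace.context") != "true":
        drop |= TRACE_KEYS
    return {k: v for k, v in attributes.items() if k not in drop}
```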

**Normalization**: Missing HTTP fields receive sensible defaults:

* `http.method` defaults to `UNKNOWN`
* `http.status_code` defaults to `0`
* `service.name` defaults to `unknown`
* `severity_text` defaults to `INFO`
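The defaults listed above can be modeled as a fill-in pass over each record. In this Python sketch the flat dict is an assumption made for brevity; in practice `service.name` and `severity_text` usually live on the resource and the log record rather than alongside the HTTP attributes.

```python
DEFAULTS = {
    "http.method": "UNKNOWN",
    "http.status_code": 0,
    "service.name": "unknown",
    "severity_text": "INFO",
}


def normalize(fields: dict) -> dict:
    """Return a copy of fields with defaults filled in for missing values."""
    normalized = dict(fields)
    for key, default in DEFAULTS.items():
        # Treat missing, None, and empty-string values as absent (an assumption).
        if normalized.get(key) in (None, ""):
            normalized[key] = default
    return normalized
```

Filling defaults before storage means queries can group on these fields without special handling for missing values.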

**Sampling and Deduplication**: Success logs (2xx status) are sampled at 50% to reduce volume. Error logs (4xx/5xx) and high-severity events are deduplicated within 30-second windows, collapsing repeated errors into single entries with an `error_count` field.
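To make the two volume controls concrete, here is a minimal Python sketch of probabilistic sampling for 2xx logs and window-based deduplication for errors. It assumes fixed 30-second windows and a deduplication key of service, status code, and body; the blueprint's actual windowing and key fields may differ.

```python
import random

DROP_RATIO = 0.50       # fraction of 2xx logs dropped
WINDOW_SECONDS = 30     # deduplication window for errors


def keep_success(status_code: int, rng: random.Random) -> bool:
    """Sample 2xx logs; non-2xx logs always pass through this step."""
    if 200 <= status_code < 300:
        return rng.random() >= DROP_RATIO
    return True


def dedupe_errors(records: list) -> list:
    """Collapse repeated errors within each fixed 30-second window.

    Assumes each record has "timestamp" (seconds), "service.name",
    "http.status_code", and "body"; the real dedup key may differ.
    """
    collapsed = {}
    for rec in records:
        window = int(rec["timestamp"] // WINDOW_SECONDS)
        key = (window, rec["service.name"], rec["http.status_code"], rec["body"])
        if key in collapsed:
            collapsed[key]["error_count"] += 1
        else:
            collapsed[key] = {**rec, "error_count": 1}
    return list(collapsed.values())
```

Each collapsed entry keeps the first record seen in its window and counts the repeats in `error_count`, so the volume of the original errors is still visible after deduplication.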

**Batching**: Logs are batched in groups of up to 10,000, following the best-practice recommendations for [ClickHouse ingestion](https://clickhouse.com/docs/best-practices/selecting-an-insert-strategy#batch-inserts-if-synchronous).
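Size-based batching can be sketched in a few lines of Python. This is a simplified model that only caps batch size; real batch processors typically also flush on a timeout, which is omitted here, and the `batches` helper is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List

MAX_BATCH_SIZE = 10_000


def batches(records: Iterable[dict], size: int = MAX_BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield lists of at most `size` records, preserving order."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

With 25,000 input records and the default size, this yields batches of 10,000, 10,000, and 5,000 records.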

### Customizing the Blueprint

To adjust the blueprint's behavior for your use case:

* **Disable Health Check Filtering**: Remove the "Filter Health Checks" processor if you need complete visibility into all endpoint traffic
* **Preserve Trace IDs**: Set the `keep.trace.context` attribute to `"true"` on logs that need trace correlation
* **Adjust Sampling Rate**: Modify the sampling processor's `drop_ratio` parameter (currently 0.50) to change the percentage of 2xx logs retained
* **Extend Masking Rules**: Add additional patterns to the redaction rules if your logs contain custom sensitive fields

### Integration with ClickHouse

The blueprint's output is optimized for ClickHouse's OTLP logs schema. Ensure your ClickHouse exporter is configured with:

* **Batch size**: 5,000 to 10,000 rows (matching the blueprint's batch processor)
* **Connection timeout**: At least 10s to accommodate network latency
* **Retry logic enabled** for transient failures

The blueprint preserves the semantic HTTP attributes on the logs it forwards, allowing queries like:

```sql
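-- Assumes the default otel_logs schema, where log attributes are stored in the LogAttributes map.
SELECT LogAttributes['http.method'] AS http_method, count() AS requests
FROM otel_logs
WHERE SeverityText != 'TRACE'
GROUP BY http_method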
```

### Monitoring and Troubleshooting

Monitor the following metrics to ensure healthy operation:

* **Log Drop Rate**: The health check filter should exclude 10-30% of traffic in typical deployments
* **Error Deduplication**: Check that error logs are collapsing (watch `error_count` distribution)
* **Batch Sizes**: Monitor average batch sizes to confirm logs are batching efficiently

If you notice excessive log loss, verify that critical endpoints aren't matching the health check filter pattern. Use the `debug.keep="true"` attribute to preserve specific logs for troubleshooting.

{% hint style="info" %}
**NOTE**

This Blueprint has been tested against standard data patterns. You may need to adjust the configuration to match your specific data format.
{% endhint %}

