Pre-processing HTTP Logs for monitoring in ClickStack

Optimizing HTTP and application logs for ClickHouse analytics requires balancing data completeness with storage efficiency.

Using the Process HTTP Logs for ClickHouse blueprint, you can normalize HTTP fields, mask sensitive data, reduce high-cardinality identifiers, and deduplicate errors, preparing your logs for cost-effective analytics without sacrificing security.

Overview

The HTTP logs blueprint implements a comprehensive pipeline that processes web server and application logs before they reach ClickHouse. It handles common challenges in HTTP log analytics:

  • High-Cardinality Reduction: Request IDs, session tokens, and client IPs create cardinality explosions in column stores. The blueprint removes these identifiers while preserving queryable information like HTTP method, status code, and route.

  • Health Check Filtering: Kubernetes readiness probes and application health endpoints generate massive log volume. The blueprint filters these routine checks to focus analytics on meaningful traffic.

  • Data Masking: Authorization headers, API keys, and PII in query parameters are redacted before storage, ensuring compliance with security policies.

  • Log Volume Reduction: Successful HTTP 2xx responses are sampled at 50%, while repeated errors are deduplicated within 30-second windows to prevent alert fatigue and reduce storage costs.

Bindplane Configuration

To implement this blueprint in your Bindplane deployment:

  1. Navigate to Blueprints and choose the Process HTTP Logs for ClickHouse Blueprint. Save it to your Library.

  2. Open the Processors section in any of your Bindplane configurations.

  3. Modify the Blueprint to fit your specific dataset and requirements.

  4. Ensure the data is formatted correctly by comparing the Snapshot to the Live Preview, and confirm that the pre-processed output looks as expected.

  5. Save your changes and roll out the configuration update to production.

Key Processing Steps

The blueprint applies the following transformations in order:

Parsing: JSON HTTP log bodies are parsed into structured attributes, enabling field-level processing downstream.

Filtering: Health check endpoints (e.g., /health, /healthz, /readiness, /ping, /metrics) are automatically excluded. Debug and trace-level logs are dropped unless explicitly marked with debug.keep="true".
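The filtering step can be sketched as a simple predicate. This is an illustrative Python sketch, not the blueprint's actual processor configuration; the attribute names (`http.target`, `debug.keep`) and the endpoint list are assumptions based on the description above.

```python
# Endpoints treated as routine health checks (assumed list, matching the
# examples in the text above).
HEALTH_PATHS = {"/health", "/healthz", "/readiness", "/ping", "/metrics"}

def should_drop(log: dict) -> bool:
    """Return True if a log record would be filtered out by this step."""
    attrs = log.get("attributes", {})
    # Debug/trace logs are dropped unless explicitly marked to keep.
    if log.get("severity_text") in ("DEBUG", "TRACE"):
        return attrs.get("debug.keep") != "true"
    # Routine health-check traffic is excluded; strip any query string first.
    return attrs.get("http.target", "").split("?")[0] in HEALTH_PATHS
```

For example, `should_drop({"severity_text": "INFO", "attributes": {"http.target": "/healthz"}})` returns `True`, while a `DEBUG` record carrying `debug.keep="true"` is preserved.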

Data Masking: Sensitive fields are redacted with asterisks:

  • Authorization headers and API keys

  • Passwords and tokens

  • Cookies and session identifiers

  • Client IP addresses
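A redaction pass of this kind amounts to matching attribute keys against sensitive-field patterns and replacing the values. The sketch below is a minimal Python illustration under assumed key names; the blueprint's actual redaction rules may differ.

```python
import re

# Assumed patterns covering the field categories listed above.
SENSITIVE_KEYS = re.compile(
    r"(authorization|api[_-]?key|password|token|cookie|session[_-]?id|client[_-]?ip)",
    re.IGNORECASE,
)

def mask_sensitive(attrs: dict) -> dict:
    """Replace values of sensitive attribute keys with asterisks."""
    return {k: ("****" if SENSITIVE_KEYS.search(k) else v)
            for k, v in attrs.items()}
```

Matching on key names rather than values keeps the pass cheap and deterministic; value-based PII detection in query parameters would need additional patterns.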

Cardinality Reduction: High-cardinality fields are removed to prevent storage explosion:

  • Request IDs and correlation IDs

  • Session IDs and user identifiers

  • User agent strings and client addresses

  • Trace IDs and span IDs (when trace correlation isn't required)
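The cardinality-reduction step drops a fixed set of identifier keys, with an opt-out for trace context. A hedged Python sketch, with an assumed key list:

```python
# Assumed set of high-cardinality attribute keys to drop.
HIGH_CARDINALITY_KEYS = {
    "request_id", "correlation_id", "session_id", "user_id",
    "user_agent", "client.address", "trace_id", "span_id",
}

def reduce_cardinality(attrs: dict, keep_trace: bool = False) -> dict:
    """Drop high-cardinality keys; optionally retain trace context."""
    keep = {"trace_id", "span_id"} if keep_trace else set()
    return {k: v for k, v in attrs.items()
            if k not in HIGH_CARDINALITY_KEYS or k in keep}
```

The `keep_trace` flag mirrors the blueprint's trace-correlation opt-out described under "Customizing the Blueprint".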

Normalization: Missing HTTP fields receive sensible defaults:

  • http.method defaults to UNKNOWN

  • http.status_code defaults to 0

  • service.name defaults to unknown

  • severity_text defaults to INFO
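Normalization is a matter of merging defaults underneath whatever fields are present. A minimal sketch (treating all fields as one flat record for simplicity, which is an assumption):

```python
# Defaults exactly as listed above.
DEFAULTS = {
    "http.method": "UNKNOWN",
    "http.status_code": 0,
    "service.name": "unknown",
    "severity_text": "INFO",
}

def normalize(record: dict) -> dict:
    """Fill in missing or null HTTP fields with the blueprint's defaults."""
    present = {k: v for k, v in record.items() if v is not None}
    return {**DEFAULTS, **present}
```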

Sampling and Deduplication: Success logs (2xx status) are sampled at 50% to reduce volume. Error logs (4xx/5xx) and high-severity events are deduplicated within 30-second windows, collapsing repeated errors into single entries with an error_count field.
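The sampling and deduplication logic described above can be sketched as follows. This is an illustrative Python model, not the blueprint's implementation; the error fingerprint (however the processor groups "repeated" errors) is left to the caller.

```python
import random

class ErrorDeduplicator:
    """Collapse repeated errors within a time window, counting duplicates."""

    def __init__(self, window_s: float = 30.0):
        self.window_s = window_s
        self.seen = {}  # fingerprint -> (window_start_ts, count)

    def admit(self, fingerprint: str, now: float) -> bool:
        """Return True if this error should be emitted; otherwise it is
        suppressed and folded into the window's error_count."""
        first, count = self.seen.get(fingerprint, (None, 0))
        if first is None or now - first > self.window_s:
            self.seen[fingerprint] = (now, 1)   # start a new window
            return True
        self.seen[fingerprint] = (first, count + 1)
        return False

def sample_success(drop_ratio: float = 0.5, rng=random.random) -> bool:
    """Return True if a 2xx log should be kept (50% by default)."""
    return rng() >= drop_ratio
```

When a window closes, the accumulated count would be attached to the emitted record as `error_count`; that bookkeeping is omitted here for brevity.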

Batching: Logs are batched in groups of up to 10,000, following best-practice recommendations for ClickHouse ingestion.
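Batching itself is straightforward; a minimal sketch of the grouping behavior:

```python
def batch(records, max_size=10_000):
    """Yield lists of at most max_size records for a single insert."""
    buf = []
    for r in records:
        buf.append(r)
        if len(buf) == max_size:
            yield buf
            buf = []
    if buf:  # flush the final partial batch
        yield buf
```

In practice the blueprint's batch processor would also flush on a timeout so a slow trickle of logs is not held indefinitely; that is omitted here.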

Customizing the Blueprint

To adjust the blueprint's behavior for your use case:

  • Disable Health Check Filtering: Remove the "Filter Health Checks" processor if you need complete visibility into all endpoint traffic

  • Preserve Trace IDs: Set the keep.trace.context attribute to "true" on logs that need trace correlation

  • Adjust Sampling Rate: Modify the sampling processor's drop_ratio parameter (currently 0.50) to change the percentage of 2xx logs retained

  • Extend Masking Rules: Add additional patterns to the redaction rules if your logs contain custom sensitive fields

Integration with ClickHouse

The blueprint's output is optimized for ClickHouse's OTLP logs schema. Ensure your ClickHouse exporter is configured with:

  • Batch size: 5000-10000 rows (matching the blueprint's batch processor)

  • Connection timeout: At least 10s to accommodate network latency

  • Retry logic enabled for transient failures

Logs flow through the blueprint with OpenTelemetry semantic conventions preserved, so standard attribute-based queries (by `http.method`, `http.status_code`, `service.name`, and so on) continue to work against the resulting tables.

Monitoring and Troubleshooting

Monitor the following metrics to ensure healthy operation:

  • Log Drop Rate: The health check filter should exclude 10-30% of traffic in typical deployments

  • Error Deduplication: Check that error logs are collapsing (watch error_count distribution)

  • Batch Sizes: Monitor average batch sizes to confirm logs are batching efficiently

If you notice excessive log loss, verify that critical endpoints aren't matching the health check filter pattern. Use the debug.keep="true" attribute to preserve specific logs for troubleshooting.

NOTE

This Blueprint has been tested against standard data patterns. You may need to adjust the configuration to match your specific data format.
