# Log Hygiene for Event Storage in ClickHouse

Using the [**Process Logs for ClickHouse**](https://app.bindplane.com/p/01J06XSD7F4KT3D0XDE2VQDAR5/blueprints) blueprint, you can establish a comprehensive pipeline that parses structured data, masks PII, eliminates high-cardinality noise, deduplicates records, and normalizes fields — ensuring clean, cost-effective data in ClickHouse.

### Overview

The ingestion hygiene blueprint implements a general-purpose log cleaning pipeline suitable for JSON logs from applications, microservices, and cloud platforms. It addresses the core challenges of log analytics:

* **Structured Data Extraction**: JSON log bodies are parsed into queryable attributes, transforming unstructured strings into analyzable fields.
* **Verbosity Filtering**: DEBUG and TRACE logs are filtered by default, with exemptions for development environments and logs explicitly marked for retention.
* **PII Protection**: Emails, IP addresses, phone numbers, credit cards, SSNs, and security credentials are automatically redacted, helping maintain compliance with data protection regulations.
* **Cardinality Management**: High-cardinality fields like request IDs, session IDs, and container identifiers are removed to prevent query performance degradation and excessive storage costs.
* **Deduplication**: Identical logs within 30-second windows are collapsed into single entries with count attributes, reducing log volume without losing signal about error frequency.

### Bindplane Configuration

To implement this blueprint in your Bindplane deployment:

1. Navigate to [Blueprints](https://app.bindplane.com/p/01J06XSD7F4KT3D0XDE2VQDAR5/blueprints) and choose the **Process Logs for ClickHouse** Blueprint. Save it to your Library.
2. Open the **Processors** section in any of your Bindplane configurations.
3. Modify the Blueprint to fit your specific dataset and requirements.
4. Validate the output by comparing a [Snapshot](https://docs.bindplane.com/feature-guides/snapshots) of the incoming data with the [Live Preview](https://docs.bindplane.com/feature-guides/live-preview) of the processed data, confirming that pre-processing produces the expected results.
5. Save your changes and roll out the configuration update to production.

#### Processing Pipeline

The blueprint executes the following steps in sequence:

**Parsing**: JSON-formatted log bodies are parsed into attributes. This converts flat JSON strings into a structured form that downstream processors can access field-by-field.
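
Conceptually, the parse step behaves like this minimal Python sketch (illustrative only — the blueprint uses a Bindplane parsing processor, not Python, and the flat `record` dict is an assumption for illustration):

```python
import json

def parse_json_body(record: dict) -> dict:
    """Parse a JSON-formatted log body into top-level attributes."""
    body = record.get("body", "")
    try:
        fields = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return record  # leave non-JSON bodies untouched
    if isinstance(fields, dict):
        # Promote each JSON key to a queryable attribute
        record.setdefault("attributes", {}).update(fields)
    return record

record = parse_json_body({"body": '{"level": "ERROR", "msg": "db timeout"}'})
```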

**Debug Filtering**: DEBUG and TRACE severity logs are excluded by default, preserving only INFO, WARN, ERROR, and FATAL levels. Exceptions are made for:

* Logs with `debug.keep="true"` attribute
* Services running in the `dev` deployment environment

This ensures development and troubleshooting logs don't bloat production analytics while allowing selective retention when needed.
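
The filtering rule above amounts to a simple keep/drop predicate, sketched here in Python (field names follow the attributes described in this document; the actual processor is configured in Bindplane):

```python
def keep_log(record: dict) -> bool:
    """Return True if the record survives debug filtering."""
    attrs = record.get("attributes", {})
    severity = record.get("severity_text", "INFO").upper()
    if severity not in ("DEBUG", "TRACE"):
        return True  # INFO, WARN, ERROR, FATAL always pass
    # Exemption 1: log explicitly marked for retention
    if attrs.get("debug.keep") == "true":
        return True
    # Exemption 2: service runs in the dev environment
    if attrs.get("deployment.environment") == "dev":
        return True
    return False
```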

**Sensitive Data Masking**: Multiple PII types are automatically redacted with asterisks:

* Email addresses, phone numbers (US and international), credit cards, and SSNs
* API keys, authorization headers, and bearer tokens
* Passwords, secrets, and credentials
* Session cookies and authentication tokens

Fields to be masked are identified by both pattern matching (e.g., `(?i)^(password|secret|token)$`) and semantic rules (e.g., credit card detection).
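
The two identification strategies can be sketched together in Python. The key pattern below comes from the example above; the email and SSN patterns are simplified stand-ins for the blueprint's actual rules:

```python
import re

# Key-based rule: the field *name* marks it sensitive (from the example above)
SENSITIVE_KEY = re.compile(r"(?i)^(password|secret|token)$")
# Value-based rules: simplified stand-ins for the blueprint's detectors
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_attributes(attrs: dict) -> dict:
    masked = {}
    for key, value in attrs.items():
        if SENSITIVE_KEY.match(key):
            masked[key] = "****"  # redact the whole value
        elif isinstance(value, str):
            value = EMAIL.sub("****", value)  # redact PII embedded in values
            masked[key] = SSN.sub("****", value)
        else:
            masked[key] = value
    return masked
```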

**High-Cardinality Removal**: Fields that create cardinality explosion are deleted:

* Request/correlation/transaction IDs
* Session IDs and user identifiers
* Device and container IDs
* Client IP addresses and browser user agent strings
* Kubernetes pod UIDs and process IDs

Removing these fields prevents ClickHouse's map columns from fragmenting into thousands of unique keys, which degrades query performance and inflates storage.
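
In effect this step is an attribute denylist. A minimal sketch (the key names below are hypothetical examples mirroring the categories listed above, not the blueprint's exact list):

```python
# Hypothetical denylist mirroring the categories listed above
HIGH_CARDINALITY_KEYS = {
    "request_id", "correlation_id", "transaction_id",
    "session_id", "user_id", "device_id", "container_id",
    "client_ip", "user_agent", "k8s.pod.uid", "process.pid",
}

def drop_high_cardinality(attrs: dict) -> dict:
    """Remove attributes that would fragment ClickHouse map columns."""
    return {k: v for k, v in attrs.items() if k not in HIGH_CARDINALITY_KEYS}
```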

**Orphan Trace ID Cleanup**: Trace IDs and span IDs are removed when `keep.trace.context != "true"`, preventing unused distributed tracing data from creating unnecessary cardinality.
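
The opt-out condition can be sketched as (illustrative; the flat record layout is an assumption):

```python
def clean_trace_context(record: dict) -> dict:
    """Drop trace/span IDs unless the log opts in to keeping them."""
    attrs = record.get("attributes", {})
    if attrs.get("keep.trace.context") != "true":
        record.pop("trace_id", None)
        record.pop("span_id", None)
    return record
```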

**Field Normalization**: Standard log fields receive defaults when missing:

* `severity_text` defaults to `INFO`
* `service.name` defaults to `unknown`
* `deployment.environment` defaults to `production`

This ensures consistent schema across heterogeneous log sources.
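
The normalization step is equivalent to filling defaults for missing fields, sketched here with the three defaults listed above (the flat record layout is an illustration, not the actual telemetry model):

```python
DEFAULTS = {
    "severity_text": "INFO",
    "service.name": "unknown",
    "deployment.environment": "production",
}

def normalize(record: dict) -> dict:
    """Fill standard fields with defaults when missing; never overwrite."""
    for field, default in DEFAULTS.items():
        record.setdefault(field, default)
    return record
```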

**Deduplication**: Identical logs (matching body and key attributes) are collapsed within 30-second windows. Duplicate occurrences are recorded in a `dedup_count` field, allowing ClickHouse to track repetition frequency without storing redundant entries.
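
One simple way to implement windowed deduplication is sketched below (the window logic and key fields are assumptions for illustration; the blueprint's processor may window differently):

```python
def deduplicate(records, window_seconds=30, key_fields=("body", "service.name")):
    """Collapse identical records within a time window, counting duplicates."""
    out = []
    seen = {}  # dedup key -> index in `out` of the surviving record
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen and rec["timestamp"] - out[seen[key]]["timestamp"] < window_seconds:
            out[seen[key]]["dedup_count"] += 1  # collapse into the survivor
        else:
            rec = dict(rec, dedup_count=1)  # start a new window
            seen[key] = len(out)
            out.append(rec)
    return out
```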

**Batching**: Logs are batched in groups up to 10,000 for optimized ClickHouse insert performance.
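
The batching behavior amounts to chunking the stream, sketched as (illustrative; the blueprint uses a batch processor, not application code):

```python
def batch(records, max_size=10_000):
    """Yield records in chunks for bulk ClickHouse inserts."""
    for i in range(0, len(records), max_size):
        yield records[i:i + max_size]
```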

### Customizing the Blueprint

Adjust the pipeline for your specific needs:

* **Preserve Debug Logs**: Set `debug.keep="true"` on logs you want retained despite DEBUG severity
* **Enable Development Logs**: Logs from services with `deployment.environment="dev"` automatically bypass debug filtering
* **Adjust Dedup Window**: Change the `interval` parameter (currently 30 seconds) to tune sensitivity to repeated errors
* **Extend Sensitive Patterns**: Add custom regex patterns to the redaction rules for domain-specific secrets
* **Preserve Trace Context**: Set `keep.trace.context="true"` on logs that need distributed tracing correlation

### Working with Deduplicated Logs

When querying deduplicated logs in ClickHouse, account for the `dedup_count` field:

```sql
SELECT body, sum(dedup_count) AS total_occurrences
FROM otel_logs
WHERE severity_number >= 13
GROUP BY body
ORDER BY total_occurrences DESC
```

This query totals repeated warnings and errors (severity_number 13 and above): because each stored row carries a `dedup_count` of the duplicates collapsed within its 30-second window, summing it per unique body recovers the original occurrence count.

### Integration with ClickHouse

The blueprint produces logs compatible with ClickHouse's OTLP schema. Configure your ClickHouse exporter with:

* Batch size: 5000-10000 (matching the blueprint's batch processor)
* Timeout: 5+ seconds to allow batching
* Retry mechanism: Enabled for reliability

The combination of deduplication and batching significantly reduces ingestion volume while preserving observability.

### Monitoring Pipeline Health

Track these metrics to ensure the hygiene pipeline is functioning correctly:

* **Debug Log Drop Rate**: Should be 30-60% depending on log verbosity in your services
* **PII Masking Rate**: Monitor redacted field occurrences to validate sensitive data handling
* **Dedup Effectiveness**: Check the `dedup_count` distribution to confirm duplicate collapsing is working
* **Cardinality Reduction**: Compare attribute counts before/after the pipeline to measure high-cardinality field removal

{% hint style="info" %}
**NOTE**

This Blueprint has been tested against standard data patterns. You may need to adjust the configuration to match your specific data format.
{% endhint %}
