Log Hygiene for Event Storage in ClickHouse
Maintaining log quality at scale requires systematic preprocessing before data reaches your analytics platform.
Using the Process Logs for ClickHouse blueprint, you can establish a comprehensive pipeline that parses structured data, masks PII, eliminates high-cardinality noise, deduplicates records, and normalizes fields — ensuring clean, cost-effective data in ClickHouse.
Overview
The ingestion hygiene blueprint implements a general-purpose log cleaning pipeline suitable for JSON logs from applications, microservices, and cloud platforms. It addresses the core challenges of log analytics:
Structured Data Extraction: JSON log bodies are parsed into queryable attributes, transforming unstructured strings into analyzable fields.
Verbosity Filtering: DEBUG and TRACE logs are filtered by default, with exemptions for development environments and logs explicitly marked for retention.
PII Protection: Emails, IP addresses, phone numbers, credit cards, SSNs, and security credentials are automatically redacted, helping maintain compliance with data protection regulations.
Cardinality Management: High-cardinality fields like request IDs, session IDs, and container identifiers are removed to prevent query performance degradation and excessive storage costs.
Deduplication: Identical logs within 30-second windows are collapsed into single entries with count attributes, reducing log volume without losing signal about error frequency.
Bindplane Configuration
To implement this blueprint in your Bindplane deployment:
Navigate to Blueprints and choose the Process Logs for ClickHouse Blueprint. Save it to your Library.
Open the Processors section in any of your Bindplane configurations.
Modify the Blueprint to fit your specific dataset and requirements.
Verify the data is formatted correctly by comparing the Snapshot to the Live Preview and confirming that the pre-processed output looks right.
Save your changes and roll out the configuration update to production.
Processing Pipeline
The blueprint executes the following steps in sequence:
Parsing: JSON-formatted log bodies are parsed into attributes. This converts flat JSON strings into a structured form that downstream processors can access field-by-field.
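As a rough sketch of what this step does (illustrative Python only — the blueprint performs this inside the Bindplane pipeline, not in application code), parsing promotes top-level JSON body fields into attributes:

```python
import json

def parse_json_body(record: dict) -> dict:
    """Promote fields from a JSON-formatted log body into attributes.

    Illustrative sketch of the parsing step; non-JSON bodies pass
    through untouched.
    """
    try:
        fields = json.loads(record.get("body", ""))
    except (json.JSONDecodeError, TypeError):
        return record  # leave non-JSON bodies as-is
    if isinstance(fields, dict):
        # Each top-level JSON field becomes a queryable attribute.
        record.setdefault("attributes", {}).update(fields)
    return record

record = {"body": '{"level": "error", "msg": "timeout"}'}
parsed = parse_json_body(record)
# parsed["attributes"]["level"] == "error"
```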
Debug Filtering: DEBUG and TRACE severity logs are excluded by default, preserving only INFO, WARN, ERROR, and FATAL levels. Exceptions are made for:
Logs with the debug.keep="true" attribute
Services running in the dev deployment environment
This ensures development and troubleshooting logs don't bloat production analytics while allowing selective retention when needed.
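The filtering rule above can be sketched as a simple predicate (illustrative Python; the debug.keep and deployment.environment attribute names come from the blueprint):

```python
def keep_log(record: dict) -> bool:
    """Return True if the log should be retained by the debug filter.

    Sketch of the rule described above, not the blueprint's actual
    implementation.
    """
    attrs = record.get("attributes", {})
    severity = record.get("severity_text", "INFO").upper()
    if severity not in ("DEBUG", "TRACE"):
        return True  # INFO/WARN/ERROR/FATAL always pass
    # Exemptions: explicitly marked logs, or dev environments.
    if attrs.get("debug.keep") == "true":
        return True
    if attrs.get("deployment.environment") == "dev":
        return True
    return False
```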
Sensitive Data Masking: Multiple PII types are automatically redacted with asterisks:
Email addresses, phone numbers (US and international), credit cards, and SSNs
API keys, authorization headers, and bearer tokens
Passwords, secrets, and credentials
Session cookies and authentication tokens
Fields to be masked are identified by both pattern matching (e.g., (?i)^(password|secret|token)$) and semantic rules (e.g., credit card detection).
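A minimal sketch of this dual approach — key-name patterns plus value patterns (the regexes here are simplified stand-ins for the blueprint's fuller rule set):

```python
import re

# Simplified, illustrative patterns; the blueprint ships a broader set.
SENSITIVE_KEY = re.compile(r"(?i)^(password|secret|token|api[_-]?key)$")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(attrs: dict) -> dict:
    masked = {}
    for key, value in attrs.items():
        if SENSITIVE_KEY.match(key):
            masked[key] = "****"  # key name matched a sensitive pattern
        elif isinstance(value, str):
            masked[key] = EMAIL.sub("****", value)  # value matched a PII pattern
        else:
            masked[key] = value
    return masked
```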
High-Cardinality Removal: Fields that create cardinality explosion are deleted:
Request/correlation/transaction IDs
Session IDs and user identifiers
Device and container IDs
Client IP addresses and browser user agent strings
Kubernetes pod UIDs and process IDs
Removing these fields prevents ClickHouse's map columns from fragmenting into thousands of unique keys, which degrades query performance and inflates storage.
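Conceptually, this step is an attribute filter over a deny-list. The field names below are illustrative; the blueprint's actual list is configurable:

```python
# Illustrative deny-list of high-cardinality attribute keys.
HIGH_CARDINALITY_KEYS = {
    "request_id", "correlation_id", "transaction_id",
    "session_id", "user_id", "device_id", "container_id",
    "client_ip", "user_agent", "k8s.pod.uid", "process.pid",
}

def drop_high_cardinality(attrs: dict) -> dict:
    """Remove attributes whose values explode cardinality."""
    return {k: v for k, v in attrs.items() if k not in HIGH_CARDINALITY_KEYS}
```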
Orphan Trace ID Cleanup: Trace IDs and span IDs are removed when keep.trace.context != "true", preventing unused distributed tracing data from creating unnecessary cardinality.
Field Normalization: Standard log fields receive defaults when missing:
severity_text defaults to INFO
service.name defaults to unknown
deployment.environment defaults to production
This ensures consistent schema across heterogeneous log sources.
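The defaulting logic amounts to the following sketch (in the real pipeline, severity is a record field while service.name and deployment.environment are resource attributes; they are flattened here for illustration):

```python
# Defaults applied only when a field is absent.
DEFAULTS = {
    "severity_text": "INFO",
    "service.name": "unknown",
    "deployment.environment": "production",
}

def normalize(record: dict) -> dict:
    for field, default in DEFAULTS.items():
        record.setdefault(field, default)  # existing values are never overwritten
    return record
```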
Deduplication: Identical logs (matching body and key attributes) are collapsed within 30-second windows. Duplicate occurrences are recorded in a dedup_count field, allowing ClickHouse to track repetition frequency without storing redundant entries.
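Windowed deduplication can be sketched as follows (illustrative Python; timestamps are in seconds and the fingerprint is the body plus sorted attributes):

```python
def deduplicate(logs, window_seconds=30):
    """Collapse identical logs within a time window.

    Sketch of the dedup step: each surviving entry carries a
    dedup_count attribute recording how many copies it absorbed.
    Each log is a dict with 'timestamp', 'body', and 'attributes'.
    """
    out = []
    seen = {}  # fingerprint -> index into out
    for log in sorted(logs, key=lambda l: l["timestamp"]):
        key = (log["body"], tuple(sorted(log["attributes"].items())))
        if key in seen:
            kept = out[seen[key]]
            if log["timestamp"] - kept["timestamp"] <= window_seconds:
                # Duplicate inside the window: bump the counter, drop the copy.
                kept["attributes"]["dedup_count"] = (
                    kept["attributes"].get("dedup_count", 1) + 1
                )
                continue
        # New unique log, or window expired: start a fresh entry.
        seen[key] = len(out)
        out.append({**log, "attributes": dict(log["attributes"])})
    return out
```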
Batching: Logs are batched in groups up to 10,000 for optimized ClickHouse insert performance.
Customizing the Blueprint
Adjust the pipeline for your specific needs:
Preserve Debug Logs: Set debug.keep="true" on logs you want retained despite DEBUG severity
Enable Development Logs: Logs from services with deployment.environment="dev" automatically bypass debug filtering
Adjust Dedup Window: Change the interval parameter (currently 30 seconds) to tune sensitivity to repeated errors
Extend Sensitive Patterns: Add custom regex patterns to the redaction rules for domain-specific secrets
Preserve Trace Context: Set keep.trace.context="true" on logs that need distributed tracing correlation
Working with Deduplicated Logs
When querying deduplicated logs in ClickHouse, account for the dedup_count field:
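A query along these lines works against the default otel_logs table created by the OpenTelemetry ClickHouse exporter (the table name, column names, and dedup_count location in LogAttributes are assumptions — adjust to your schema):

```sql
-- Repeated warnings/errors over the last hour, weighted by dedup_count.
-- Logs without a dedup_count attribute count as a single occurrence.
SELECT
    SeverityText,
    Body,
    sum(toUInt64OrDefault(LogAttributes['dedup_count'], toUInt64(1))) AS occurrences
FROM otel_logs
WHERE SeverityText IN ('WARN', 'ERROR', 'FATAL')
  AND Timestamp > now() - INTERVAL 1 HOUR
GROUP BY SeverityText, Body
ORDER BY occurrences DESC
LIMIT 20;
```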
Grouping by log body and summing dedup_count surfaces repeated warnings and errors, since the field records how many occurrences of each unique log were collapsed within a 30-second window.
Integration with ClickHouse
The blueprint produces logs compatible with ClickHouse's OTLP schema. Configure your ClickHouse exporter with:
Batch size: 5000-10000 (matching the blueprint's batch processor)
Timeout: 5+ seconds to allow batching
Retry mechanism: Enabled for reliability
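A sketch of these settings for the OpenTelemetry Collector's ClickHouse exporter (option names assume the upstream clickhouse exporter; verify them against your Bindplane destination configuration):

```yaml
# Illustrative exporter settings; endpoint and table name are placeholders.
exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000
    logs_table_name: otel_logs
    timeout: 10s              # 5+ seconds leaves room for batching
    retry_on_failure:
      enabled: true           # retry transient insert failures
      initial_interval: 5s
      max_interval: 30s
```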
The combination of deduplication and batching significantly reduces ingestion volume while preserving observability.
Monitoring Pipeline Health
Track these metrics to ensure the hygiene pipeline is functioning correctly:
Debug Log Drop Rate: Should be 30-60% depending on log verbosity in your services
PII Masking Rate: Monitor redacted field occurrences to validate sensitive data handling
Dedup Effectiveness: Check the dedup_count distribution to confirm duplicate collapsing is working
Cardinality Reduction: Compare attribute counts before/after the pipeline to measure high-cardinality field removal
NOTE
This Blueprint has been tested against standard data patterns. You may need to adjust the configuration to match your specific data format.