Edge Sidecar

This page covers reliability, resource requirements, failure modes, and pre-production testing procedures for all Edge Sidecar apps. For product-specific questions about Edge Reporter, Edge Regulator, and Edge Optimizer, see the FAQs linked below.

For detailed questions about each product:

  • Edge Reporter — Cost analysis, reduction projections, ROI calculations before deploying optimization

  • Edge Regulator — Event sampling and budget enforcement based on log severity and type

  • Edge Optimizer — Lossless log optimization via compact event format and search-time expansion

Resource Requirements

What are the resource requirements

The 10x Engine runs on the JVM (HotSpot or GraalVM Native Image) at the edge as a sidecar. You control both CPU and memory:

  • Memory: Set via -Xmx (e.g., -Xmx512m). The JVM heap won't exceed this ceiling. The JVM also allocates memory for metaspace, thread stacks, and JIT code cache — budget ~2x the heap for the full JVM footprint (e.g., ~1 GB for -Xmx512m). For forwarders that run 10x as a sidecar container (OTel Collector, Logstash), set resources.limits.memory: 1Gi on the 10x container. For forwarders that embed 10x (Fluentd, Fluent Bit, Filebeat), add ~1 GB to the forwarder container's memory limit
  • CPU: Set via threadPoolSize — a fixed thread count (e.g., 2) or a fraction of available cores (e.g., 0.25 = 25%)
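As a sketch, here is how those two limits might look in a DaemonSet container spec. The container name, image, and the JAVA_OPTS mechanism for passing -Xmx are illustrative assumptions; check your chart's values for the real keys:

```yaml
# Illustrative sidecar container spec (names and env mechanism are assumptions)
containers:
  - name: tenx-sidecar
    image: log10x/edge-optimizer:latest   # hypothetical image name
    env:
      - name: JAVA_OPTS
        value: "-Xmx512m"                 # JVM heap ceiling
    resources:
      limits:
        memory: 1Gi                       # ~2x heap for the full JVM footprint
        cpu: "2"
```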

A single node with a 512 MB heap and 2 threads handles 100+ GB/day. Per-event processing is sub-millisecond: template matching is a hash lookup against cached hidden classes, not regex evaluation or JSON parsing. Processing is in-memory with no disk I/O. Backpressure throttles input if the pipeline approaches its resource limit.

| Your daily volume | Nodes (2 threads each) | Headroom |
|---|---|---|
| 1 TB/day | 10 nodes | 2x per node |
| 5 TB/day | 50 nodes | 2x per node |
| 15 TB/day | 150 nodes | 2x per node |

Both values map directly to standard Kubernetes resource specs in your DaemonSet manifest. The engine communicates with your log forwarder via IPC (inter-process communication) — no network hop, no config changes to your existing forwarder.

What are the per-node capacity limits

The engine scales linearly with memory and CPU. Plan your resources based on daily log volume and desired headroom:

| Daily volume | Recommended config | Optimization latency | Concurrent events |
|---|---|---|---|
| <10 GB | 256 MB heap, 1 thread | <1 ms/batch | 10–100 |
| 10–50 GB | 512 MB heap, 2 threads | 1–2 ms/batch | 100–1K |
| 50–100 GB | 1 GB heap, 4 threads | 2–5 ms/batch | 1K–10K |
| 100+ GB | 2 GB heap, 8 threads | 5–10 ms/batch | 10K+ |

Scaling beyond 150 GB/day: Add more nodes (scale horizontally). Each node's sidecar operates independently — no state sharing, no coordination overhead. A 300 GB/day environment requires ~3 nodes at 100 GB/day capacity each.
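The sizing above is plain ceiling division; a quick sketch using the 300 GB/day example:

```shell
# Nodes needed = ceil(daily_volume / per_node_capacity)
daily_gb=300        # example from the text: 300 GB/day environment
per_node_gb=100     # per-node capacity at the 100 GB/day config
nodes=$(( (daily_gb + per_node_gb - 1) / per_node_gb ))
echo "$nodes nodes"
```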

The reduction ratio remains constant across volume scales (50–70% for typical application logs); it does not degrade as volume increases.

How do I know if I need to scale the sidecar

Signs you need more resources:

  • Sidecar CPU consistently >80% of its limit
  • Sidecar memory consistently >90% of its limit
  • Optimization latency increases noticeably (captured in metrics)
  • Queue depth growing (unprocessed events accumulating)

How to measure:

  1. Enable metrics export: metrics.export.prometheus: true in config
  2. Query Prometheus: histogram_quantile(0.99, sum by (le) (rate(engine_process_duration_ms_bucket[5m])))
  3. If p99 latency >50ms, consider scaling
  4. Check for backpressure triggers in logs: "input throttled" or "queue full"
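Steps 2–3 can be scripted against the standard Prometheus HTTP API. The endpoint URL is an assumption, and the metric name follows this page's naming (Prometheus histograms expose a _bucket series for histogram_quantile):

```shell
# Query p99 optimization latency via the Prometheus HTTP API
PROM_URL="${PROM_URL:-http://prometheus:9090}"   # assumption: your Prometheus endpoint
QUERY='histogram_quantile(0.99, sum by (le) (rate(engine_process_duration_ms_bucket[5m])))'
# -f fails on HTTP errors; falls through to the message if Prometheus is unreachable
curl -fsG --max-time 2 "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  || echo "Prometheus not reachable at $PROM_URL"
```

Compare the returned value against the 50 ms scaling threshold from step 3.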

How to scale:

  • Horizontal (recommended): Add more nodes to the cluster; the DaemonSet automatically deploys the sidecar to each new node
  • Vertical: Increase heap/thread limits on existing nodes — requires rolling restart
  • Best practice: Horizontal scaling provides better fault isolation and easier operational management

Performance Characteristics

What reduction ratios should I expect

Depends on your log mix. Typical ranges:

| Scenario | Reduction |
|---|---|
| K8s and OTel workloads | 50–65% |
| Highly structured events | up to 8x (87.5%) |
| Grouped instances (stack traces) | >90% |
| Edge Optimizer + Storage Streamer combined | 80%+ |

Example: A Kubernetes pod_workers.go event at 1,835 bytes raw becomes 662 bytes compact — 64% reduction. A verbose OTel log at 4,265 bytes becomes 520 bytes — 88% reduction.

Run Dev on your own log files to measure your actual ratio — free, no account needed.

What is the search-time overhead for compact events

Depends on your log platform:

  • Splunk: The open-source 10x for Splunk app expands compact events transparently at search time. A one-time template resolution (~0.5–2s per search) matches search terms against the template index. Per-event expansion uses a KV Store primary-key lookup and native SPL functions — negligible overhead. Queries, dashboards, and alerts work unchanged
  • Elasticsearch (self-hosted): The L1ES Lucene plugin expands compact events during search at ~1.25x search time. 50% less indexed volume offsets the expansion cost — fewer data nodes, less SSD, lower compute. For managed Elasticsearch (Elastic Cloud, OpenSearch Service), Storage Streamer expands and re-indexes on-demand
  • Datadog / CloudWatch: Edge Regulator sends events in standard format — no expansion needed. For events routed to S3 via Edge Optimizer, Storage Streamer expands and streams to your platform on-demand

How does the engine achieve high throughput

Per-event processing is sub-millisecond. Each event is matched to a cached TenXTemplate via hash lookup — not parsed from scratch with regex. The AOT compiler scans source code and containers to generate symbol libraries. At runtime, the engine uses those symbols to dynamically assign cached hidden classes — one per event type. Structure is resolved once per type, not per instance.

Scale throughput by adding threads via threadPoolSize — a fixed count (e.g., 4) or a fraction of cores (e.g., 0.5 = 50%). Events are batched (1,000 per batch, 2s flush interval) and distributed across the thread pool.
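As a sizing sketch, the thread count falls out of the per-thread baseline of roughly 86 GB/day quoted in the throughput section of this page (the daily volume below is a hypothetical example; adjust to your measured rate):

```shell
# threadPoolSize sizing: ceil(daily_gb / 86) at ~86 GB/day per thread
daily_gb=340
per_thread_gb=86
threads=$(( (daily_gb + per_thread_gb - 1) / per_thread_gb ))
echo "threadPoolSize: $threads"
```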

The engine runs on the JVM (HotSpot or GraalVM Native Image) at the edge or in the cloud. Memory is explicitly capped via -Xmx — no runaway consumption. No SDKs, no bytecode injection, no application overhead. Application code runs exactly as written.

What happens during traffic spikes

Backpressure throttles input when the pipeline approaches its resource limit:

  • Interval byte limit — cap bytes read per time window (e.g., 10 MB/min)
  • Total byte/event limit — cap total volume from a source
  • Total duration — cap how long an input reads

Design: Backpressure defers buffering to the forwarder

By design, the sidecar doesn't buffer events itself — that job belongs to your forwarder, which is purpose-built for it:

  • Sidecar queue fills → input read blocks (the Unix socket/pipe)
  • The OS socket/pipe buffer fills up → the forwarder detects the blocked pipe
  • Forwarder's buffering takes over (Fluentd file buffer, Fluent Bit mmap, OTel file_storage, Logstash persistent queue)
  • Once the sidecar queue drains and load decreases, the forwarder drains its buffer
  • No crashes, no heap overflow, no data loss

This approach is simpler, more reliable, and avoids duplicate buffering logic. Combined with fail-open design, log delivery is never interrupted.

Failure Modes

Is 10x highly available

No single point of failure. Edge apps run as DaemonSets — one sidecar per node. Each sidecar operates independently with no shared state or coordination between nodes. A failure on one node has no effect on any other node.

  • Node-level isolation: Each sidecar processes only its own node's logs. No cross-node dependencies, no leader election, no quorum
  • Fail-open: If a sidecar fails, the forwarder on that node continues sending logs at full volume. You temporarily lose cost savings on that node — not observability
  • Automatic recovery: Standard Kubernetes restart policies and health probes restart failed sidecars without manual intervention
  • Cloud apps: Cloud Reporter and Storage Streamer run as standard pods — scale replicas for redundancy. Both are read-only and async to the data flow

What happens if the Edge sidecar fails

Fail-open design with forwarder-specific recovery:

  • During sidecar crash/OOM: Logs bypass optimization and flow at full volume directly from the forwarder to your destination (Splunk, Datadog, Elasticsearch, CloudWatch)
  • Data loss: NO — fail-open design preserves all logs. You temporarily lose the cost benefit on that node only
  • Cost impact: During the outage, that node's logs ship unoptimized. Once sidecar restarts, normal processing resumes

Recovery behavior depends on your forwarder:

| Forwarder | Recovery | Latency | Manual action |
|---|---|---|---|
| Fluentd | Auto-respawn (built-in exec_filter) | <10s | None |
| Fluent Bit | Auto-respawn via Lua script (up to 10 retries) | 0-50s | None (transient) / check logs (persistent) |
| Filebeat | Auto-respawn via supervisor script + K8s probe | <10s | None (unless failures persist) |
| OTel Collector | Auto-restart via K8s liveness probe | <30s | None |

For Fluentd & Filebeat (recommended): The forwarder automatically restarts the sidecar. You'll see a brief spike in unoptimized logs (10-30 seconds), then normal processing resumes.

For Fluent Bit: The Lua-based supervisor respawns 10x up to 10 times with exponential backoff. For transient crashes, recovery is automatic. For persistent crashes, check TENX_RUN_ARGS and logs after 50 seconds.

For OTel Collector: The 10x sidecar is a separate container in the same pod. A Kubernetes liveness probe monitors the sidecar and automatically restarts it if it crashes. OTel buffers events while 10x is down. Once the sidecar restarts, OTel drains the queue automatically. By design, you don't manage the restart — the cluster does.

Why fail-open? If the sidecar fails, your forwarder continues operating normally. No logs are lost, and observability continues unchanged — you just lose cost savings temporarily on that node.

Monitoring: Configure alerts for:

  • pod_restart_count > 0 — tracks sidecar restarts
  • Sidecar liveness probe failures — K8s restarts the pod after 2 consecutive failures
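With the Prometheus Operator, those alerts can be sketched as a PrometheusRule. The metric name follows this page's naming; substitute whatever your exporter actually emits:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: log10x-sidecar-alerts        # illustrative name
spec:
  groups:
    - name: log10x-sidecar
      rules:
        - alert: SidecarRestarted
          # pod_restart_count is this page's metric name; with kube-state-metrics
          # the equivalent is kube_pod_container_status_restarts_total
          expr: pod_restart_count > 0
          for: 1m
          labels:
            severity: warning
        # Liveness probe failures end in a restart, so the alert above also
        # catches the "2 consecutive probe failures" case described here
```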

What happens if the sidecar can't keep up with log volume

When the sidecar can't keep up, the forwarder buffers events to disk:

  • Scenario: Log volume spikes beyond the sidecar's capacity (e.g., 100+ GB/day on a 512 MB, 2-thread node)
  • What happens: Sidecar queue fills → sidecar stops reading from the forwarder → forwarder detects this and buffers events to disk instead
  • Forwarder buffering: Each forwarder handles this natively (Fluentd file buffer, Fluent Bit mmap storage, OTel file_storage, Logstash persistent queue)
  • Recovery: When the sidecar catches up and queue drains, the forwarder drains its buffer. All events are processed in order
  • Data loss: NO — all events are queued or delivered

Prevention: Monitor sidecar CPU/memory usage. If it is consistently >90%, either:

  • Scale horizontally by adding more nodes (the DaemonSet auto-deploys the sidecar)
  • Scale vertically by increasing heap/thread limits
  • Reduce noisy log volume using Edge Regulator

Note: Backpressure is a protective feature, not a failure — it prevents cascade effects and protects downstream systems by letting the forwarder's buffering absorb the spike.

What if the forwarder crashes while the sidecar is running

Behavior depends on your deployment model — both safe:

Embedded sidecars (Fluentd, Fluent Bit, Logstash, Filebeat):

  • The 10x sidecar is spawned as a child process by the forwarder, so when the forwarder crashes, the sidecar goes down with it
  • When the forwarder restarts, it automatically respawns 10x
  • Recovery times (same as when 10x itself crashes): Fluentd <10s (built-in respawn); Fluent Bit 0-50s (Lua supervisor with backoff); Filebeat <10s (supervisor script — the disk queue must be enabled in Filebeat config to buffer events while 10x is down); Logstash <10s (pipe output auto-respawn)
  • Data loss: NO — your forwarder's buffer holds all events until 10x resumes
  • Cost impact: During downtime, logs ship unoptimized at full volume. Once 10x restarts, normal processing resumes

Standalone sidecar (OTel Collector):

  • The 10x sidecar is a separate container listening on a Unix socket; it waits passively for OTel to reconnect
  • OTel buffers events locally (configurable limit; 1 GB typically holds hours of data) while 10x is paused
  • Recovery: Automatic when OTel restarts
  • Data loss: NO — OTel's buffer preserves all events

In all cases: Your forwarder handles all input/output resilience (buffering, retries, batching). The sidecar just optimizes data in-memory and doesn't maintain external network connections.

What if network connectivity is interrupted

Two different paths — both safe:

  • Sidecar ↔ Forwarder (local IPC): Both run on the same node. They communicate via local Unix sockets (or in-process if embedded). Network interruptions have NO effect on this path
  • Forwarder ↔ SIEM (network outage): This is handled entirely by your forwarder (Fluentd, Filebeat, OTel Collector, etc.), not by 10x. Your forwarder's buffering, retries, and failover logic apply as normal

The 10x sidecar doesn't maintain external network connections. Log optimization happens locally and independently of network state. Network resilience is the responsibility of your forwarder and destination.

How does Filebeat handle sidecar crashes (Kubernetes)

Filebeat uses a supervisor script with Kubernetes liveness probe:

Filebeat is unique because it doesn't have native subprocess respawn like Fluentd. To safely run Filebeat with 10x in Kubernetes, we use a two-level recovery mechanism:

Level 1: Supervisor script (in container)

  • When 10x crashes, the pipe breaks
  • The supervisor script detects this and respawns 10x automatically
  • Respawn happens in <5 seconds
  • Filebeat's output is buffered, so logs don't get blocked

Level 2: Kubernetes liveness probe (at pod level)

  • Every 5 seconds, K8s checks that the 10x process is running: pgrep -f '10x run'
  • It also checks the Filebeat HTTP endpoint: curl 127.0.0.1:5066
  • If 10x is missing for 2 consecutive checks (10 seconds), K8s restarts the pod
  • This acts as a safety net for persistent issues
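The Level 2 behavior maps onto standard Kubernetes probe fields. A sketch consistent with the 5-second interval and 2-failure threshold described above (the exact commands in your container are assumptions):

```yaml
# Liveness probe: restart the pod if the 10x process disappears (2 checks = 10s)
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pgrep -f '10x run' && curl -fs 127.0.0.1:5066   # process + Filebeat HTTP check
  periodSeconds: 5
  failureThreshold: 2
```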

Recovery timeline:

  • T=0s: 10x crashes
  • T<5s: The supervisor detects the broken pipe and respawns 10x
  • T=5-10s: The new 10x JVM warms up
  • T=10s: Ready; processing resumes
  • Total: ~10 seconds

If respawn fails repeatedly:

  • The supervisor keeps retrying (5s delay between attempts)
  • The K8s liveness probe detects the pattern after 2 failures
  • K8s kills and restarts the entire pod
  • Logs buffer during the restart and resume once the pod starts

Monitoring:

  • Watch kubectl describe pod <filebeat-pod> for LastProbeTime on the liveness probe
  • Check pod logs for supervisor respawn messages
  • Alert on pod_restart_count > 0 to catch persistent issues

This approach matches Fluentd's safety (automatic respawn) while working within Filebeat's architectural constraints (single output pipe).

What happens during other failures

Each scenario has independent recovery:

| Failure | Behavior |
|---|---|
| Node restart | Standard DaemonSet scheduling — sidecar starts with the node |
| Symbol library unavailable | Sidecar uses its last-fetched library version. JIT creates templates at runtime for unmatched events |
| Output destination down | Your forwarder handles output buffering and retries — 10x does not change this behavior |
| License server unreachable | Sidecar continues operating with cached license. No interruption to processing |
| Cloud Reporter/Streamer unavailable | Cloud apps are read-only and async to the data flow — failures don't block edge log delivery |

Monitoring: Use Prometheus metrics and Grafana dashboards with alerts to detect failures and coordinate recovery. DaemonSet restart policies and health probes ensure automatic restart without manual intervention.

What if the downstream SIEM is slow or unreachable

The sidecar is upstream of the SIEM — it outputs to the forwarder via local IPC, not directly to the SIEM. A slow or unreachable SIEM is your forwarder's concern. Fluentd, Fluent Bit, OTel Collector, and the others all have their own output buffering, retries, and backoff for this scenario. The sidecar is unaffected.

Fail-open still applies: if the sidecar itself goes down while the SIEM is slow, logs bypass optimization and flow directly from the forwarder to the destination — no logs lost.

Operations & Deployment

How do I roll back

Remove the DaemonSet. Logs flow directly from your forwarder to your analytics tool — no 10x in the path. Rollback is a single Helm command:

helm uninstall my-edge-optimizer --namespace logging

Or roll back to a previous chart version:

helm rollback my-edge-optimizer --namespace logging

  • Blast radius: Only the 10x sidecar is removed. Your forwarder (Fluentd, Fluent Bit, OTel, Filebeat) continues running unchanged
  • Previously stored data: Compact data already indexed in your analyzer (Splunk, Elasticsearch) remains searchable via the expansion plugin or Storage Streamer. Removing 10x does not affect previously indexed data
  • Logs during rollback: Once the sidecar is removed, logs ship at full volume to your analyzer
  • No data migration needed — all logs stay in your infrastructure

How do I upgrade in production

Standard Helm upgrade with rolling updates. Kubernetes updates one DaemonSet pod at a time — no cluster-wide downtime:

helm repo update
helm upgrade my-edge-optimizer log10x-fluent/fluent-bit \
  -f my-edge-optimizer.yaml \
  --namespace logging

  • Rolling update: Each node's pod restarts individually. While a pod restarts, that node's logs are buffered by the forwarder or ship unoptimized (fail-open) until the sidecar recovers
  • Symbol library updates: Distributed via GitOps. Edge and cloud apps pull updates automatically at a configurable interval — no restart needed
  • Configuration changes: Managed via GitOps or Helm values. Changes propagate to running pods without manual per-node intervention

Delivery Guarantees

Can events be processed twice? What are the ordering guarantees

At-least-once delivery. If the forwarder sends an event twice — for example, due to a retry after a sidecar restart — it will be processed twice. The sidecar has no deduplication.

Ordering is preserved. Events are processed in the order they arrive via stdin or Unix socket. The sidecar preserves forwarder-side ordering — no events are reordered.

What happens to templates when the sidecar restarts

Templates are rebuilt from the AOT symbol library on startup — nothing is persisted across restarts. Events that arrive before the library finishes loading are handled by the JIT engine, which builds templates at runtime from the events themselves.

Product Safety Features

Does Log10x have a shadow/pass-through mode for safe evaluation

Edge Reporter is the shadow mode. It runs as a read-only sidecar — same DaemonSet deployment, same forwarder integration — but never modifies or optimizes events. It reports cost attribution and reduction projections before you deploy Edge Optimizer.

Edge Optimizer itself has no pass-through mode. The standard evaluation workflow is: deploy Edge Reporter first, validate the projections, then swap in Edge Optimizer when you're confident.

Is Edge Regulator lossy? How do I protect critical events

Yes — Edge Regulator is explicitly sampling-based and lossy by design. Events that exceed their budget share are sampled down proportionally. If DEBUG events exceed 20% of your hourly budget, excess ones are dropped. That's the point — it's a budget enforcement tool.

How to protect critical events:

  • Severity boost: ERROR and WARN get a higher retention multiplier — they're sampled last, after INFO and DEBUG are reduced
  • Per-type budget floors: Configure a minimum retention floor per event type so nothing is sampled below a threshold
  • Allowlist: Don't regulate event types you can't afford to lose (security logs, audit logs, compliance events)

Edge Optimizer (lossless) is a completely different product — it never drops events. Use Edge Reporter + Edge Optimizer for cost reduction without data loss. Add Edge Regulator only for event types you've validated are safe to sample.

Does Edge Optimizer deduplicate events

No — this is not deduplication. Edge Optimizer stores the structure of a log type (field names, patterns, positions) once per type. Each event stores only its unique values for that structure.

Two nginx 404s with different URLs remain fully separate events — both URLs are preserved. Nothing is merged, fingerprinted, or deduplicated. Think of it like Protocol Buffers: the schema is shared, but every message retains its own data.
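A toy shell analogy of the idea (this is not the engine's actual format): the structure is stored once, each event keeps only its values, and both events reconstruct in full:

```shell
# One shared template per event type; each event stores only its values
template='%s - - [%s] "GET %s HTTP/1.1" 404 %s\n'
# Two distinct 404 events: different URLs, both fully preserved
printf "$template" "10.0.0.1" "12/Mar/2025:10:00:01" "/missing-a" "153"
printf "$template" "10.0.0.2" "12/Mar/2025:10:00:02" "/missing-b" "162"
```

Nothing is merged: expanding each value set against the shared template yields two complete, distinct log lines.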

Throughput & Limits

What are the throughput limits

The sidecar processes ~1 MB/sec per thread at 512 MB heap — roughly 86 GB/day per thread at baseline. Throughput scales linearly with thread count:

| Config | Throughput | Daily volume |
|---|---|---|
| 512 MB heap, 1 thread | ~1 MB/s | ~86 GB/day |
| 512 MB heap, 2 threads | ~2 MB/s | ~170 GB/day |
| 1 GB heap, 4 threads | ~4 MB/s | ~340 GB/day |
| 2 GB heap, 8 threads | ~8 MB/s | ~690 GB/day |
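The MB/s-to-daily-volume conversion in the table is just seconds-per-day arithmetic; a sketch:

```shell
# GB/day = MB/s x 86,400 seconds per day / 1,000 MB per GB
mb_per_s=1
gb_per_day=$(( mb_per_s * 86400 / 1000 ))
echo "${gb_per_day} GB/day per thread"
```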

When the sidecar exceeds its processing capacity, backpressure activates. The sidecar blocks on input read, and your forwarder buffers events to disk. The sidecar never crashes from volume alone — it blocks, allowing the forwarder to handle the backlog.

Data Portability

Can I expand events if I stop using Log10x

Yes — the data format is fully readable and the expansion tools are open source:

Events are stored as readable text — not binary blobs. You can inspect, expand, and migrate them without Log10x running. No vendor lock-in on the data format.

Pre-Production Testing

How should I test before production

Three-phase approach — local validation, staging verification, then phased production rollout:

Phase 1: Local Dev Tool (30 minutes)

  1. Export sample logs from production (100K events, representative traffic)
  2. Run Dev locally on your sample
  3. Verify:
    • Optimization/filtering ratio matches the expected range
    • No data loss (all fields present in expanded output)
    • Parsing accuracy (custom formats detected correctly)

Phase 2: Staging DaemonSet (24 hours)

  1. Deploy sidecar as DaemonSet in staging cluster (same config as production target)
  2. Run your actual staging workload through it
  3. Monitor:
    • Sidecar pod health (1/1 Running, 0 restarts)
    • CPU/memory usage (should be <80%)
    • Optimization ratio (should match Dev tool results)
    • Query latency in staging SIEM (expect ~1.25x overhead for Splunk/ES)
  4. Validate dashboard queries still work
  5. Leave running for 24 hours — check for OOM/restart patterns or memory leaks

Phase 3: Production Rollout (staged canary)

  • Canary (Day 1): Deploy to 1–2 nodes. Monitor for 24 hours
  • Ramp (Days 2–3): Deploy to 25% of nodes. Monitor for issues
  • Full (Day 4+): Deploy to remaining nodes. Monitor continuously

What should I validate before each deployment phase

Pre-deployment checklist:

  • [ ] Sidecar pod is healthy (1/1 Running, 0 restarts)
  • [ ] CPU usage <70% p95
  • [ ] JVM heap not approaching -Xmx limit
  • [ ] Logs arriving at SIEM (no drop in event count)
  • [ ] Optimization ratio ±5% of expected (e.g., expected 60%, actual 55–65%)
  • [ ] Query latency acceptable (if using Splunk/ES, expect ~1.25x overhead)
  • [ ] Dashboards and saved searches render correctly
  • [ ] Alerts firing normally (no delays or delivery failures)
  • [ ] Cost metrics showing in Log10x Console (if available)
  • [ ] No unexpected OOM or restart events in the previous 24 hours

If any check fails: Do not proceed to the next phase. Investigate the root cause before advancing, or contact support for help.

How do I roll back if something goes wrong

Immediate rollback:

kubectl delete daemonset log10x-optimizer -n log-collection

  • Execution time: ~1 minute for the DaemonSet to terminate all pods
  • Log delivery: Forwarders automatically resume normal operation (direct to destination)
  • Data loss: NO — nothing is lost during rollback
  • Recovery monitoring: After 2 minutes, log volume at SIEM should return to normal (pre-optimization rate)

Production safety checklist: Before, during, and after deployment

Pre-deployment (Day 0):

  • [ ] Staging validation passed for 24+ hours without issues
  • [ ] Canary deployment plan documented (which nodes, rollback trigger)
  • [ ] Alerting configured for sidecar health:
    • [ ] Sidecar restarts (pod_restart_count > 0)
    • [ ] High CPU (container_cpu_usage > 80%)
    • [ ] High memory (container_memory_usage > 90%)
    • [ ] Slow processing (engine_process_duration_p99 > 50ms)
  • [ ] Rollback procedure tested (helm uninstall command ready)
  • [ ] Team trained on what to monitor during deployment
  • [ ] On-call escalation path defined (who to contact if issues arise)

During canary deployment (Days 1-3):

  • Canary phase (Day 1: 1-2 nodes):

    • [ ] Sidecar pods healthy: kubectl get pods -l app=log10x-optimizer
    • [ ] No restart loops: kubectl logs -l app=log10x-optimizer | grep ERROR
    • [ ] CPU/memory within limits: kubectl top pods -l app=log10x-optimizer
    • [ ] Events arriving at SIEM: Verify ingestion rate hasn't dropped >5%
    • [ ] Optimization ratio confirmed: Check metrics in Log10x Console or Prometheus
    • [ ] No alerts triggered: Monitor alerting system for sidecar health alerts
    • Hold for 24 hours before proceeding
  • Ramp phase (Days 2-3: 25% of fleet):

    • [ ] Repeat canary checks on new nodes
    • [ ] Compare SIEM query latency across canary vs ramp nodes (expect ~1.25x overhead)
    • [ ] Check for pod affinity issues (pods not scheduling evenly)
    • [ ] Verify no correlated failures (e.g., all pods on one type failing)
  • Full deployment (Days 4+):

    • [ ] All nodes running sidecar successfully
    • [ ] Fleet-wide metrics stable (no degradation)
    • [ ] Cost metrics showing in reporting dashboard
    • [ ] Team confident in monitoring procedures

Post-deployment (ongoing):

  • Weekly:

    • [ ] Sidecar restart count trending down (expect near-zero)
    • [ ] CPU/memory usage stable and predictable
    • [ ] Optimization ratio consistent with expectations
    • [ ] SIEM query performance stable (no regression)
  • Monthly:

    • [ ] Cost savings tracking against projections
    • [ ] No memory leak indicators (heap size not growing unbounded)
    • [ ] Update sidecar image to latest patch version
    • [ ] Review and tune threadPoolSize if CPU utilization changed
  • At first sign of problems:

    • [ ] Check sidecar logs: kubectl logs -l app=log10x-optimizer --tail=100
    • [ ] Review metrics: Check CPU, memory, and processing latency in Prometheus
    • [ ] Verify forwarder health: Is the forwarder still running normally?
    • [ ] If unrecoverable: Execute rollback (kubectl delete daemonset log10x-optimizer)