Edge Sidecar
This page covers reliability, resource requirements, failure modes, and pre-production testing procedures for all Edge Sidecar apps. For product-specific questions about Edge Reporter, Regulator, and Optimizer, see the FAQs listed below in the navigation.
For detailed questions about each product:
- Edge Reporter — Cost analysis, reduction projections, ROI calculations before deploying optimization
- Edge Regulator — Event sampling and budget enforcement based on log severity and type
- Edge Optimizer — Lossless log optimization via compact event format and search-time expansion
Resource Requirements
What are the resource requirements
The 10x Engine runs on the JVM (HotSpot or GraalVM Native Image) at the edge as a sidecar. You control both CPU and memory:
- Memory: Set via `-Xmx` (e.g., `-Xmx512m`). The JVM heap won't exceed this ceiling. The JVM also allocates memory for metaspace, thread stacks, and the JIT code cache — budget ~2x the heap for the full JVM footprint (e.g., ~1 GB for `-Xmx512m`). For forwarders that run 10x as a sidecar container (OTel Collector, Logstash), set `resources.limits.memory: 1Gi` on the 10x container. For forwarders that embed 10x (Fluentd, Fluent Bit, Filebeat), add ~1 GB to the forwarder container's memory limit
- CPU: Set via `threadPoolSize` — a fixed thread count (e.g., `2`) or a fraction of available cores (e.g., `0.25` = 25%)
A single node with 512 MB heap and 2 threads handles 100+ GB/day. Sub-millisecond per event — template matching is a hash lookup against cached hidden classes, not regex evaluation or JSON parsing. Processing is in-memory with no disk I/O. Backpressure throttles input if the pipeline approaches its resource limit.
| Your daily volume | Nodes (2 threads each) | Headroom |
|---|---|---|
| 1 TB/day | 10 nodes | 10x |
| 5 TB/day | 50 nodes | 2x per node |
| 15 TB/day | 150 nodes | 2x per node |
Both values map directly to standard Kubernetes resource specs in your DaemonSet manifest. The engine communicates with your log forwarder via IPC (inter-process communication) — no network hop, no config changes to your existing forwarder.
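As a sketch, the sizing guidance above maps onto a DaemonSet container spec like the following. The container name and image are placeholders, not the shipped manifest:

```yaml
# Illustrative sidecar container spec — name and image are placeholders.
containers:
  - name: tenx-sidecar
    image: example/10x-engine:latest
    env:
      - name: JAVA_OPTS
        value: "-Xmx512m"        # JVM heap ceiling
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 1Gi              # ~2x heap for the full JVM footprint
        cpu: "2"                 # pairs with threadPoolSize: 2
```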
What are the per-node capacity limits
The engine scales linearly with memory and CPU. Plan your resources based on daily log volume and desired headroom:
| Daily volume | Recommended config | Optimization latency | Concurrent events |
|---|---|---|---|
| <10 GB | 256 MB heap, 1 thread | <1ms batch | 10–100 |
| 10–50 GB | 512 MB heap, 2 threads | 1–2ms batch | 100–1K |
| 50–100 GB | 1 GB heap, 4 threads | 2–5ms batch | 1K–10K |
| 100+ GB | 2 GB heap, 8 threads | 5–10ms batch | 10K+ |
Scaling beyond 150 GB/day: Add more nodes (scale horizontally). Each node's sidecar operates independently — no state sharing, no coordination overhead. A 300 GB/day environment requires ~3 nodes at 100 GB/day capacity each.
Reduction ratio remains constant across volume scales (50–70% for typical application logs) — doesn't degrade as volume increases.
How do I know if I need to scale the sidecar
Signs you need more resources:
- Sidecar CPU consistently >80% of its limit
- Sidecar memory consistently >90% of its limit
- Optimization latency increases noticeably (captured in metrics)
- Queue depth growing (unprocessed events accumulating)
How to measure:
- Enable metrics export:
metrics.export.prometheus: truein config - Query Prometheus:
histogram_quantile(0.99, rate(engine_process_duration_ms[5m])) - If p99 latency >50ms, consider scaling
- Check for backpressure triggers in logs: "input throttled" or "queue full"
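The latency threshold above can be wired into a Prometheus alerting rule. A sketch, with illustrative rule names and a metric name that may differ in your deployment:

```yaml
# Illustrative alerting rule — group/alert names are placeholders.
groups:
  - name: tenx-sidecar
    rules:
      - alert: SidecarProcessingSlow
        expr: histogram_quantile(0.99, sum(rate(engine_process_duration_ms_bucket[5m])) by (le)) > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "10x sidecar p99 processing latency above 50ms; consider scaling"
```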
How to scale:
- Horizontal (recommended): Add more nodes via
kubectl scale daemonset ...— DaemonSet auto-deploys the sidecar - Vertical: Increase heap/thread limits on existing nodes — requires rolling restart
- Best practice: Horizontal scaling provides better fault isolation and easier operational management
Performance Characteristics
What reduction ratios should I expect
Depends on your log mix. Typical ranges:
| Scenario | Reduction |
|---|---|
| K8s and OTel workloads | 50–65% |
| Highly structured events | up to 8x (87.5%) |
| Grouped instances (stack traces) | >90% |
| Edge Optimizer + Storage Streamer combined | 80%+ |
Example: A Kubernetes pod_workers.go event at 1,835 bytes raw becomes 662 bytes compact — 64% reduction. A verbose OTel log at 4,265 bytes becomes 520 bytes — 88% reduction.
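Those percentages follow directly from the byte counts:

```python
def reduction_pct(raw_bytes, compact_bytes):
    """Percent size reduction from raw to compact."""
    return round(100 * (raw_bytes - compact_bytes) / raw_bytes)

assert reduction_pct(1835, 662) == 64   # Kubernetes pod_workers.go event
assert reduction_pct(4265, 520) == 88   # verbose OTel log
assert 100 * (1 - 1 / 8) == 87.5        # "up to 8x" equals 87.5% reduction
```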
Run Dev on your own log files to measure your actual ratio — free, no account needed.
What is the search-time overhead for compact events
Depends on your log platform:
- Splunk: The open-source 10x for Splunk app expands compact events transparently at search time. A one-time template resolution (~0.5–2s per search) matches search terms against the template index. Per-event expansion uses a KV Store primary-key lookup and native SPL functions — negligible overhead. Queries, dashboards, and alerts work unchanged
- Elasticsearch (self-hosted): The L1ES Lucene plugin expands compact events during search at ~1.25x search time. 50% less indexed volume offsets the expansion cost — fewer data nodes, less SSD, lower compute. For managed Elasticsearch (Elastic Cloud, OpenSearch Service), Storage Streamer expands and re-indexes on-demand
- Datadog / CloudWatch: Edge Regulator sends events in standard format — no expansion needed. For events routed to S3 via Edge Optimizer, Storage Streamer expands and streams to your platform on-demand
How does the engine achieve high throughput
Sub-millisecond per event. Each event is matched to a cached TenXTemplate via hash lookup — not parsed from scratch with regex. The AOT compiler scans source code and containers to generate symbol libraries. At runtime, the engine uses those symbols to dynamically assign cached hidden classes — one per event type. Structure is resolved once per type, not per instance.
Scale throughput by adding threads via threadPoolSize — a fixed count (e.g., 4) or a fraction of cores (e.g., 0.5 = 50%). Events are batched (1,000 per batch, 2s flush interval) and distributed across the thread pool.
The engine runs on the JVM (HotSpot or GraalVM Native Image) at the edge or in the cloud. Memory is explicitly capped via -Xmx — no runaway consumption. No SDKs, no bytecode injection, no application overhead. Application code runs exactly as written.
What happens during traffic spikes
Backpressure throttles input when the pipeline approaches its resource limit:
- Interval byte limit — cap bytes read per time window (e.g., 10 MB/min)
- Total byte/event limit — cap total volume from a source
- Total duration — cap how long an input reads
Design: Backpressure defers buffering to the forwarder
By design, the sidecar doesn't buffer events itself — that job belongs to your forwarder, which is purpose-built for it:
- Sidecar queue fills → input read blocks (the Unix socket/pipe)
- Forwarder detects the blocked pipe → OS socket/pipe buffer fills up
- Forwarder's buffering takes over (Fluentd file buffer, Fluent Bit mmap, OTel file_storage, Logstash persistent queue)
- Once the sidecar queue drains and load decreases, the forwarder drains its buffer
- No crashes, no heap overflow, no data loss
This approach is simpler, more reliable, and avoids duplicate buffering logic. Combined with fail-open design, log delivery is never interrupted.
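As one concrete example of forwarder-side buffering, an OTel Collector can persist its sending queue to disk with the file_storage extension, so events survive while the sidecar blocks. Endpoint and directory below are placeholders:

```yaml
# Illustrative OTel Collector persistent-queue config; paths/endpoints are placeholders.
extensions:
  file_storage:
    directory: /var/lib/otelcol/buffer
exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage    # spill the queue to disk under backpressure
service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlp]
```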
Failure Modes
Is 10x highly available
No single point of failure. Edge apps run as DaemonSets — one sidecar per node. Each sidecar operates independently with no shared state or coordination between nodes. A failure on one node has no effect on any other node.
- Node-level isolation: Each sidecar processes only its own node's logs. No cross-node dependencies, no leader election, no quorum
- Fail-open: If a sidecar fails, the forwarder on that node continues sending logs at full volume. You temporarily lose cost savings on that node — not observability
- Automatic recovery: Standard Kubernetes restart policies and health probes restart failed sidecars without manual intervention
- Cloud apps: Cloud Reporter and Storage Streamer run as standard pods — scale replicas for redundancy. Both are read-only and async to the data flow
What happens if the Edge sidecar fails
Fail-open design with forwarder-specific recovery:
- During sidecar crash/OOM: Logs bypass optimization and flow at full volume directly from the forwarder to your destination (Splunk, Datadog, Elasticsearch, CloudWatch)
- Data loss: NO — fail-open design preserves all logs. You temporarily lose the cost benefit on that node only
- Cost impact: During the outage, that node's logs ship unoptimized. Once sidecar restarts, normal processing resumes
Recovery behavior depends on your forwarder:
| Forwarder | Recovery | Latency | Manual Action |
|---|---|---|---|
| Fluentd | Auto-respawn (built-in exec_filter) | <10s | None |
| Fluent Bit | Auto-respawn via Lua script (up to 10 retries) | 0-50s | None (transient) / Check logs (persistent) |
| Filebeat | Auto-respawn via supervisor script + K8s probe | <10s | None (unless persistent failures) |
| OTel Collector | Auto-restart via K8s liveness probe | <30s | None |
For Fluentd & Filebeat (recommended): The forwarder automatically restarts the sidecar. You'll see a brief spike in unoptimized logs (10-30 seconds), then normal processing resumes.
For Fluent Bit: The Lua-based supervisor respawns 10x up to 10 times with exponential backoff. For transient crashes, recovery is automatic. For persistent crashes, check TENX_RUN_ARGS and logs after 50 seconds.
For OTel Collector: The 10x sidecar is a separate container in the same pod. A Kubernetes liveness probe monitors the sidecar and automatically restarts it if it crashes. OTel buffers events while 10x is down. Once the sidecar restarts, OTel drains the queue automatically. By design, you don't manage the restart — the cluster does.
Why fail-open? If the sidecar fails, your forwarder continues operating normally. No logs are lost, and observability continues unchanged — you just lose cost savings temporarily on that node.
Monitoring: Configure alerts for:
- `pod_restart_count > 0` — track sidecar restarts
- Sidecar liveness probe failures — K8s will restart after 2 consecutive failures
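In Prometheus terms, a restart alert might use the standard kube-state-metrics counter. The container label and window below are placeholders:

```yaml
# Illustrative restart alert — assumes kube-state-metrics; selectors are placeholders.
groups:
  - name: tenx-sidecar-health
    rules:
      - alert: SidecarRestarted
        expr: increase(kube_pod_container_status_restarts_total{container="tenx-sidecar"}[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "10x sidecar restarted; check for OOM or crash loops"
```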
What happens if the sidecar can't keep up with log volume
When the sidecar can't keep up, the forwarder buffers events to disk:
- Scenario: Log volume spikes beyond the sidecar's capacity (e.g., 100+ GB/day on a 512 MB, 2-thread node)
- What happens: Sidecar queue fills → sidecar stops reading from the forwarder → forwarder detects this and buffers events to disk instead
- Forwarder buffering: Each forwarder handles this natively (Fluentd file buffer, Fluent Bit mmap storage, OTel file_storage, Logstash persistent queue)
- Recovery: When the sidecar catches up and queue drains, the forwarder drains its buffer. All events are processed in order
- Data loss: NO — all events are queued or delivered
Prevention: Monitor sidecar CPU/memory usage. If consistently >90%, either:
- Scale horizontally by adding more nodes (DaemonSet auto-deploys)
- Scale vertically by increasing heap/thread limits
- Reduce noisy log volume using Edge Regulator
Note: Backpressure is a protective feature, not a failure — it prevents cascade effects and protects downstream systems by letting the forwarder's buffering absorb the spike.
What if the forwarder crashes while the sidecar is running
Behavior depends on your deployment model — both safe:
Embedded sidecars (Fluentd, Fluent Bit, Logstash, Filebeat):
- The 10x sidecar is spawned as a child process by the forwarder. When the forwarder crashes, the sidecar crashes with it
- When the forwarder restarts, it automatically respawns 10x
- Recovery times (same as when 10x crashes):
  - Fluentd: <10s (built-in respawn)
  - Fluent Bit: 0-50s (Lua supervisor with backoff)
  - Filebeat: <10s (supervisor script) — disk queue must be enabled in Filebeat config to buffer events while 10x is down
  - Logstash: <10s (pipe output auto-respawn)
- Data loss: NO — your forwarder's buffer holds all events until 10x resumes
- Cost impact: During downtime, logs ship unoptimized at full volume. Once 10x restarts, normal processing resumes
Standalone sidecar (OTel Collector):
- The 10x sidecar is a separate container listening on a Unix socket. It waits passively for OTel to reconnect
- OTel buffers events locally (configurable limit, typically 1 GB = hours of data) while 10x is paused
- Recovery: Automatic when OTel restarts
- Data loss: NO — OTel's buffer preserves all events
In all cases: Your forwarder handles all input/output resilience (buffering, retries, batching). The sidecar just optimizes data in-memory and doesn't maintain external network connections.
What if network connectivity is interrupted
Two different paths — both safe:
- Sidecar ↔ Forwarder (local IPC): Both run on the same node. They communicate via local Unix sockets (or in-process if embedded). Network interruptions have NO effect on this path
- Forwarder ↔ SIEM (network outage): This is handled entirely by your forwarder (Fluentd, Filebeat, OTel Collector, etc.), not by 10x. Your forwarder's buffering, retries, and failover logic apply as normal
The 10x sidecar doesn't maintain external network connections. Log optimization happens locally and independently of network state. Network resilience is the responsibility of your forwarder and destination.
How does Filebeat handle sidecar crashes (Kubernetes)
Filebeat uses a supervisor script with Kubernetes liveness probe:
Filebeat is unique because it doesn't have native subprocess respawn like Fluentd. To safely run Filebeat with 10x in Kubernetes, we use a two-level recovery mechanism:
Level 1: Supervisor Script (in container)
- When 10x crashes, the pipe breaks
- The supervisor script detects this and respawns 10x automatically
- Respawn happens in <5 seconds
- Filebeat's output is buffered, so logs don't get blocked
Level 2: Kubernetes Liveness Probe (at pod level)
- Every 5 seconds, K8s checks if the 10x process is running: `pgrep -f '10x run'`
- It also checks the Filebeat HTTP endpoint: `curl 127.0.0.1:5066`
- If 10x is missing for 2 consecutive checks (10 seconds), K8s restarts the pod
- Acts as a safety net for persistent issues
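That probe can be expressed in the pod spec roughly like this, combining the two checks above. The exact wiring in the shipped manifests may differ:

```yaml
# Illustrative liveness probe matching the checks described above.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - "pgrep -f '10x run' && curl -sf 127.0.0.1:5066"
  periodSeconds: 5        # check every 5 seconds
  failureThreshold: 2     # restart after 2 consecutive failures (10s)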
Recovery Timeline:
- T=0s: 10x crashes
- T<5s: Supervisor detects and respawns
- T=5-10s: New 10x JVM warming up
- T=10s: Ready, resume processing
- Total: ~10 seconds
If respawn fails repeatedly:
- Supervisor keeps retrying (5s delay between attempts)
- K8s liveness probe detects the pattern after 2 failures
- K8s kills and restarts the entire pod
- Logs buffer during restart, resume after pod starts
Monitoring:
- Watch `kubectl describe pod <filebeat-pod>` for LastProbeTime on the liveness probe
- Check pod logs for supervisor respawn messages
- Alert on `pod_restart_count > 0` to catch persistent issues
This approach matches Fluentd's safety (automatic respawn) while working within Filebeat's architectural constraints (single output pipe).
What happens during other failures
Each scenario has independent recovery:
| Failure | Behavior |
|---|---|
| Node restart | Standard DaemonSet scheduling — sidecar starts with the node |
| Symbol library unavailable | Sidecar uses its last-fetched library version. JIT creates templates at runtime for unmatched events |
| Output destination down | Your forwarder handles output buffering and retries — 10x does not change this behavior |
| License server unreachable | Sidecar continues operating with cached license. No interruption to processing |
| Cloud Reporter/Streamer unavailable | Cloud apps are read-only and async to the data flow — failures don't block edge log delivery |
Monitoring: Use Prometheus metrics and Grafana dashboards with alerts to detect failures and coordinate recovery. DaemonSet restart policies and health probes ensure automatic restart without manual intervention.
What if the downstream SIEM is slow or unreachable
The sidecar is upstream of the SIEM — it outputs to the forwarder via local IPC, not directly to the SIEM. A slow or unreachable SIEM is your forwarder's concern. Fluentd, Fluent Bit, OTel Collector, and the others all have their own output buffering, retries, and backoff for this scenario. The sidecar is unaffected.
Fail-open still applies: if the sidecar itself goes down while the SIEM is slow, logs bypass optimization and flow directly from the forwarder to the destination — no logs lost.
Operations & Deployment
How do I roll back
Remove the DaemonSet. Logs flow directly from your forwarder to your analytics tool — no 10x in the path. Rollback is a single Helm command: uninstall the release, or roll back to a previous chart version with `helm rollback`.
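Using the release name from the upgrade example on this page (`my-edge-optimizer` in the `logging` namespace; substitute your own), the two paths look like:

```shell
# Remove the sidecar entirely
helm uninstall my-edge-optimizer --namespace logging

# Or roll back to a previous chart revision
helm rollback my-edge-optimizer --namespace logging
```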
- Blast radius: Only the 10x sidecar is removed. Your forwarder (Fluentd, Fluent Bit, OTel, Filebeat) continues running unchanged
- Previously stored data: Compact data already indexed in your analyzer (Splunk, Elasticsearch) remains searchable via the expansion plugin or Storage Streamer. Removing 10x does not affect previously indexed data
- Logs during rollback: Once the sidecar is removed, logs ship at full volume to your analyzer
- No data migration needed — all logs stay in your infrastructure
How do I upgrade in production
Standard Helm upgrade with rolling updates. Kubernetes updates one DaemonSet pod at a time — no cluster-wide downtime:
```shell
helm repo update
helm upgrade my-edge-optimizer log10x-fluent/fluent-bit \
  -f my-edge-optimizer.yaml \
  --namespace logging
```
- Rolling update: Each node's pod restarts individually. While a pod restarts, the forwarder on that node buffers events or ships them unoptimized (per the fail-open design) until the sidecar recovers
- Symbol library updates: Distributed via GitOps. Edge and cloud apps pull updates automatically at a configurable interval — no restart needed
- Configuration changes: Managed via GitOps or Helm values. Changes propagate to running pods without manual per-node intervention
Delivery Guarantees
Can events be processed twice? What are the ordering guarantees
At-least-once delivery. If the forwarder sends an event twice — for example, due to a retry after a sidecar restart — it will be processed twice. The sidecar has no deduplication.
Ordering is preserved. Events are processed in the order they arrive via stdin or Unix socket. The sidecar preserves forwarder-side ordering — no events are reordered.
What happens to templates when the sidecar restarts
Templates are rebuilt from the AOT symbol library on startup — nothing is persisted across restarts. Events that arrive before the library finishes loading are handled by the JIT engine, which builds templates at runtime from the events themselves.
Product Safety Features
Does Log10x have a shadow/pass-through mode for safe evaluation
Edge Reporter is the shadow mode. It runs as a read-only sidecar — same DaemonSet deployment, same forwarder integration — but never modifies or optimizes events. It reports cost attribution and reduction projections before you deploy Edge Optimizer.
Edge Optimizer itself has no pass-through mode. The standard evaluation workflow is: deploy Edge Reporter first, validate the projections, then swap in Edge Optimizer when you're confident.
Is Edge Regulator lossy? How do I protect critical events
Yes — Edge Regulator is explicitly sampling-based and lossy by design. Events that exceed their budget share are sampled down proportionally. If DEBUG events exceed 20% of your hourly budget, excess ones are dropped. That's the point — it's a budget enforcement tool.
How to protect critical events:
- Severity boost: ERROR and WARN get a higher retention multiplier — they're sampled last, after INFO and DEBUG are reduced
- Per-type budget floors: Configure a minimum retention floor per event type so nothing is sampled below a threshold
- Allowlist: Don't regulate event types you can't afford to lose (security logs, audit logs, compliance events)
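As an illustration only, the three protections might be expressed in a config like the one below. These key names are hypothetical, not the actual Edge Regulator schema:

```yaml
# Hypothetical illustration of the three protections — key names are invented.
regulator:
  severity_boost:          # higher multiplier = sampled last
    ERROR: 4.0
    WARN: 2.0
  budget_floors:           # minimum retention per event type
    audit_event: 1.0       # never sample below 100%
  allowlist:               # never regulated at all
    - security_log
    - compliance_event
```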
Edge Optimizer (lossless) is a completely different product — it never drops events. Use Edge Reporter + Edge Optimizer for cost reduction without data loss. Add Edge Regulator only for event types you've validated are safe to sample.
Does Edge Optimizer deduplicate events
No — this is not deduplication. Edge Optimizer stores the structure of a log type (field names, patterns, positions) once per type. Each event stores only its unique values for that structure.
Two nginx 404s with different URLs remain fully separate events — both URLs are preserved. Nothing is merged, fingerprinted, or deduplicated. Think of it like Protocol Buffers: the schema is shared, but every message retains its own data.
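The Protocol Buffers analogy can be made concrete with a toy sketch in Python. This is purely illustrative; it is not the actual compact event format:

```python
# Illustrative sketch only, not the real 10x format: store a log type's
# structure once, and only the unique values per event.
TEMPLATE = '{ip} - - [{ts}] "GET {url} HTTP/1.1" 404 {size}'

def compact(template_id, values):
    """A compact event keeps a template reference plus only its unique values."""
    return {"tpl": template_id, "vals": values}

def expand(templates, event):
    """Search-time expansion: substitute the values back into the shared template."""
    return templates[event["tpl"]].format(**event["vals"])

templates = {1: TEMPLATE}  # the structure is stored once per event type
e1 = compact(1, {"ip": "10.0.0.1", "ts": "01/Jan/2025", "url": "/a", "size": "153"})
e2 = compact(1, {"ip": "10.0.0.2", "ts": "01/Jan/2025", "url": "/b", "size": "153"})

# Both 404s remain fully separate events; both URLs are preserved.
assert expand(templates, e1) == '10.0.0.1 - - [01/Jan/2025] "GET /a HTTP/1.1" 404 153'
assert expand(templates, e2) == '10.0.0.2 - - [01/Jan/2025] "GET /b HTTP/1.1" 404 153'
```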
Throughput & Limits
What are the throughput limits
The sidecar processes ~1 MB/sec per thread at 512 MB heap — roughly 86 GB/day per thread at baseline. Throughput scales linearly with thread count:
| Config | Throughput | Daily volume |
|---|---|---|
| 512 MB heap, 1 thread | ~1 MB/s | ~86 GB/day |
| 512 MB heap, 2 threads | ~2 MB/s | ~170 GB/day |
| 1 GB heap, 4 threads | ~4 MB/s | ~340 GB/day |
| 2 GB heap, 8 threads | ~8 MB/s | ~690 GB/day |
When the sidecar exceeds its processing capacity, backpressure activates. The sidecar blocks on input read, and your forwarder buffers events to disk. The sidecar never crashes from volume alone — it blocks, allowing the forwarder to handle the backlog.
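The daily-volume column is just the per-thread rate multiplied out (decimal units, 86,400 seconds per day):

```python
def daily_gb(threads, mb_per_sec_per_thread=1.0):
    """Approximate daily capacity in GB (decimal) for a given thread count."""
    seconds_per_day = 86_400
    return threads * mb_per_sec_per_thread * seconds_per_day / 1_000  # MB -> GB

assert round(daily_gb(1)) == 86    # table: ~86 GB/day
assert round(daily_gb(2)) == 173   # table rounds down to ~170 GB/day
assert round(daily_gb(8)) == 691   # table rounds to ~690 GB/day
```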
Data Portability
Can I expand events if I stop using Log10x
Yes — the data format is fully readable and the expansion tools are open source:
- Splunk: The 10x for Splunk app is open source on GitHub, with documented expand utilities
- Elasticsearch: The L1ES Lucene plugin is open source on GitHub, with documented expand utilities
- Python utility: A standalone Python expander is available for scripting and offline access
Events are stored as readable text — not binary blobs. You can inspect, expand, and migrate them without Log10x running. No vendor lock-in on the data format.
Pre-Production Testing
How should I test before production
Three-phase approach — local validation, staging verification, then phased production rollout:
Phase 1: Local Dev Tool (30 minutes)
- Export sample logs from production (100K events, representative traffic)
- Run Dev locally on your sample
- Verify:
  - Optimization/filtering ratio matches the expected range
  - No data loss (all fields present in expanded output)
  - Parsing accuracy (custom formats detected correctly)
Phase 2: Staging DaemonSet (24 hours)
- Deploy sidecar as DaemonSet in staging cluster (same config as production target)
- Run your actual staging workload through it
- Monitor:
  - Sidecar pod health (1/1 Running, 0 restarts)
  - CPU/memory usage (should be <80%)
  - Optimization ratio (should match Dev tool results)
  - Query latency in the staging SIEM (expect ~1.25x overhead for Splunk/ES)
- Validate dashboard queries still work
- Leave running for 24 hours — check for OOM/restart patterns or memory leaks
Phase 3: Production Rollout (staged canary)
- Canary (Day 1): Deploy to 1–2 nodes. Monitor for 24 hours
- Ramp (Days 2–3): Deploy to 25% of nodes. Monitor for issues
- Full (Day 4+): Deploy to remaining nodes. Monitor continuously
What should I validate before each deployment phase
Pre-deployment checklist:
- [ ] Sidecar pod is healthy (1/1 Running, 0 restarts)
- [ ] CPU usage <70% p95
- [ ] JVM heap not approaching the `-Xmx` limit
- [ ] Logs arriving at SIEM (no drop in event count)
- [ ] Optimization ratio ±5% of expected (e.g., expected 60%, actual 55–65%)
- [ ] Query latency acceptable (if using Splunk/ES, expect ~1.25x overhead)
- [ ] Dashboards and saved searches render correctly
- [ ] Alerts firing normally (no delays or delivery failures)
- [ ] Cost metrics showing in Log10x Console (if available)
- [ ] No unexpected OOM or restart events in the previous 24 hours
If any check fails: Do not proceed to next phase. Investigate root cause before advancing, or contact support for help.
How do I roll back if something goes wrong
Immediate rollback:
- Execution time: ~1 minute for the DaemonSet to terminate all pods
- Log delivery: Forwarders automatically resume normal operation (direct to destination)
- Data loss: NO — nothing is lost during rollback
- Recovery monitoring: After 2 minutes, log volume at SIEM should return to normal (pre-optimization rate)
Production safety checklist: Before, during, and after deployment
Pre-deployment (Day 0):
- [ ] Staging validation passed for 24+ hours without issues
- [ ] Canary deployment plan documented (which nodes, rollback trigger)
- [ ] Alerting configured for sidecar health:
  - [ ] Sidecar restarts (`pod_restart_count > 0`)
  - [ ] High CPU (`container_cpu_usage > 80%`)
  - [ ] High memory (`container_memory_usage > 90%`)
  - [ ] Slow processing (`engine_process_duration_p99 > 50ms`)
- [ ] Rollback procedure tested (helm uninstall command ready)
- [ ] Team trained on what to monitor during deployment
- [ ] On-call escalation path defined (who to contact if issues arise)
During canary deployment (Days 1-3):
- Canary phase (Day 1: 1-2 nodes):
  - [ ] Sidecar pods healthy: `kubectl get pods -l app=log10x-optimizer`
  - [ ] No restart loops: `kubectl logs -l app=log10x-optimizer | grep ERROR`
  - [ ] CPU/memory within limits: `kubectl top pods -l app=log10x-optimizer`
  - [ ] Events arriving at SIEM: Verify ingestion rate hasn't dropped >5%
  - [ ] Optimization ratio confirmed: Check metrics in Log10x Console or Prometheus
  - [ ] No alerts triggered: Monitor alerting system for sidecar health alerts
  - Hold for 24 hours before proceeding
- Ramp phase (Days 2-3: 25% of fleet):
  - [ ] Repeat canary checks on new nodes
  - [ ] Compare SIEM query latency across canary vs ramp nodes (expect ~1.25x overhead)
  - [ ] Check for pod affinity issues (pods not scheduling evenly)
  - [ ] Verify no correlated failures (e.g., all pods on one type failing)
- Full deployment (Days 4+):
  - [ ] All nodes running the sidecar successfully
  - [ ] Fleet-wide metrics stable (no degradation)
  - [ ] Cost metrics showing in the reporting dashboard
  - [ ] Team confident in monitoring procedures
Post-deployment (ongoing):
- Weekly:
  - [ ] Sidecar restart count trending down (expect near-zero)
  - [ ] CPU/memory usage stable and predictable
  - [ ] Optimization ratio consistent with expectations
  - [ ] SIEM query performance stable (no regression)
- Monthly:
  - [ ] Cost savings tracking against projections
  - [ ] No memory leak indicators (heap size not growing unbounded)
  - [ ] Update sidecar image to latest patch version
  - [ ] Review and tune `threadPoolSize` if CPU utilization changed
- At first sign of problems:
  - [ ] Check sidecar logs: `kubectl logs -l app=log10x-optimizer --tail=100`
  - [ ] Review metrics: Check CPU, memory, and processing latency in Prometheus
  - [ ] Verify forwarder health: Is the forwarder still running normally?
  - [ ] If unrecoverable: Execute rollback (`kubectl delete daemonset log10x-optimizer`)