FAQ
The Storage Streamer stores logs in object storage (S3, Azure Blobs) and streams selected events to your SIEM (Splunk, Datadog, Elastic) on demand.
Overview
What is Storage Streamer and how does it work
Storage Streamer stores logs in S3 at $0.023/GB/month and indexes them at ingest time.
When you query, the system scans the index to find which files contain matching data. Only those files are streamed to your SIEM.
Ingestion costs apply only to what you query -- typically 5-30% of total volume -- not all your logs.
How does Storage Streamer reduce costs
Store 100% of logs in S3 at a fraction of analyzer costs. Pay your SIEM license only on the data you actually query -- typically 5-30% of total volume.
Typical cost reduction is 70-80%, depending on query patterns. See the pricing page for details.
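As a back-of-envelope check on that range (all figures below are illustrative assumptions, not quoted prices -- substitute your own volumes and license rate):

```shell
# Assumed inputs: 30,000 GB/month of logs, 20% of volume queried back,
# $2.00/GB SIEM license rate, $0.023/GB/month S3 Standard storage.
awk 'BEGIN {
  monthly_gb  = 30000
  queried_pct = 0.20
  siem_rate   = 2.00
  s3_rate     = 0.023

  before = monthly_gb * siem_rate                                       # ingest everything
  after  = monthly_gb * queried_pct * siem_rate + monthly_gb * s3_rate  # ingest 20%, store all
  printf "monthly cost before: $%.0f\n", before
  printf "monthly cost after:  $%.0f (%.0f%% reduction)\n", after, (1 - after / before) * 100
}'
```

At a 20% query rate the math lands at roughly 79%, inside the 70-80% range above; heavier query patterns pull the savings down.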
Querying & Performance
How do engineers search S3 data during an incident
Queries are initiated via REST API -- from a script, runbook, or CronJob. Results stream back into your existing analyzer.
- Send a query with time range and search expression via POST /streamer/query
- Bloom filter index identifies matching S3 files (<1 second)
- Matching events stream through Fluent Bit to your analyzer (Splunk HEC, Elasticsearch Bulk API, Datadog, CloudWatch)
- Events appear in Kibana / Splunk Search / Datadog Logs with original timestamps -- alongside your live data
Example -- find all payment errors in the last 6 hours:
curl -X POST http://streamer:8080/streamer/query \
-d '{"from":"now(\"-6h\")","to":"now()",
"search":"level == \"ERROR\" && message.includes(\"payment\")"}'
No separate UI to learn. Results are standard indexed events in your existing tool -- search, filter, and dashboard them the same way you always do. Events are permanently ingested; your analyzer's standard retention policy applies.
For recurring workflows (dashboard population, compliance scans), schedule queries via Kubernetes CronJob. See the query reference for search syntax and filtering options.
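For example, a CronJob wrapping the query API might look like the sketch below (the name, schedule, and image are placeholders; point the URL at your own streamer service):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: payment-error-scan      # placeholder name
spec:
  schedule: "0 * * * *"         # hourly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: query
              image: curlimages/curl:latest
              args:
                - -X
                - POST
                - http://streamer:8080/streamer/query
                - -d
                - '{"from":"now(\"-1h\")","to":"now()","search":"level == \"ERROR\" && message.includes(\"payment\")"}'
```

The pod runs curl once per hour and exits; results land in your analyzer the same way as any ad-hoc query.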
How do I query logs in object storage
Search by time range, source, and keywords.
- Storage Streamer scans the search index to identify which files contain matching data -- without reading the files themselves.
- Only matching files are streamed to your SIEM.
- Full-text search and analysis happen in your existing tool.
Retrieval times depend on result set size (see below).
How fast is data retrieval
Index lookups identify matching files in under 1 second. Fetching and streaming events depends on result set size and parallel worker configuration:
Baseline performance:
- ~10K events: 10-30 seconds with default parallel scan/stream configuration
- Index filters: Identify matching files in <1 second (Bloom filter accuracy is configurable, default ~1% false positive rate)
- Network limits: S3 API throughput and network bandwidth are your practical limits at scale
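The ~1% false-positive rate has a concrete meaning for scan volume. A quick sketch with made-up file counts:

```shell
# With a 1% false-positive rate, a query whose true matches live in 100 files
# out of 100,000 indexed files fetches roughly:
awk 'BEGIN {
  total_files  = 100000
  true_matches = 100
  fpr          = 0.01
  fetched = true_matches + (total_files - true_matches) * fpr
  printf "files fetched: ~%d of %d (%.2f%% of the archive)\n", fetched, total_files, fetched / total_files * 100
}'
```

Lowering the false-positive rate shrinks the wasted fetches at the cost of slightly larger index objects.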
How parallelism affects performance:
Storage Streamer uses parallel scan and stream workers to fetch and parse events concurrently. Configure via:
- queryScanFunctionParallelTimeslice -- max time range per scan worker (e.g., 1m = each worker scans 1 minute of index)
- queryScanFunctionParallelMaxInstances -- max number of parallel scan workers (default 1000)
- queryStreamFunctionParallelObjects -- max byte ranges per stream worker (default 50)
Example scaling:
- 100K events over a 10-minute time range with 1-minute timeslice = ~10 parallel scan workers executing simultaneously
- Total query time approaches the single-worker baseline (~10-30 sec) rather than scaling linearly with result set size
- Actual time depends on: file size distribution, S3 API rate limits, network bandwidth, and log parsing overhead
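The worker-count arithmetic above reduces to a simple division, capped by the instance limit:

```shell
awk 'BEGIN {
  range_min     = 10    # query time range in minutes
  timeslice_min = 1     # queryScanFunctionParallelTimeslice
  max_workers   = 1000  # queryScanFunctionParallelMaxInstances

  workers = range_min / timeslice_min
  if (workers > max_workers) workers = max_workers   # cap at the configured maximum
  printf "parallel scan workers: %d\n", workers
}'
```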
Real-world incident response:
- Incident: Payment processing errors spike
- Query: level == "ERROR" && service == "payments" for last 20 minutes
- Result: ~5K matching events (typically identified in <1-2 seconds by filters)
- Stream time: 10-20 seconds (depends on configured parallelism)
- Time to dashboard: <5 seconds (events appear in Kibana/Splunk after streaming completes)
- Total to triage: <30 seconds
Query optimization tips:
- Use time range filters to reduce scan volume
- Add service/host filters to narrow scope
- For massive result sets (>10M), split into smaller time windows
- Use query limits to prevent runaway queries
When to use Storage Streamer:
- Investigation, compliance searches, historical analysis (seconds to minutes of latency acceptable)
- Not suitable for: sub-second alerting -- keep critical log types streaming to your primary SIEM instead
Storage tiers: S3 Standard provides instant access. S3 Glacier requires restore time.
What do compact events look like? See real-world before/after examples showing how events are optimized before archival to S3.
What cloud storage services are supported
Currently supported:
- AWS S3 -- primary support, most customers deploy here
- Azure Blob Storage
Works with your existing bucket structure and log formats: JSON, plain text, and gzipped files. Common storage tiers: S3 Standard ($0.023/GB/month) or S3 Intelligent-Tiering for older data.
Google Cloud Storage is on the roadmap.
Which SIEMs can receive streamed data
Native integrations:
- Splunk (HEC)
- Elastic / OpenSearch
- Datadog
- AWS CloudWatch
Generic: Any HTTP endpoint, syslog, or TCP destination. Logs stream with original timestamps preserved.
For Splunk users, the optional Edge Optimizer compacts events at the edge for an additional 50% reduction in ingestion volume. Events are expanded transparently via the 10x for Splunk app at search time.
Architecture & Indexing
How are S3 Bloom filter indexes built and what compute resources do they use
Where indexing happens: Index workers running in EKS pods build Bloom filter indexes as files upload to S3.
Workflow:
- File uploads to S3 — your forwarder (Fluent Bit, Filebeat, Logstash) writes logs to S3 bucket
- S3 sends notification to SQS queue — S3 event notification triggers immediately (configured during deployment)
- Index worker pods consume from queue — streamer deployment's "index" role pods pull work from SQS
- Worker reads file from S3 and builds index — parses events, extracts template hashes and variable values
- Bloom filters written to S3 index bucket — lightweight index objects (typically <1KB per filter) stored alongside original data
- Query workers use indexes to skip files — when you run a query, index scans filters in <1 second to find matching files
Compute resources for indexing:
Indexing is CPU and memory intensive during file parsing. Default EKS pod resources:
- 1 CPU and 2GB memory per pod (see deployment guide)
- Autoscaling: 2-10 replicas depending on queue depth (default 2 min, scales to 10 if backlog grows)
- Throughput: one pod handles ~10-50 GB/day depending on event size and CPU availability
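A rough pod-count check using the throughput range above (the per-pod figure is an assumption -- benchmark your own workload):

```shell
awk 'BEGIN {
  daily_gb   = 300   # assumed daily log volume (GB)
  per_pod_gb = 30    # assumed per-pod throughput, inside the 10-50 GB/day range
  pods = daily_gb / per_pod_gb
  if (pods != int(pods)) pods = int(pods) + 1   # round up to whole pods
  printf "index worker pods needed: %d\n", pods
}'
```

At ~300 GB/day this already reaches the default autoscaling ceiling of 10 replicas, which is roughly where raising maxReplicas starts to matter.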
When indexing runs:
- Asynchronous — triggered immediately by S3 event notification, runs in parallel with queries
- Batch processing — multiple index workers process files concurrently from the SQS queue
- No re-indexing — indexes are built once at ingest time, never recomputed for queries
Cost implications:
Index building cost is baked into the EKS pod resource costs -- no per-GB indexing fee. You pay for:
- EKS pod compute (CPU + memory) running the index workers
- S3 storage for index objects (~1-5% overhead vs. original data size)
- SQS queue operations (~$0.40 per million messages)
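Plugging illustrative numbers into those line items (the overhead fraction and object count are assumptions drawn from the ranges above):

```shell
awk 'BEGIN {
  data_gb  = 10000              # archived data, GB
  overhead = 0.03               # index overhead, middle of the 1-5% range
  s3_rate  = 0.023              # $/GB/month, S3 Standard
  files    = 5000000            # S3 objects indexed per month (one SQS message each)
  sqs_rate = 0.40 / 1000000     # $ per message

  printf "index storage:  $%.2f/month\n", data_gb * overhead * s3_rate
  printf "SQS operations: $%.2f/month\n", files * sqs_rate
}'
```

Even at 10 TB archived, the indexing side costs stay in single-digit dollars per month; pod compute dominates.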
Optimization:
- All-in-one deployment: Single pod cluster handles index, query, and stream roles (simpler, suitable for <100 GB/day)
- Separate clusters: Dedicated index/query/stream pods allow independent scaling (recommended for >500 GB/day)
- See deployment topologies for sizing guidance
What happens if index building falls behind
SQS queue buffers work — no events lost:
- Files upload faster than indexing: the SQS queue grows and buffers the backlog (the queue comfortably holds thousands of pending messages)
- Indexing catches up: Additional index worker pods scale up automatically (via Kubernetes HPA)
- No data loss: Files remain in S3 unindexed, but queries still work (they just read unindexed files)
- Query performance: Unindexed files require full scan — slower than indexed queries but still functional
Monitoring index backlog:
# Check SQS queue depth (number of pending index jobs)
aws sqs get-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/ACCOUNT/index-queue \
--attribute-names ApproximateNumberOfMessages
# Check index worker pod status
kubectl get pods -n log10x-streamer -l role=index
# Check autoscaling status
kubectl get hpa -n log10x-streamer
Scaling policy:
By default, index workers scale to 10 replicas if queue depth exceeds the threshold. For high-volume environments (>1 TB/day), raise maxReplicas in your Helm values.
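For illustration, an override might look like the fragment below -- the key path is an assumption, so check your chart's values.yaml for the actual structure:

```yaml
# values-override.yaml (hypothetical key layout)
index:
  autoscaling:
    minReplicas: 2
    maxReplicas: 30   # raised from the default 10 for >1 TB/day
```

Apply it with helm upgrade -f values-override.yaml against your release.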
Use Cases
What are the main use cases?
Incident investigation, scheduled dashboard population, compliance and audit, and metric aggregation — each with concrete savings examples per SIEM vendor. See the full use cases section for query workflows, retrieval times, and cost breakdowns.
Query Limits
How do I prevent accidental over-ingestion when streaming from S3
Storage Streamer enforces automatic cost guardrails on every query to prevent expensive accidents:
Default limits per query:
- Processing time: 1 minute max execution -- query terminates automatically
- Result size: 10 MB max bytes returned -- query stops once reached
Examples:
Scenario 1: Quick incident investigation (safe defaults)
curl -X POST http://streamer:8080/streamer/query \
-d '{"from":"now(\"-1h\")","to":"now()","search":"level == \"ERROR\""}'
# Stops automatically after 1 minute or 10 MB, whichever comes first
# Typical cost: $0.10-$0.50 depending on volume
Scenario 2: Large historical pull (custom limits)
# Configure via Kubernetes values or query config to increase limits if needed:
# queryLimitProcessingTime: 5m # Increase to 5 minutes
# queryLimitResultSize: 500MB # Increase to 500 MB
# Only do this when you understand the cost implications
Safety mechanisms:
- Queries abort gracefully when limits are hit -- ingestion is capped, never runaway
- Default 10 MB limit at ~$0.025/GB ingestion cost = ~$0.00025 per query max
- For 1 TB daily ingestion, even 100 queries add only ~$0.025 of extra cost
- Override limits explicitly in config -- no silent surprises
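The worst-case arithmetic is easy to verify yourself (the $0.025/GB figure is the assumed ingestion rate from above):

```shell
awk 'BEGIN {
  limit_mb    = 10      # default per-query result size cap
  ingest_rate = 0.025   # assumed $/GB ingestion cost

  per_query = limit_mb / 1000 * ingest_rate
  printf "worst-case cost per query: $%.5f\n", per_query
  printf "100 runaway queries:       $%.3f\n", per_query * 100
}'
```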
Monitoring:
- Check query execution logs for "limit exceeded" messages
- Monitor S3 scan logs to see how many files matched your query
- Set up Kubernetes alerts on the streamer pod for failed queries
What happens if a query exceeds its limits
Processing time exceeded (1 minute default):
- Query terminates gracefully
- Partial results already streamed to the analyzer remain ingested
- No rollback -- the analyzer retains what was sent before the timeout
- Recommendation: adjust the query to use a narrower time range or add more filters
Result size exceeded (10 MB default):
- Query stops reading more files
- All events read so far are streamed to the analyzer
- No cost surprise -- ingestion is capped at the expected byte count
- Recommendation: add search filters to narrow scope (e.g., higher log level, specific service)
Both are safe failures — worst case you ingest partial results, not the entire historical dataset. Adjust limits per use case: tight for incident response, looser for scheduled compliance pulls.
Comparisons
How does Storage Streamer compare to Datadog Archives
Datadog offers three tiers for archived data, each with significant trade-offs:
| | Archive Search | Flex Logs | Rehydration | Storage Streamer |
|---|---|---|---|---|
| Method | Full scan of gzipped JSON -- no indexes | Columnar indexes (Husky) | Batch re-index into hot tier | Bloom filter indexes on your S3 |
| Cost | $0.10/GB scanned per query | Datadog-managed storage fees | Full indexing cost on rehydrated volume | S3 storage only -- no per-query fees |
| Speed | Slow -- reads every byte in range | Fast | Hours to complete | Seconds -- skips 99%+ of files |
| Analytics | None -- flat event list, 100K cap, 24hr expiry | Full Datadog analytics | Full (once rehydrated) | Full -- in any SIEM you choose |
| Data ownership | Your S3 bucket | Datadog-managed, proprietary format -- no export | Your S3 → Datadog hot tier | Your S3 bucket, open formats |
How does Storage Streamer compare to Splunk Federated Search for S3
Splunk's Federated Search for Amazon S3 (GA on Splunk Cloud) queries S3 data via AWS Glue. It's scan-based -- no indexes.
Key limitations:
- ~100 seconds per TB scanned, 10 TB max per search, 100K event default cap
- Splunk Cloud on AWS only -- no Enterprise, no on-prem, no FedRAMP
- Requires AWS Glue Data Catalog (additional AWS cost)
- Licensed via Data Search Units (DSUs) -- Splunk warns high-frequency use may cost more than native ingestion
Storage Streamer difference:
- Bloom filter indexes skip 99%+ of files -- no full-scan pricing or per-TB latency
- No 10 TB cap
- Works with Splunk Cloud and Enterprise (on-prem)
- Results stream as indexed events with original timestamps -- full SPL analytics
- Optional Edge Optimizer adds lossless 50% volume reduction
How does Storage Streamer compare to Cribl
Cribl's suite has three paid layers:
- Stream -- routes and filters data before your SIEM
- Lake -- stores data (BYOS option keeps it in your S3, but charges $0.02/GB/month management fee)
- Search -- queries Lake data using query-time compute
Storage Streamer difference:
- Logs stay in your S3 bucket with no additional management fees
- Search via pre-computed indexes, not query-time compute
- Results stream to your existing SIEM (Splunk, Elastic, Datadog)
- Uses your existing query tools. No per-GB fees beyond S3 storage
Failure Modes & Recovery
What if a Cloud Streamer pod crashes
Pod restart — no data loss, stream resumes from last checkpoint:
- During pod crash: Active S3 stream pauses until pod restarts
- Data loss: NO — events remain in S3. Stream resumes from checkpoint on pod restart
- Recovery: Kubernetes restarts pod automatically (typically <30s)
- Cost impact: Brief gap in streaming. Once pod resumes, events stream normally
Monitoring:
- pod_restart_count > 0 — track streamer pod restarts
- Pod status: kubectl get pods -l app=log10x-streamer -n logging
- Stream lag (if applicable): Check metrics for events-behind or S3-scan-lag
What if S3 becomes unavailable
Streamer pauses gracefully — retries on S3 recovery:
- S3 connection lost: Streamer can't read compact events from S3
- Data loss: NO — events stay in S3, streamer just waits
- Recovery: Automatic when S3 recovers. Streamer resumes streaming from checkpoint
- Timeline: Minutes to hours depending on S3 outage duration
Verification:
- Check S3 bucket status: aws s3 ls s3://your-bucket/ --region your-region
- Check IAM role permissions on streamer pod
- Check streamer logs for S3 auth/connection errors
What if your destination log analyzer is unavailable
Streamer buffers locally and retries -- events are preserved up to the buffer limit:
- Analyzer down (Splunk, Elasticsearch, Datadog, etc.): Streamer can't ship events
- Behavior: Local queue fills. Streamer backs off and retries (exponential backoff)
- Data loss: None within the buffer limit -- events queue in pod memory and retry until the analyzer recovers
- Buffer limit: Default 1GB in-memory queue. If exceeded, events are dropped with warnings in the logs; because the source events remain in S3, re-running the query recovers them
- Recovery: Automatic when analyzer recovers. Queued events flush immediately
Prevention:
- Monitor analyzer health from your alerting tool
- Increase local buffer size if you expect extended analyzer downtime: streamer.bufferSize: 2Gi
- Add failover analyzer endpoint if available
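To size the buffer for your environment, a back-of-envelope estimate helps (the event size and streaming rate below are assumptions -- measure your own):

```shell
awk 'BEGIN {
  buffer_bytes = 1 * 1024 * 1024 * 1024   # default 1 GiB in-memory queue
  event_bytes  = 500                      # assumed average event size
  events_per_s = 10000                    # assumed streaming rate
  seconds = buffer_bytes / event_bytes / events_per_s
  printf "buffer absorbs ~%.0f seconds (~%.1f min) of analyzer downtime\n", seconds, seconds / 60
}'
```

If you expect analyzer outages longer than this window, raise streamer.bufferSize accordingly.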
Data Portability & Format Independence
Is the compact format proprietary or can I decode independently
The format is NOT proprietary. You can decode independently using open-source tools.
The Log10x compact format is fully decodable without relying on Log10x infrastructure. Multiple open-source decoders are available:
1. Splunk App (Apache 2.0)
- 10x for Splunk -- an open-source app that automatically expands compact events
- Maintained in the open by the community, which demonstrates the format is not proprietary
- Works as a search-time macro: all your existing SPL queries work unchanged
2. Java Decoder (Apache 2.0)
- Log10x Decoder for Java — standalone library and CLI
- Published on Maven Central
- Use as a library: import com.log10x.decode.SingleEventDecoder
- Use as CLI tool: log10x-decode -t templates.json -f encoded.log -o decoded.log
3. Protocol Documentation
- Full encoding/decoding specification in the documentation
- Open format -- no secrets, no proprietary algorithms
What this means for your data:
- Your data is yours: All logs remain in your S3 bucket in a decodable format
- No vendor lock-in: You can decode independently at any time
- Exit strategy: If Log10x goes away, your data is still accessible
- Compliance: Meets data portability and ownership requirements
What's the exit strategy if I want to stop using Log10x
Complete data independence — no lock-in:
Step 1: Export all data from S3
# Query Storage Streamer to pull all archived events
curl -X POST https://streamer.log10x.com/query \
-d '{
"from": "1970",
"to": "now()",
"search": "*",
"elasticsearch_endpoint": "https://your-es.example.com"
}'
# Or stream to S3 output directly
Step 2: Decode using open-source decoder
# Download decoder CLI
wget https://github.com/log-10x/log10x-decoder-java/releases/download/v0.9.0/log10x-decoder-cli-0.9.0-all.jar
# Decode your S3 files locally
java -jar log10x-decoder-cli-0.9.0-all.jar -t templates.json -f encoded.log -o decoded.log
Step 3: Export to open format
# Your decoded logs are now in original JSON/text format
# Export to any destination: SIEM, data lake, archive storage, etc.
aws s3 cp decoded.log s3://your-backup-bucket/
Result: Full data portability in open formats. No data loss, no proprietary dependencies.
What format are the compact events stored in
Human-readable, standardized format:
Each encoded event stores a template hash plus the event's variable values; the full original text is reconstructed from the template at decode time.
Template mapping (stored separately):
{
"templateHash": "abc123def",
"template": "$(epoch) INFO [main] Processing request for user $ with transaction id $"
}
When decoded: Original event is reconstructed:
2024-01-15T10:30:45.123Z INFO [main] Processing request for user john_doe with transaction id TX-789012
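The substitution itself is simple enough to sketch. The toy script below expands the template from the example above -- it illustrates the concept only and is not the actual decoder (use the open-source Java CLI for real data):

```shell
# Toy template expansion: fill $(epoch) and the $ placeholders in order.
awk 'BEGIN {
  template = "$(epoch) INFO [main] Processing request for user $ with transaction id $"
  n = split("2024-01-15T10:30:45.123Z john_doe TX-789012", vars, " ")

  out = template
  sub(/\$\(epoch\)/, vars[1], out)                 # restore the timestamp
  for (i = 2; i <= n; i++) sub(/\$/, vars[i], out) # restore each variable in order
  print out
}'
```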
Storage: Templates are stored in tenx_dml index (Splunk) or separate index (other SIEMs) as searchable events. Format is standard JSON — nothing proprietary.
How do I ensure my archived data remains accessible long-term
Best practices for data longevity:
1. Keep templates alongside encoded events
- Template index retention should match or exceed your S3 archive retention
- Templates are the "key" to decoding -- losing them makes archived data unreadable
- Typical: 7-year retention for compliance archives, with templates stored in the same S3 bucket
2. Export decoded data periodically
- Schedule monthly exports of critical data using the Java decoder
- Store decoded data in open formats (JSON, CSV, Parquet) in your S3 bucket
- Back up to a separate cloud provider or on-prem storage for disaster recovery
3. Document the encoding scheme
- Keep the encoding specification and decoder tools with your archived data
- GitHub links to the open-source decoders are stable and version-tagged
- Consider archiving decoder binaries alongside your logs
4. Test decoding annually
- Pull a sample of encoded events from S3
- Decode using the Java decoder
- Verify the decoded output matches your compliance requirements
- This validates both the format and your recovery process
Long-term guarantee: Encoded data is independently decodable via open-source tools. Format specification is publicly documented. No dependency on Log10x service to access your own data.