FAQ
The Storage Streamer stores logs in object storage (S3, Azure Blobs) and streams selected events to your SIEM (Splunk, Datadog, Elastic) on demand.
Overview
What is Storage Streamer and how does it work
Storage Streamer stores logs in S3 at $0.023/GB/month and indexes them at ingest time.
When you query, the system scans the index to find which files contain matching data. Only those files are streamed to your SIEM.
Ingestion costs apply only to what you query -- typically 5-30% of total volume -- not all your logs.
How does Storage Streamer reduce costs
Store 100% of logs in S3 at a fraction of analyzer costs. Pay your SIEM license only on the data you actually query -- typically 5-30% of total volume.
Typical cost reduction is 70-80%, depending on query patterns. See the pricing page for details.
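As a back-of-envelope check on that range (all figures below are illustrative assumptions, not quoted prices -- substitute your own volumes and license rate):

```shell
# Assumed inputs: 30,000 GB/month of logs, 20% of volume queried back,
# $2.00/GB SIEM license rate, $0.023/GB/month S3 Standard storage.
awk 'BEGIN {
  monthly_gb  = 30000
  queried_pct = 0.20
  siem_rate   = 2.00
  s3_rate     = 0.023

  before = monthly_gb * siem_rate                                       # ingest everything
  after  = monthly_gb * queried_pct * siem_rate + monthly_gb * s3_rate  # ingest 20%, store all
  printf "monthly cost before: $%.0f\n", before
  printf "monthly cost after:  $%.0f (%.0f%% reduction)\n", after, (1 - after / before) * 100
}'
```

At a 20% query rate the math lands at roughly 79%, inside the 70-80% range above; heavier query patterns pull the savings down.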
Querying & Performance
How do engineers search S3 data during an incident
Queries are initiated via REST API -- from a script, runbook, or CronJob. Results stream back into your existing analyzer.
- Send a query with time range and search expression via POST /streamer/query
- Bloom filter index identifies matching S3 files (<1 second)
- Matching events stream through Fluent Bit to your analyzer (Splunk HEC, Elasticsearch Bulk API, Datadog, CloudWatch)
- Events appear in Kibana / Splunk Search / Datadog Logs with original timestamps -- alongside your live data
Example -- find all payment errors in the last 6 hours:
curl -X POST http://streamer:8080/streamer/query \
-d '{"from":"now(\"-6h\")","to":"now()",
"search":"level == \"ERROR\" && message.includes(\"payment\")"}'
No separate UI to learn. Results are standard indexed events in your existing tool -- search, filter, and dashboard them the same way you always do. Events are permanently ingested; your analyzer's standard retention policy applies.
For recurring workflows (dashboard population, compliance scans), schedule queries via Kubernetes CronJob. See the query reference for search syntax and filtering options.
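For example, a CronJob wrapping the query API might look like the sketch below (the name, schedule, and image are placeholders; point the URL at your own streamer service):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: payment-error-scan      # placeholder name
spec:
  schedule: "0 * * * *"         # hourly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: query
              image: curlimages/curl:latest
              args:
                - -X
                - POST
                - http://streamer:8080/streamer/query
                - -d
                - '{"from":"now(\"-1h\")","to":"now()","search":"level == \"ERROR\" && message.includes(\"payment\")"}'
```

The pod runs curl once per hour and exits; results land in your analyzer the same way as any ad-hoc query.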
How do I query logs in object storage
Search by time range, source, and keywords.
- Storage Streamer scans the search index to identify which files contain matching data -- without reading the files themselves.
- Only matching files are streamed to your SIEM.
- Full-text search and analysis happen in your existing tool.
Retrieval times depend on result set size (see below).
How fast is data retrieval
Index lookups identify matching files in under 1 second. Fetching and streaming events depends on result set size and parallel worker configuration:
Baseline performance:
- ~10K events: 10-30 seconds with default parallel scan/stream configuration
- Index filters: Identify matching files in <1 second (Bloom filter accuracy is configurable, default ~1% false positive rate)
- Network limits: S3 API throughput and network bandwidth are your practical limits at scale
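The ~1% false-positive rate has a concrete meaning for scan volume. A quick sketch with made-up file counts:

```shell
# With a 1% false-positive rate, a query whose true matches live in 100 files
# out of 100,000 indexed files fetches roughly:
awk 'BEGIN {
  total_files  = 100000
  true_matches = 100
  fpr          = 0.01
  fetched = true_matches + (total_files - true_matches) * fpr
  printf "files fetched: ~%d of %d (%.2f%% of the archive)\n", fetched, total_files, fetched / total_files * 100
}'
```

Lowering the false-positive rate shrinks the wasted fetches at the cost of slightly larger index objects.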
How parallelism affects performance:
Storage Streamer uses parallel scan and stream workers to fetch and parse events concurrently. Configure via:
- queryScanFunctionParallelTimeslice -- max time range per scan worker (e.g., 1m = each worker scans 1 minute of index)
- queryScanFunctionParallelMaxInstances -- max number of parallel scan workers (default 1000)
- queryStreamFunctionParallelObjects -- max byte ranges per stream worker (default 50)
Example scaling:
- 100K events over a 10-minute time range with 1-minute timeslice = ~10 parallel scan workers executing simultaneously
- Total query time approaches the single-worker baseline (~10-30 sec) rather than scaling linearly with result set size
- Actual time depends on: file size distribution, S3 API rate limits, network bandwidth, and log parsing overhead
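The worker-count arithmetic above reduces to a simple division, capped by the instance limit:

```shell
awk 'BEGIN {
  range_min     = 10    # query time range in minutes
  timeslice_min = 1     # queryScanFunctionParallelTimeslice
  max_workers   = 1000  # queryScanFunctionParallelMaxInstances

  workers = range_min / timeslice_min
  if (workers > max_workers) workers = max_workers   # cap at the configured maximum
  printf "parallel scan workers: %d\n", workers
}'
```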
Real-world incident response:
- Incident: Payment processing errors spike
- Query: level == "ERROR" && service == "payments" for last 20 minutes
- Result: ~5K matching events (typically identified in <1-2 seconds by filters)
- Stream time: 10-20 seconds (depends on configured parallelism)
- Time to dashboard: <5 seconds (events appear in Kibana/Splunk after streaming completes)
- Total to triage: <30 seconds
Query optimization tips:
- Use time range filters to reduce scan volume
- Add service/host filters to narrow scope
- For massive result sets (>10M), split into smaller time windows
- Use query limits to prevent runaway queries
When to use Storage Streamer:
- Investigation, compliance searches, historical analysis (seconds to minutes of latency acceptable)
- Not suitable for: sub-second alerting -- keep critical log types streaming to your primary SIEM instead
Storage tiers: S3 Standard provides instant access. S3 Glacier requires restore time.
What do compact events look like? See real-world before/after examples showing how events are optimized before archival to S3.
What cloud storage services are supported
Currently supported:
- AWS S3 -- primary support, most customers deploy here
- Azure Blob Storage
Works with your existing bucket structure and log formats: JSON, plain text, and gzipped files. Common storage tiers: S3 Standard ($0.023/GB/month) or S3 Intelligent-Tiering for older data.
Google Cloud Storage is on the roadmap.
Which SIEMs can receive streamed data
Native integrations:
- Splunk (HEC)
- Elastic / OpenSearch
- Datadog
- AWS CloudWatch
Generic: Any HTTP endpoint, syslog, or TCP destination. Logs stream with original timestamps preserved.
For Splunk users, the optional Edge Optimizer compacts events at the edge for an additional 50% reduction in ingestion volume. Events are expanded transparently via the 10x for Splunk app at search time.
Architecture & Indexing
How are S3 Bloom filter indexes built and what compute resources do they use
Where indexing happens: Index workers running in EKS pods build Bloom filter indexes as files upload to S3.
Workflow:
- File uploads to S3 — your forwarder (Fluent Bit, Filebeat, Logstash) writes logs to S3 bucket
- S3 sends notification to SQS queue — S3 event notification triggers immediately (configured during deployment)
- Index worker pods consume from queue — streamer deployment's "index" role pods pull work from SQS
- Worker reads file from S3 and builds index — parses events, extracts template hashes and variable values
- Bloom filters written to S3 index bucket — lightweight index objects (typically <1KB per filter) stored alongside original data
- Query workers use indexes to skip files — when you run a query, index scans filters in <1 second to find matching files
Compute resources for indexing:
Indexing is CPU and memory intensive during file parsing. Default EKS pod resources:
- 1 CPU and 2GB memory per pod (see deployment guide)
- Autoscaling: 2-10 replicas depending on queue depth (default 2 min, scales to 10 if backlog grows)
- Throughput: one pod handles ~10-50 GB/day depending on event size and CPU availability
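A rough pod-count check using the throughput range above (the per-pod figure is an assumption -- benchmark your own workload):

```shell
awk 'BEGIN {
  daily_gb   = 300   # assumed daily log volume (GB)
  per_pod_gb = 30    # assumed per-pod throughput, inside the 10-50 GB/day range
  pods = daily_gb / per_pod_gb
  if (pods != int(pods)) pods = int(pods) + 1   # round up to whole pods
  printf "index worker pods needed: %d\n", pods
}'
```

At ~300 GB/day this already reaches the default autoscaling ceiling of 10 replicas, which is roughly where raising maxReplicas starts to matter.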
When indexing runs:
- Asynchronous — triggered immediately by S3 event notification, runs in parallel with queries
- Batch processing — multiple index workers process files concurrently from the SQS queue
- No re-indexing — indexes are built once at ingest time, never recomputed for queries
Cost implications:
Index building cost is baked into the EKS pod resource costs -- no per-GB indexing fee. You pay for:
- EKS pod compute (CPU + memory) running the index workers
- S3 storage for index objects (~1-5% overhead vs. original data size)
- SQS queue operations (~$0.40 per million messages)
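Plugging illustrative numbers into those line items (the overhead fraction and object count are assumptions drawn from the ranges above):

```shell
awk 'BEGIN {
  data_gb  = 10000              # archived data, GB
  overhead = 0.03               # index overhead, middle of the 1-5% range
  s3_rate  = 0.023              # $/GB/month, S3 Standard
  files    = 5000000            # S3 objects indexed per month (one SQS message each)
  sqs_rate = 0.40 / 1000000     # $ per message

  printf "index storage:  $%.2f/month\n", data_gb * overhead * s3_rate
  printf "SQS operations: $%.2f/month\n", files * sqs_rate
}'
```

Even at 10 TB archived, the indexing side costs stay in single-digit dollars per month; pod compute dominates.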
Optimization:
- All-in-one deployment: Single pod cluster handles index, query, and stream roles (simpler, suitable for <100 GB/day)
- Separate clusters: Dedicated index/query/stream pods allow independent scaling (recommended for >500 GB/day)
- See deployment topologies for sizing guidance
What happens if index building falls behind
SQS queue buffers work — no events lost:
- Files upload faster than indexing: the SQS queue grows and buffers the backlog (the queue comfortably holds thousands of pending messages)
- Indexing catches up: Additional index worker pods scale up automatically (via Kubernetes HPA)
- No data loss: Files remain in S3 unindexed, but queries still work (they just read unindexed files)
- Query performance: Unindexed files require full scan — slower than indexed queries but still functional
Monitoring index backlog:
# Check SQS queue depth (number of pending index jobs)
aws sqs get-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/ACCOUNT/index-queue \
--attribute-names ApproximateNumberOfMessages
# Check index worker pod status
kubectl get pods -n log10x-streamer -l role=index
# Check autoscaling status
kubectl get hpa -n log10x-streamer
Scaling policy:
By default, index workers scale to 10 replicas if queue depth exceeds the threshold. For high-volume environments (>1 TB/day), raise maxReplicas in your Helm values.
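For illustration, an override might look like the fragment below -- the key path is an assumption, so check your chart's values.yaml for the actual structure:

```yaml
# values-override.yaml (hypothetical key layout)
index:
  autoscaling:
    minReplicas: 2
    maxReplicas: 30   # raised from the default 10 for >1 TB/day
```

Apply it with helm upgrade -f values-override.yaml against your release.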
Use Cases
What are the main use cases?
Incident investigation, scheduled dashboard population, compliance and audit, and metric aggregation — each with concrete savings examples per SIEM vendor. See the full use cases section for query workflows, retrieval times, and cost breakdowns.
Query Limits
How do I prevent accidental over-ingestion when streaming from S3
Storage Streamer enforces automatic cost guardrails on every query to prevent expensive accidents:
Default limits per query:
- Processing time: 1 minute max execution -- query terminates automatically
- Result size: 10 MB max bytes returned -- query stops once reached
Examples:
Scenario 1: Quick incident investigation (safe defaults)
curl -X POST http://streamer:8080/streamer/query \
-d '{"from":"now(\"-1h\")","to":"now()","search":"level == \"ERROR\""}'
# Stops automatically after 1 minute or 10 MB, whichever comes first
# Typical cost: $0.10-$0.50 depending on volume
Scenario 2: Large historical pull (custom limits)
# Configure via Kubernetes values or query config to increase limits if needed:
# queryLimitProcessingTime: 5m # Increase to 5 minutes
# queryLimitResultSize: 500MB # Increase to 500 MB
# Only do this when you understand the cost implications
Safety mechanisms:
- Queries abort gracefully when limits are hit -- ingestion is capped, never runaway
- Default 10 MB limit at ~$0.025/GB ingestion cost = ~$0.00025 per query max
- For 1 TB daily ingestion, even 100 queries add only ~$0.025 of extra cost
- Override limits explicitly in config -- no silent surprises
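The worst-case arithmetic is easy to verify yourself (the $0.025/GB figure is the assumed ingestion rate from above):

```shell
awk 'BEGIN {
  limit_mb    = 10      # default per-query result size cap
  ingest_rate = 0.025   # assumed $/GB ingestion cost

  per_query = limit_mb / 1000 * ingest_rate
  printf "worst-case cost per query: $%.5f\n", per_query
  printf "100 runaway queries:       $%.3f\n", per_query * 100
}'
```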
Monitoring:
- Check query execution logs for "limit exceeded" messages
- Monitor S3 scan logs to see how many files matched your query
- Set up Kubernetes alerts on the streamer pod for failed queries
What happens if a query exceeds its limits
Processing time exceeded (1 minute default):
- Query terminates gracefully
- Partial results already streamed to the analyzer remain ingested
- No rollback -- the analyzer retains what was sent before the timeout
- Recommendation: adjust the query to use a narrower time range or add more filters
Result size exceeded (10 MB default):
- Query stops reading more files
- All events read so far are streamed to the analyzer
- No cost surprise -- ingestion is capped at the expected byte count
- Recommendation: add search filters to narrow scope (e.g., higher log level, specific service)
Both are safe failures — worst case you ingest partial results, not the entire historical dataset. Adjust limits per use case: tight for incident response, looser for scheduled compliance pulls.
Comparisons
How does Storage Streamer compare to Datadog Archives
Datadog offers three tiers for archived data, each with significant trade-offs:
| | Archive Search | Flex Logs | Rehydration | Storage Streamer |
|---|---|---|---|---|
| Method | Full scan of gzipped JSON -- no indexes | Columnar indexes (Husky) | Batch re-index into hot tier | Bloom filter indexes on your S3 |
| Cost | $0.10/GB scanned per query | Datadog-managed storage fees | Full indexing cost on rehydrated volume | S3 storage only -- no per-query fees |
| Speed | Slow -- reads every byte in range | Fast | Hours to complete | Seconds -- skips 99%+ of files |
| Analytics | None -- flat event list, 100K cap, 24hr expiry | Full Datadog analytics | Full (once rehydrated) | Full -- in any SIEM you choose |
| Data ownership | Your S3 bucket | Datadog-managed, proprietary format -- no export | Your S3 → Datadog hot tier | Your S3 bucket, open formats |
How does Storage Streamer compare to Splunk Federated Search for S3
Splunk's Federated Search for Amazon S3 (GA on Splunk Cloud) queries S3 data via AWS Glue. It's scan-based -- no indexes.
Key limitations:
- ~100 seconds per TB scanned, 10 TB max per search, 100K event default cap
- Splunk Cloud on AWS only -- no Enterprise, no on-prem, no FedRAMP
- Requires AWS Glue Data Catalog (additional AWS cost)
- Licensed via Data Search Units (DSUs) -- Splunk warns high-frequency use may cost more than native ingestion
Storage Streamer difference:
- Bloom filter indexes skip 99%+ of files -- no full-scan pricing or per-TB latency
- No 10 TB cap
- Works with Splunk Cloud and Enterprise (on-prem)
- Results stream as indexed events with original timestamps -- full SPL analytics
- Optional Edge Optimizer adds lossless 50% volume reduction
How does Storage Streamer compare to Cribl
Cribl's suite has three paid layers:
- Stream -- routes and filters data before your SIEM
- Lake -- stores data (BYOS option keeps it in your S3, but charges $0.02/GB/month management fee)
- Search -- queries Lake data using query-time compute
Storage Streamer difference:
- Logs stay in your S3 bucket with no additional management fees
- Search via pre-computed indexes, not query-time compute
- Results stream to your existing SIEM (Splunk, Elastic, Datadog)
- Uses your existing query tools. No per-GB fees beyond S3 storage
Failure Modes & Recovery
What if a Cloud Streamer pod crashes
Pod restart — no data loss, stream resumes from last checkpoint:
- During pod crash: Active S3 stream pauses until pod restarts
- Data loss: NO — events remain in S3. Stream resumes from checkpoint on pod restart
- Recovery: Kubernetes restarts pod automatically (typically <30s)
- Cost impact: Brief gap in streaming. Once pod resumes, events stream normally
Monitoring:
- pod_restart_count > 0 — track streamer pod restarts
- Pod status: kubectl get pods -l app=log10x-streamer -n logging
- Stream lag (if applicable): Check metrics for events-behind or S3-scan-lag
What if S3 becomes unavailable
Streamer pauses gracefully — retries on S3 recovery:
- S3 connection lost: Streamer can't read compact events from S3
- Data loss: NO — events stay in S3, streamer just waits
- Recovery: Automatic when S3 recovers. Streamer resumes streaming from checkpoint
- Timeline: Minutes to hours depending on S3 outage duration
Verification:
- Check S3 bucket status: aws s3 ls s3://your-bucket/ --region your-region
- Check IAM role permissions on streamer pod
- Check streamer logs for S3 auth/connection errors
What if your destination log analyzer is unavailable
Streamer buffers locally and retries -- events are preserved up to the buffer limit:
- Analyzer down (Splunk, Elasticsearch, Datadog, etc.): Streamer can't ship events
- Behavior: Local queue fills. Streamer backs off and retries (exponential backoff)
- Data loss: None within the buffer limit -- events queue in pod memory and retry until the analyzer recovers
- Buffer limit: Default 1GB in-memory queue. If exceeded, events are dropped with warnings in the logs; because the source events remain in S3, re-running the query recovers them
- Recovery: Automatic when analyzer recovers. Queued events flush immediately
Prevention:
- Monitor analyzer health from your alerting tool
- Increase local buffer size if you expect extended analyzer downtime: streamer.bufferSize: 2Gi
- Add failover analyzer endpoint if available
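To size the buffer for your environment, a back-of-envelope estimate helps (the event size and streaming rate below are assumptions -- measure your own):

```shell
awk 'BEGIN {
  buffer_bytes = 1 * 1024 * 1024 * 1024   # default 1 GiB in-memory queue
  event_bytes  = 500                      # assumed average event size
  events_per_s = 10000                    # assumed streaming rate
  seconds = buffer_bytes / event_bytes / events_per_s
  printf "buffer absorbs ~%.0f seconds (~%.1f min) of analyzer downtime\n", seconds, seconds / 60
}'
```

If you expect analyzer outages longer than this window, raise streamer.bufferSize accordingly.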
Data Portability & Format Independence
Is the compact format proprietary or can I decode independently
The format is NOT proprietary. You can decode independently using open-source tools.
The Log10x compact format is fully decodable without relying on Log10x infrastructure. Multiple open-source decoders are available:
1. Splunk App (Apache 2.0)
- 10x for Splunk -- an open-source app that automatically expands compact events
- Maintained in the open by the community, which demonstrates the format is not proprietary
- Works as a search-time macro: all your existing SPL queries work unchanged
2. Java Decoder (Apache 2.0)
- Log10x Decoder for Java — standalone library and CLI
- Published on Maven Central
- Use as a library: import com.log10x.decode.SingleEventDecoder
- Use as CLI tool: log10x-decode -t templates.json -f encoded.log -o decoded.log
3. Protocol Documentation
- Full encoding/decoding specification in the documentation
- Open format -- no secrets, no proprietary algorithms
What this means for your data:
- Your data is yours: All logs remain in your S3 bucket in a decodable format
- No vendor lock-in: You can decode independently at any time
- Exit strategy: If Log10x goes away, your data is still accessible
- Compliance: Meets data portability and ownership requirements
What's the exit strategy if I want to stop using Log10x
Complete data independence — no lock-in:
Step 1: Export all data from S3
# Query Storage Streamer to pull all archived events
curl -X POST https://streamer.log10x.com/query \
-d '{
"from": "1970",
"to": "now()",
"search": "*",
"elasticsearch_endpoint": "https://your-es.example.com"
}'
# Or stream to S3 output directly
Step 2: Decode using open-source decoder
# Download decoder CLI
wget https://github.com/log-10x/log10x-decoder-java/releases/download/v0.9.0/log10x-decoder-cli-0.9.0-all.jar
# Decode your S3 files locally
java -jar log10x-decoder-cli-0.9.0-all.jar -t templates.json -f encoded.log -o decoded.log
Step 3: Export to open format
# Your decoded logs are now in original JSON/text format
# Export to any destination: SIEM, data lake, archive storage, etc.
aws s3 cp decoded.log s3://your-backup-bucket/
Result: Full data portability in open formats. No data loss, no proprietary dependencies.
What format are the compact events stored in
Human-readable, standardized format:
Each encoded event stores a template hash plus the event's variable values; the full original text is reconstructed from the template at decode time.
Template mapping (stored separately):
{
"templateHash": "abc123def",
"template": "$(epoch) INFO [main] Processing request for user $ with transaction id $"
}
When decoded: Original event is reconstructed:
2024-01-15T10:30:45.123Z INFO [main] Processing request for user john_doe with transaction id TX-789012
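The substitution itself is simple enough to sketch. The toy script below expands the template from the example above -- it illustrates the concept only and is not the actual decoder (use the open-source Java CLI for real data):

```shell
# Toy template expansion: fill $(epoch) and the $ placeholders in order.
awk 'BEGIN {
  template = "$(epoch) INFO [main] Processing request for user $ with transaction id $"
  n = split("2024-01-15T10:30:45.123Z john_doe TX-789012", vars, " ")

  out = template
  sub(/\$\(epoch\)/, vars[1], out)                 # restore the timestamp
  for (i = 2; i <= n; i++) sub(/\$/, vars[i], out) # restore each variable in order
  print out
}'
```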
Storage: Templates are stored in tenx_dml index (Splunk) or separate index (other SIEMs) as searchable events. Format is standard JSON — nothing proprietary.
How do I ensure my archived data remains accessible long-term
Best practices for data longevity:
1. Keep templates alongside encoded events
- Template index retention should match or exceed your S3 archive retention
- Templates are the "key" to decoding -- losing them makes archived data unreadable
- Typical: 7-year retention for compliance archives, with templates stored in the same S3 bucket
2. Export decoded data periodically
- Schedule monthly exports of critical data using the Java decoder
- Store decoded data in open formats (JSON, CSV, Parquet) in your S3 bucket
- Back up to a separate cloud provider or on-prem storage for disaster recovery
3. Document the encoding scheme
- Keep the encoding specification and decoder tools with your archived data
- GitHub links to the open-source decoders are stable and version-tagged
- Consider archiving decoder binaries alongside your logs
4. Test decoding annually
- Pull a sample of encoded events from S3
- Decode using the Java decoder
- Verify the decoded output matches your compliance requirements
- This validates both the format and your recovery process
Long-term guarantee: Encoded data is independently decodable via open-source tools. Format specification is publicly documented. No dependency on Log10x service to access your own data.