FAQ

The storage streamer ships selected events from low-cost object storage (S3, Azure Blob Storage) to log platforms (Splunk, Datadog, Elastic), or as aggregated metrics to time-series databases (Prometheus, Datadog), periodically or on demand.

Overview

What is Storage Streamer and how does it work

Storage Streamer stores logs in S3, Azure Blob Storage, or any S3-compatible object storage (MinIO, Ceph) and indexes them at ingest time.

When you query, the system scans the index to find which files contain matching data. Only those files are streamed to your SIEM.

Ingestion costs apply only to what you query -- typically 5-30% of total volume -- not all your logs.
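The lookup-then-stream flow can be sketched as follows. This is an illustrative toy (the file names and per-file index structure are invented for the example, not the actual implementation):

```python
# Illustrative sketch: a per-file index records which field values appear in
# that file, so a query streams only the files whose index matches.
index = {
    "logs/2024/01/01/app-00.json.gz": {"INFO", "WARN"},
    "logs/2024/01/01/app-01.json.gz": {"INFO", "ERROR"},
    "logs/2024/01/01/app-02.json.gz": {"DEBUG"},
}

def files_to_stream(level: str) -> list[str]:
    """Index scan: select only the files that can contain matching events."""
    return sorted(f for f, levels in index.items() if level in levels)

matches = files_to_stream("ERROR")
print(matches)                                  # only 1 of 3 files selected
print(f"{len(matches)}/{len(index)} files streamed to the SIEM")
```

Only the matching files are fetched from S3 and ingested, which is why ingestion cost tracks query volume rather than total log volume.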

How does Storage Streamer reduce costs

Store 100% of logs in S3 at a fraction of analyzer costs. Pay your SIEM license only on the data you actually query -- typically 5-30% of total volume.

Typical cost reduction is 70-80%, depending on query patterns. See the pricing page for details.
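As a rough illustration of the arithmetic (the SIEM ingestion price below is a placeholder; see the pricing page for real figures):

```python
# Hypothetical month: 10 TB of logs, of which 20% is ever queried.
total_gb = 10_000
queried_fraction = 0.20
s3_per_gb = 0.023          # S3 Standard storage, $/GB/month
siem_per_gb = 2.50         # illustrative SIEM ingestion price, $/GB

ingest_everything = total_gb * siem_per_gb
store_and_stream = total_gb * s3_per_gb + total_gb * queried_fraction * siem_per_gb

savings = 1 - store_and_stream / ingest_everything
print(f"${ingest_everything:,.0f} vs ${store_and_stream:,.0f} "
      f"-> {savings:.0%} reduction")
```

With these example prices the reduction lands at 79%, consistent with the typical 70-80% range; the exact figure depends on your SIEM's price and how much data you actually query.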

What are the main use cases

Incident investigation, scheduled dashboard population, compliance and audit, and metric aggregation — each with concrete savings examples per SIEM vendor. See the full use cases section for query workflows, retrieval times, and cost breakdowns.

Can I pair Storage Streamer with the Regulator

Yes. Configure your forwarder to duplicate events — one copy goes to S3 (all events archived), the other goes through the Regulator (filtered or compact events to your analyzer). Storage Streamer indexes the S3 archive and streams selected events back on demand.

This gives you full retention at S3 cost ($0.023/GB/month) plus regulated analyzer ingestion. Events that the Regulator filters out remain in S3 and are queryable for incident investigation, compliance, and auditing.

Can I pair Storage Streamer with the Regulator's Compact mode

Yes. The Regulator in Compact mode losslessly compacts events at the source, reducing S3 storage costs by 50-65%. Storage Streamer expands compact events automatically when streaming them to your log platform — Splunk, Elasticsearch, Datadog, and CloudWatch all receive full-fidelity events.
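The idea behind lossless compaction can be illustrated with a toy sketch (this is not the actual Log10x encoding, just the template-plus-variables principle):

```python
# Toy sketch of template-based compaction: a repeated message structure is
# stored once as a template; each event keeps only its variable values.
template = "user {} logged in from {}"        # stored once, referenced by id
compact_events = [("alice", "10.0.0.7"), ("bob", "10.0.0.9")]

# Expansion at stream time reconstructs the original events exactly,
# which is what makes the compaction lossless.
expanded = [template.format(*values) for values in compact_events]
print(expanded[0])
assert expanded == [
    "user alice logged in from 10.0.0.7",
    "user bob logged in from 10.0.0.9",
]
```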

Querying and Performance

How do engineers search S3 data

Three ways to query, all documented on the Query page:

  1. Web GUI — visual query builder with search expression editor, time range selector, and sample queries.
  2. CLI — submit queries from the terminal.
  3. REST API — POST /streamer/query via curl or Postman. Use from scripts, runbooks, or CronJobs.

All three methods produce the same result: the streamer identifies matching events and streams them to your analyzer (Splunk, Elasticsearch, Datadog, CloudWatch) or time-series database (Prometheus, Datadog, Splunk). Events appear with original timestamps alongside your live data.
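The REST call, for instance, can be scripted. The payload fields below mirror the curl example later in this FAQ; the host and port are placeholders for your deployment:

```python
import json
from urllib import request

# Query payload matching the curl example elsewhere in this FAQ.
payload = {"from": 'now("-1h")', "to": "now()", "search": 'level == "ERROR"'}

req = request.Request(
    "http://streamer:8080/streamer/query",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.method, req.full_url)
# request.urlopen(req) submits the query; matching events then stream to
# your analyzer with their original timestamps.
```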

How fast is data retrieval

Index lookups identify matching files in under 1 second. Fetching and streaming events typically takes 10–30 seconds for ~10K events with default parallelism. At 100K events, parallel scan/stream workers (up to 1,000 instances) keep total query time near the same 10–30 second baseline rather than growing linearly with result set size. At larger scales, S3 API throughput and network bandwidth are the practical limits.

Related documentation topics:

  • Architecture: parallel scan and stream workers
  • Performance: baseline numbers, parallelism tuning, and optimization tips
  • Scan config: timeslice, max instances, and thread settings
  • Stream config: byte ranges per worker, output endpoints
  • Limits: processing time and result size caps

S3 Standard provides instant access. S3 Glacier requires restore time.

What cloud storage services are supported

Currently supported:

  • AWS S3 -- primary support, most customers deploy here
  • Azure Blob Storage
  • Any S3-compatible object storage -- MinIO, Ceph (RADOS Gateway), Dell ECS, NetApp StorageGRID, Cloudflare R2, Backblaze B2. The streamer talks to the S3 API as a storage interface; any backend that implements it works by configuring the endpoint URL.

Works with your existing bucket structure and log formats: JSON, plain text, and gzipped files. Common storage tiers: S3 Standard ($0.023/GB/month) or S3 Intelligent-Tiering for older data.

Google Cloud Storage has native S3 compatibility via HMAC keys, so it works today through the S3 API. Native GCS support is on the roadmap.

For air-gapped or on-prem environments, MinIO is the most common choice -- it runs as containers in Kubernetes with no external dependencies.

Can I run Storage Streamer without AWS

Yes. The streamer uses the S3 API as a storage interface -- it does not depend on AWS.

  • Local development: The setup guide uses LocalStack (local S3/SQS emulation) running in Kubernetes. No AWS account needed.
  • Production without AWS: Point the streamer at any S3-compatible endpoint -- MinIO, Ceph, GCS (via HMAC keys), Cloudflare R2. Configure the endpoint URL in your deployment.
  • SQS alternative: MinIO provides built-in bucket notifications to NATS, Kafka, or webhooks as alternatives to SQS for triggering indexing.

Air-gapped deployments typically use MinIO for both the log bucket and the index bucket, running entirely within the Kubernetes cluster.

Which SIEMs can receive streamed data

Native integrations:

  • Splunk (HEC)
  • Elastic / OpenSearch
  • Datadog
  • AWS CloudWatch

Generic: Any HTTP endpoint, syslog, or TCP destination. Logs stream with original timestamps preserved.

For Splunk users, the optional Regulator in Compact mode compacts events at the source for an additional 50% reduction in ingestion volume. Events are expanded transparently via the 10x for Splunk app at search time.

For Elasticsearch and OpenSearch users, the L1ES plugin expands compact events at query time. Kibana searches, dashboards, and alerts work unchanged on the original full-fidelity data.

How does query performance scale with data volume

Horizontally. The query architecture has no central bottleneck — scan and stream workers run as independent parallel pods.

Indexing (ingest time): S3 event notifications trigger index worker pods via SQS. Each worker processes files independently — no shared state, no coordination. Workers scale horizontally via Kubernetes HPA. Doubling the worker count doubles indexing throughput.

Querying (search time): Parallel scan workers list and filter S3 keys concurrently (up to 1,000 parallel instances by default). Each scan worker covers a time slice of the query range. Matching byte ranges are submitted to parallel stream workers that fetch, transform, and output events.

  • 10 min range, 1 min timeslice: ~10 parallel scan workers -- total time ≈ single-worker baseline (10–30s)
  • 1 hour range, 1 min timeslice: ~60 parallel scan workers -- total time stays near baseline, not 6x longer
  • 24 hour range, 1 min timeslice: ~1,000 parallel scan workers (capped) -- S3 API rate and network bandwidth become the limit
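The worker counts above follow directly from range ÷ timeslice, capped at the instance limit. A minimal sketch (function and parameter names are illustrative, not the streamer's config keys):

```python
MAX_SCAN_WORKERS = 1_000   # default instance cap noted above

def scan_workers(range_minutes: int, timeslice_minutes: int = 1) -> int:
    """One scan worker per timeslice of the query range, up to the cap."""
    slices = -(-range_minutes // timeslice_minutes)  # ceiling division
    return min(slices, MAX_SCAN_WORKERS)

print(scan_workers(10))        # 10 min range  -> 10 workers
print(scan_workers(60))        # 1 hour range  -> 60 workers
print(scan_workers(24 * 60))   # 24 hour range -> capped at 1000
```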

No central database — index files are Bloom filters stored alongside log files in S3. The query system reads them directly from S3, with no intermediary database to bottleneck or manage.

Architecture and Indexing

How are S3 indexes built

S3 event notifications trigger index worker pods via SQS. Each worker reads the uploaded file, extracts template hashes and variable values, and writes lightweight Bloom filters to the index bucket, adding less than 1% storage overhead. Indexes are built once at ingest time and never recomputed.
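A Bloom filter answers "possibly present" or "definitely absent", which is what lets queries skip files outright. Here is a minimal sketch of the data structure; the real filter layout, sizing, and hash choices are described in the indexing docs:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per value in an m-bit array."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, value: str):
        # Derive k positions from salted SHA-256 digests of the value.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means definitely absent -> the file can be skipped entirely.
        return all(self.bits >> pos & 1 for pos in self._positions(value))

index = BloomFilter()
for value in ["payment-service", "auth-service"]:
    index.add(value)

print(index.might_contain("auth-service"))    # True: file must be scanned
# Almost certainly False: definite absence lets the query skip this file.
# A rare false positive would just mean scanning one extra file.
print(index.might_contain("search-service"))
```

The filter is tiny relative to the log file it summarizes, which is how the index stays under 1% storage overhead.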

Related documentation topics:

  • Workflow: how files are parsed into TenXObjects and written as Bloom filters
  • TenXTemplate Filters: filter design, volume reduction, and batch retrieval
  • Compute Resources: pod sizing, autoscaling, and throughput
  • Cost: pod, S3, and SQS cost breakdown
  • Scaling: HPA, deployment topologies, and backlog handling
  • Accuracy: Bloom filter accuracy and false positive tuning

What happens if index building falls behind uploads

SQS buffers pending work — no events are lost. Index worker pods scale up automatically via HPA, and unindexed files remain queryable (full scan, slower). See the deployment guide for scaling configuration.

Query Limits

How do I prevent accidental over-ingestion when streaming from S3

Storage Streamer enforces automatic cost guardrails on every query to prevent expensive accidents:

Default limits per query:

  • Processing time: 1 minute max execution — query terminates automatically
  • Result size: 10 MB max bytes returned — query stops once reached

Examples:

Scenario 1: Quick incident investigation (safe defaults)

curl -X POST http://streamer:8080/streamer/query \
  -d '{"from":"now(\"-1h\")","to":"now()","search":"level == \"ERROR\""}'
# Stops automatically after 1 minute or 10 MB, whichever comes first
# Typical cost: $0.10-$0.50 depending on volume

Scenario 2: Large historical pull (custom limits)

# Configure via Kubernetes values or query config to increase limits if needed:
# queryLimitProcessingTime: 5m  # Increase to 5 minutes
# queryLimitResultSize: 500MB   # Increase to 500 MB
# Only do this when you understand the cost implications

Safety mechanisms:

  • Queries stop gracefully when limits hit — ingestion is capped, never unbounded
  • Default 10 MB limit at ~$0.025/GB ingestion cost = ~$0.00025 per query max
  • For 1 TB daily ingestion, even 100 queries hit only $0.025 extra cost
  • Override limits explicitly in config — no silent surprises
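The per-query cost ceiling is simple arithmetic on the result-size cap (the ingestion price is illustrative):

```python
result_cap_mb = 10
ingest_price_per_gb = 0.025   # illustrative analyzer ingestion price, $/GB

# Decimal GB (1000 MB/GB), matching how ingestion is typically priced.
max_cost_per_query = result_cap_mb / 1000 * ingest_price_per_gb
print(f"${max_cost_per_query:.5f} worst case per query")
print(f"${100 * max_cost_per_query:.4f} for 100 limit-hitting queries")
```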

Monitoring:

  • Check query execution logs for "limit exceeded" messages
  • Monitor S3 scan logs to see how many files matched your query
  • Set up Kubernetes alerts on the streamer pod for failed queries

What happens if a query exceeds its limits

Processing time exceeded (1 minute default):

  • Query terminates gracefully
  • Partial results already streamed to the analyzer remain ingested
  • No rollback — the analyzer retains what was sent before timeout
  • Recommendation: use a narrower time range or add more filters

Result size exceeded (10 MB default):

  • Query stops reading more files
  • All events read so far are streamed to the analyzer
  • No cost surprise — you are capped at the expected bytes
  • Recommendation: add search filters to narrow scope (e.g., higher log level, specific service)

Both are safe failures — worst case you ingest partial results, not the entire historical dataset. Adjust limits per use case: tight for incident response, looser for scheduled compliance pulls.

Comparisons

How does Storage Streamer compare to Datadog Archives

Datadog offers three tiers for archived data, each with significant trade-offs:

  • Archive Search: full scan of gzipped JSON with no indexes; $0.10/GB scanned per query; slow — reads every byte in range; no analytics (flat event list, 100K result cap, 24-hour expiry); data stays in your S3 bucket.
  • Flex Logs: columnar indexes (Husky); Datadog-managed storage fees; fast; full Datadog analytics; Datadog-managed storage in a proprietary format — no export.
  • Rehydration: batch re-index into the hot tier; full indexing cost on rehydrated volume; hours to complete; full analytics once rehydrated; data moves from your S3 into the Datadog hot tier.
  • Storage Streamer: Bloom filter indexes on your own S3; S3 storage only — no per-query fees; seconds — skips 99%+ of files; full analytics in any SIEM you choose; your S3 bucket, open formats.

How does Storage Streamer compare to Splunk Federated Search for S3

Splunk's Federated Search for Amazon S3 (GA on Splunk Cloud) queries S3 data via AWS Glue. It's scan-based -- no indexes.

Key limitations:

  • ~100 seconds per TB scanned, 10 TB max per search, 100K event default cap
  • Splunk Cloud on AWS only -- no Enterprise, no on-prem, no FedRAMP
  • Requires AWS Glue Data Catalog (additional AWS cost)
  • Licensed via Data Search Units (DSUs) -- Splunk warns high-frequency use may cost more than native ingestion

Storage Streamer difference:

  • Bloom filter indexes skip 99%+ of files -- no full-scan pricing or per-TB latency
  • No 10 TB cap
  • Works with Splunk Cloud and Enterprise (on-prem)
  • Results stream as indexed events with original timestamps -- full SPL analytics
  • Optional Regulator in Compact mode adds lossless 50% volume reduction

How does Storage Streamer compare to Cribl

Key differences:
  • Logs stored in your S3 bucket with no management fees
  • Queries use pre-computed indexes — not query-time compute — so retrieval takes seconds, not minutes
  • Results stream to your existing SIEM with no per-GB query fees beyond S3 storage

See What makes 10x different for the general comparison.

Failure Modes and Recovery

What if a Cloud Streamer pod crashes

Pod restart — no data loss, stream resumes from last checkpoint:

  • During pod crash: Active S3 stream pauses until pod restarts
  • Data loss: NO — events remain in S3. Stream resumes from checkpoint on pod restart
  • Recovery: Kubernetes restarts pod automatically (typically <30s)
  • Cost impact: Brief gap in streaming. Once pod resumes, events stream normally

Monitoring:

  • pod_restart_count > 0 — track streamer pod restarts
  • Pod status: kubectl get pods -l app=log10x-streamer -n logging
  • Stream lag (if applicable): check metrics for events-behind or S3-scan-lag

What if S3 becomes unavailable

Streamer pauses gracefully — retries on S3 recovery:

  • S3 connection lost: Streamer can't read compact events from S3
  • Data loss: NO — events stay in S3, streamer just waits
  • Recovery: Automatic when S3 recovers. Streamer resumes streaming from checkpoint
  • Timeline: Minutes to hours depending on S3 outage duration

Verification:

  • Check S3 bucket status: aws s3 ls s3://your-bucket/ --region your-region
  • Check IAM role permissions on the streamer pod
  • Check streamer logs for S3 auth/connection errors

What if your destination log analyzer is unavailable

Streamer buffers locally and retries — no event loss:

  • Analyzer down (Splunk, Elasticsearch, Datadog, etc.): Streamer can't ship events
  • Behavior: Local queue fills. Streamer backs off and retries (exponential backoff)
  • Data loss: NO — events queue in pod memory and retry until analyzer recovers
  • Buffer limit: Default 1GB in-memory queue. If exceeded, events are dropped (with warnings in logs)
  • Recovery: Automatic when analyzer recovers. Queued events flush immediately

Prevention:

  • Monitor analyzer health from your alerting tool
  • Increase the local buffer size if you expect extended analyzer downtime: streamer.bufferSize: 2Gi
  • Add a failover analyzer endpoint if available
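Exponential backoff of the kind described above looks like this (the base delay and cap are illustrative, not the streamer's actual settings):

```python
def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 60.0):
    """Delay doubles per failed attempt, capped so retries keep happening."""
    return [min(base_s * 2 ** i, cap_s) for i in range(attempts)]

# Delays grow 1, 2, 4, ... seconds and then hold at the 60 s cap until
# the analyzer recovers and the queue flushes.
print(backoff_delays(8))
```

The cap matters: without it, a long outage would push the retry interval so high that recovery is needlessly delayed once the analyzer comes back.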

Data Portability and Format Independence

Is the compact format proprietary or can I expand independently

The format is NOT proprietary. You can expand independently using open-source tools.

The Log10x compact format is fully expandable without relying on Log10x infrastructure. Multiple open-source expansion tools are available:

1. Splunk App (Apache 2.0)

  • 10x for Splunk — open-source app that automatically expands compact events
  • Works as a search-time macro: all your existing SPL queries work unchanged

2. Java Library and CLI (Apache 2.0)

  • Log10x Decoder for Java — standalone library and CLI, published on Maven Central
  • CLI: log10x-decode -t templates.json -f compact.log -o expanded.log

3. Format Specification

  • Full compact format specification — open, no proprietary algorithms

What this means for your data:

  • Your data is yours: All logs remain in your S3 bucket in an expandable format
  • No vendor lock-in: You can expand independently at any time
  • Exit strategy: If Log10x goes away, your data is still accessible
  • Compliance: Meets data portability and ownership requirements

What's the exit strategy if I want to stop using Log10x

Your logs are in your S3 bucket on your AWS account — Log10x never had access to them. Stop running the streamer pods and your data stays where it is. If events were stored in compact form, they can be expanded independently using open-source tools — see the Q above.

What format are the compact events stored in

An open, human-readable format documented in the compact format reference. See real-world before/after examples showing verbose events and their compact form.