Index
The Index module enables queries against object storage (e.g., AWS S3, Azure) to fetch the blob byte ranges matching a specified app/service name, search term(s) and timestamp range, predictably and at scale.
Workflow
The index module executes when S3 event notifications are sent directly to SQS queues, triggering index workers to process uploaded files.
The module comprises an input stream and an output stream:
- The input stream reads the events from the uploaded file and transforms them into TenXObjects.
- The output stream performs the following actions for each TenXObject:
  - Writes its template to the index container (if it does not already exist).
  - Maps its timestamp to a Bloom filter associated with a rolling time window specified by indexWriteResolution.
  - Appends its templateHash and vars to the Bloom filter's hash set. Once a filter's size exceeds the object storage's key byte length (e.g., 1024 bytes for AWS S3), the stream writes it to the index container and assigns a new filter to the time window.
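The output-stream steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: a plain Python set of truncated hashes stands in for the Bloom filter, the template-writing step is omitted, and `WindowIndexer`, `MAX_FILTER_BYTES`, and `ENTRY_BYTES` are hypothetical names. The sketch rotates a filter just before it would exceed the cap, following the size check shown in the diagram.

```python
import hashlib
from collections import defaultdict

MAX_FILTER_BYTES = 1024  # assumed cap (the text cites 1024 bytes for AWS S3)
ENTRY_BYTES = 8          # hypothetical size of one stored hash


def window_start(epoch_seconds: int, resolution_seconds: int) -> int:
    """Round a timestamp down to the start of its rolling time window."""
    return epoch_seconds - epoch_seconds % resolution_seconds


class WindowIndexer:
    """Stand-in for the output stream: one open 'filter' per time window."""

    def __init__(self, resolution_seconds: int):
        self.resolution = resolution_seconds
        self.open_filters = defaultdict(set)  # window start -> set of hashes
        self.written = []                     # (window start, hashes) flushed to the index

    def add(self, epoch_seconds: int, template_hash: str, variables: list[str]) -> None:
        window = window_start(epoch_seconds, self.resolution)
        for value in [template_hash, *variables]:
            digest = hashlib.sha256(value.encode("utf-8")).digest()[:ENTRY_BYTES]
            current = self.open_filters[window]
            if digest in current:
                continue  # already recorded in the open filter
            # If one more entry would pass the cap, write the filter out
            # and assign a fresh one to the same time window.
            if (len(current) + 1) * ENTRY_BYTES > MAX_FILTER_BYTES:
                self.written.append((window, current))
                current = self.open_filters[window] = set()
            current.add(digest)
```

With a 60-second resolution, events at seconds 30 and 75 land in the windows starting at 0 and 60 respectively, and a window that receives more distinct values than one filter can hold produces several written filters.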
TenXTemplate Filters
TenXTemplate Bloom filters enable parallel traversal of the index container (e.g., an S3 bucket) and identify, with high accuracy, the byte ranges to fetch and scan for matching log/trace events.
Separating low-cardinality symbol values into TenXTemplates and writing only template hashes and high-cardinality variables to Bloom filters reduces their volume by over 75% compared to appending both low and high-cardinality values.
Restricting Bloom filter size to the object storage's key length enables batch retrieval of filters via list operations (e.g., AWS S3: 1000 keys/request, Azure: 5000 keys/request, GCP: 1000 keys/request).
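To make the benefit of batch retrieval concrete, the sketch below estimates how many list requests are needed to retrieve a given number of filters, using the per-request key limits quoted above. The function name, provider labels, and filter counts are illustrative assumptions.

```python
import math

# Keys returned per list request, as quoted in the text above.
KEYS_PER_LIST_REQUEST = {"aws_s3": 1000, "azure": 5000, "gcp": 1000}


def list_requests_needed(filter_count: int, provider: str) -> int:
    """Number of list calls required to page through `filter_count` filters."""
    return math.ceil(filter_count / KEYS_PER_LIST_REQUEST[provider])


# Example: 25,000 filters for a queried time range.
for provider in KEYS_PER_LIST_REQUEST:
    print(provider, list_requests_needed(25_000, provider))
```

Under these assumptions, retrieving 25,000 filters takes 25 list requests on AWS S3 or GCP and 5 on Azure, instead of one request per filter.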
graph LR
A["📥 Log File Upload<br/>(S3/Azure/GCP)"] --> B["Read Events<br/>Stream"]
B --> C["Transform to<br/>TenXObjects"]
C --> D["⚙️ Process<br/>Each Object"]
D --> E["Extract Variables<br/>& Template"]
E --> F["Map to<br/>Time Window"]
F --> G{"Filter Size<br/>< 1024 bytes?"}
G -->|Yes| H["Append to<br/>Current Filter"]
G -->|No| I["Write Filter<br/>to Index"]
H --> J{"More<br/>Events?"}
I --> K["Create New<br/>Filter"]
K --> J
J -->|Yes| D
J -->|No| L["✅ Index Objects<br/>Ready for Query"]
E --> M["Write Template<br/>(if new)"]
M --> L
classDef input fill:#2563eb88,stroke:#1d4ed8,color:#ffffff,stroke-width:2px,rx:8,ry:8
classDef processing fill:#05966988,stroke:#047857,color:#ffffff,stroke-width:2px,rx:8,ry:8
classDef output fill:#dc262688,stroke:#b91c1c,color:#ffffff,stroke-width:2px,rx:8,ry:8
classDef decision fill:#7c3aed88,stroke:#6d28d9,color:#ffffff,stroke-width:2px,rx:8,ry:8
classDef result fill:#ea580c88,stroke:#c2410c,color:#ffffff,stroke-width:2px,rx:8,ry:8
class A,B,C input
class D,E,F processing
class G,J decision
class H,I,K,M output
class L result
Compute Resources
Indexing is CPU and memory intensive during file parsing. Default k8s pod resources:
- 1 CPU and 2 GB memory per pod (see deployment guide)
- Autoscaling: 2–10 replicas depending on queue depth (minimum of 2 by default, scaling up to 10 as the backlog grows)
- Throughput: One pod handles ~10–50 GB/day depending on event size and CPU availability
Indexing runs asynchronously: it is triggered by S3 event notifications and runs in parallel with queries. Multiple index workers process files concurrently from the SQS queue. Indexes are built once at ingest time and never recomputed.
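For a rough capacity check against the defaults above, the following sketch estimates how many replicas a daily volume would need at a given per-pod throughput, clamped to the 2–10 replica range. It is only back-of-the-envelope arithmetic: the autoscaler itself reacts to queue depth, and `estimate_replicas` is a hypothetical helper.

```python
import math


def estimate_replicas(daily_gb: float, gb_per_pod_per_day: float,
                      min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Replicas needed for `daily_gb`, clamped to the autoscaling bounds."""
    needed = math.ceil(daily_gb / gb_per_pod_per_day)
    return max(min_replicas, min(max_replicas, needed))


# 300 GB/day at the pessimistic (10) and optimistic (50) per-pod rates.
print(estimate_replicas(300, 10), estimate_replicas(300, 50))
```

If the unclamped estimate exceeds the maximum replica count, the queue backlog would be expected to grow until the limits are raised.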
Cost
Index building cost is part of the k8s pod resource costs; there is no per-GB indexing fee. You pay for:
- k8s pod (CPU + memory) running the index workers
- S3 storage for index objects (~1–5% overhead vs. original data size)
- SQS queue operations (~$0.40 per million messages)
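The storage and queue line items above reduce to simple arithmetic, sketched below. The overhead ratios and the SQS rate come from the list above; the data volume and message count in the example are made up, and pod and per-GB storage prices are deliberately left out because they vary by provider and region.

```python
def index_storage_gb(data_gb: float, overhead_ratio: float) -> float:
    """Extra storage consumed by index objects for `data_gb` of original data."""
    return data_gb * overhead_ratio


def sqs_cost_usd(messages: int, usd_per_million: float = 0.40) -> float:
    """Approximate SQS cost for a number of queue messages."""
    return messages / 1_000_000 * usd_per_million


# Example: 10 TB of logs and 5 million upload notifications.
low, high = index_storage_gb(10_000, 0.01), index_storage_gb(10_000, 0.05)
print(f"index storage: {low:.0f}-{high:.0f} GB, SQS: ${sqs_cost_usd(5_000_000):.2f}")
```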
Scaling
If files upload faster than they are indexed, the SQS queue buffers pending work, so no events are lost. Index worker pods scale up automatically via Kubernetes HPA.
Unindexed files remain queryable via full scan (slower than indexed queries but functional).
Deployment topologies:
- All-in-one: A single pod cluster handles the index, query, and stream roles (suitable for <100 GB/day)
- Separate clusters: Dedicated index/query/stream pods allow independent scaling (recommended for >500 GB/day)
See the deployment guide for sizing guidance.
Config Files
To configure the object storage index output module, edit these files.
Options
Specify the options below to configure one or more object storage index outputs:
| Name | Description | Category |
|---|---|---|
| indexObjectStorageName | Object storage logical name | Container |
| indexReadContainer | Name of the input Object Storage container | Container |
| indexReadObject | Name of the object storage blob to index | Container |
| indexWriteContainer | Name of target index container | Container |
| indexWriteTarget | Logical name identifying the origin of 'indexReadObject' | Output |
| indexReadExtractMessage | Use extractor for inner message | Parsing |
| indexReadMessageField | Message field name | Parsing |
| indexWriteByteRange | Max byte range size to index 'indexReadObject' | Accuracy |
| indexWriteResolution | Index time window resolution | Accuracy |
| indexWriteAccuracy | Bloom filter accuracy of index objects. | Accuracy |
| indexWriteTemplateMergeInterval | Merge template interval | Advanced |
| indexObjectStorageArgs | Custom Object storage args | Advanced |
| indexReadPrintProgress | Sets whether this input prints throughput stats to the console | General |
Container
indexObjectStorageName
Object storage logical name.
| Type | Required | Category |
|---|---|---|
| String | ✔ | Container |
Identifies the Object Storage containing the blob to index (e.g., AWS).
indexReadContainer
Name of the input Object Storage container.
| Type | Required | Category |
|---|---|---|
| String | ✔ | Container |
Specifies the Object Storage container (e.g., AWS S3 bucket) containing the blob (e.g., log file) to index.
indexReadObject
Name of the object storage blob to index.
| Type | Required | Category |
|---|---|---|
| String | ✔ | Container |
Specifies the blob (e.g., log file) name within indexReadContainer to index.
indexWriteContainer
Name of target index container.
| Type | Required | Category |
|---|---|---|
| String | ✔ | Container |
Specifies the storage container (e.g., AWS S3 bucket) to which TenXTemplate Filters (i.e., TenXTemplates and Bloom filters) are written.
Output
indexWriteTarget
Logical name identifying the origin of 'indexReadObject'.
| Type | Required | Category |
|---|---|---|
| String | ✔ | Output |
Specifies the logical name under which to store the index objects produced for indexReadObject.
This name commonly specifies the app which generated the events enclosed within this blob (e.g. acme-client).
Parsing
indexReadExtractMessage
Use extractor for inner message.
| Type | Default | Category |
|---|---|---|
| Boolean | false | Parsing |
Specifies whether to extract an inner field from the event JSON and use it, rather than the entire event, as the base for constructing the TenXObject.
indexReadMessageField
Message field name.
| Type | Default | Category |
|---|---|---|
| String | log | Parsing |
Name of the field in the event JSON that holds the actual message used to construct the TenXObject. Used only if indexReadExtractMessage is true.
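The interaction between indexReadExtractMessage and indexReadMessageField can be illustrated with the sketch below. The function `message_text` and the sample event are hypothetical, and the fallback to the raw event when the input is not JSON or the field is missing is an assumption, not documented behavior.

```python
import json


def message_text(raw_event: str, extract: bool = False, field: str = "log") -> str:
    """Return the text used as the base for the TenXObject."""
    if not extract:
        return raw_event  # use the entire event
    try:
        event = json.loads(raw_event)
    except json.JSONDecodeError:
        return raw_event  # assumption: non-JSON events are used as-is
    if isinstance(event, dict) and isinstance(event.get(field), str):
        return event[field]
    return raw_event      # assumption: fall back when the field is absent


event = '{"log": "user login ok", "stream": "stdout"}'
print(message_text(event))                # the whole JSON document
print(message_text(event, extract=True))  # only the inner "log" field
```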
Accuracy
indexWriteByteRange
Max byte range size to index 'indexReadObject'.
| Type | Required | Category |
|---|---|---|
| Number | ✔ | Accuracy |
Controls the chunk size in which the target object is indexed.
For example, if the target object is 1GB and this value is 2MB, the object is indexed in 2MB segments so that matching queries retrieve only the relevant chunks rather than the entire object.
To learn more, see byte range fetches.
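The segmentation in the example above can be sketched as follows. The helper names are hypothetical; the `bytes=start-end` form is the standard HTTP Range header syntax with inclusive offsets, which object stores such as AWS S3 accept for partial reads.

```python
def byte_ranges(object_size: int, range_size: int) -> list[tuple[int, int]]:
    """Split an object into inclusive (start, end) byte ranges of at most `range_size`."""
    return [
        (start, min(start + range_size, object_size) - 1)
        for start in range(0, object_size, range_size)
    ]


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


ranges = byte_ranges(1024 ** 3, 2 * 1024 ** 2)  # 1GB object, 2MB segments
print(len(ranges), range_header(*ranges[0]), range_header(*ranges[-1]))
```

A query that matches a single segment can then fetch 2MB instead of the full 1GB object.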
indexWriteResolution
Index time window resolution.
| Type | Required | Category |
|---|---|---|
| Number | ✔ | Accuracy |
Controls the index time range resolution.
For example, setting this to 1min means that queries over longer time ranges (e.g., 15min) do not unnecessarily fetch byte ranges outside the requested time frame.
The lower this value, the larger the output index.
Set this value to the minimum resolution at which the index is queried. For example, if queries to the index are in 5-minute increments, setting this value to 5min creates the most efficient index.
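The trade-off can be illustrated by counting the time windows a query touches at different resolutions. This is a sketch with hypothetical helper names; times are expressed in seconds, and the query end is treated as exclusive.

```python
def windows_for_query(start_s: int, end_s: int, resolution_s: int) -> list[int]:
    """Start times of every index window overlapping the query [start_s, end_s)."""
    first = start_s - start_s % resolution_s
    return list(range(first, end_s, resolution_s))


def overscan_seconds(start_s: int, end_s: int, resolution_s: int) -> int:
    """Seconds of indexed time covered by the touched windows but outside the query."""
    touched = windows_for_query(start_s, end_s, resolution_s)
    return len(touched) * resolution_s - (end_s - start_s)


# A 5-minute query (seconds 300-600) against three resolutions.
for resolution in (60, 300, 900):
    print(resolution, len(windows_for_query(300, 600, resolution)),
          overscan_seconds(300, 600, resolution))
```

For this 5-minute query, a 1min resolution touches five windows and a 5min resolution touches one, both with no time outside the query, while a 15min resolution touches one window that also covers 10 minutes the query did not ask for. The 5min setting therefore gives the smallest index that avoids overscan.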
indexWriteAccuracy
Bloom filter accuracy of index objects.
| Type | Required | Category |
|---|---|---|
| Number | ✔ | Accuracy |
Controls the accuracy of the Bloom filters within TenXTemplate Filters.
The index output stream produces a list of Bloom filters for each indexWriteResolution and indexWriteByteRange combination of the target blob. Query inputs use these filter objects to rule out byte ranges where the query criteria are known NOT to match.
For example, if a 10MB target blob contains events whose timestamps range from the beginning of the hour to 3min later, indexWriteResolution is set to 1min, and indexWriteByteRange is set to 2MB, up to 6 ranges are indexed separately. The templateHash and vars members of each TenXObject within a range are added to a list of Bloom filters whose accuracy must not fall below this value.
The higher the accuracy, the longer the list of filters created.
The query input uses Bloom filters to evaluate whether their corresponding byte ranges contain target search terms with the accuracy set by this value; the remainder is the chance of a false positive. In other words, if this value is 95, there is a 5% chance that a byte range that does NOT contain target terms is fetched and scanned.
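The relationship between accuracy and index size can be approximated with the textbook Bloom filter sizing formulas, sketched below. How the module actually sizes its filters is not stated here, so treat this as an illustration only: `bloom_parameters` and `filters_needed` are hypothetical helpers, and the 1024-byte cap per filter is taken from the AWS S3 example earlier on this page.

```python
import math


def bloom_parameters(expected_items: int, accuracy_percent: float) -> tuple[int, int]:
    """Return (bits, hash_functions) for a target accuracy using the standard formulas."""
    if not 0 < accuracy_percent < 100:
        raise ValueError("accuracy_percent must be between 0 and 100 (exclusive)")
    false_positive_rate = 1.0 - accuracy_percent / 100.0
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2)
    # k = (m / n) * ln 2
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return bits, hashes


def filters_needed(bits: int, max_filter_bytes: int = 1024) -> int:
    """Filters required if each one is capped at `max_filter_bytes` (assumed cap)."""
    return math.ceil(bits / (max_filter_bytes * 8))


for accuracy in (90, 95, 99):
    bits, hashes = bloom_parameters(100_000, accuracy)
    print(accuracy, bits, hashes, filters_needed(bits))
```

Raising the accuracy increases the number of bits required for the same set of values, which under a fixed per-filter size cap translates into a longer list of filters.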
Advanced
indexWriteTemplateMergeInterval
Merge template interval.
| Type | Default | Category |
|---|---|---|
| Number | 0 | Advanced |
Specifies the interval to wait between template merge operations. Each index operation stores its output TenXTemplate objects in the indexWriteContainer. Index operations periodically merge these template files into a single file, at the interval set by this value.
indexObjectStorageArgs
Custom Object storage args.
| Type | Default | Category |
|---|---|---|
| List | [] | Advanced |
Custom arguments passed as a map to the constructor of the underlying object storage. This list is expected to hold alternating keys and values (e.g., args: [key1, value1, key2, value2]).
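The flat list-to-map convention can be sketched as below. The function name is hypothetical, and rejecting an odd-length list is an assumption about reasonable validation rather than documented behavior.

```python
def args_to_map(args: list[str]) -> dict[str, str]:
    """Convert [key1, value1, key2, value2, ...] into {key1: value1, key2: value2, ...}."""
    if len(args) % 2 != 0:
        raise ValueError("expected an even number of entries (key/value pairs)")
    # Even positions hold keys, odd positions hold the matching values.
    return dict(zip(args[0::2], args[1::2]))


print(args_to_map(["key1", "value1", "key2", "value2"]))
```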
General
indexReadPrintProgress
Sets whether this input prints throughput stats to the console.
| Type | Default | Category |
|---|---|---|
| Boolean | false | General |
This option is commonly enabled when testing an integration with a remote endpoint.
This module is defined in index/module.yaml.