Text File Scanner

Extracts symbol values from text/binary files using plain text tokenization or Jackson parsers.

Parses input by splitting lines with delimiters or using a JsonFactory for structured token reading.

When source code for a log format is unavailable (e.g., third-party services), scan a sample log to extract symbols for future parsing.

Size limit

Files over 50KB are skipped to avoid slow parsing of machine-generated content.

Configuration

To configure the Text file scanner module, edit these settings.

Below is the default configuration from text/jackson.yaml.

# 🔟❎ 'compile' text symbol scanner configuration

# The 'text' scanner parses text files for symbol values. It utilizes
# the Jackson parser library to parse any text/binary format it supports.
# This includes formats such as json, yaml, xml, ini, protobuf and more.

# The configuration below is added by default to the 10x 'compile' pipeline.
# Even so, if another text scanner is defined via the 'textScanners' options group
# whose 'fileNameFilter' matches that of the current target input file,
# it will take precedence over the text scanners defined below.
# To learn more about text scanner options below, see:
# https://doc.log10x.com/compile/scanner/text

# Set the 10x pipeline to 'compile'
tenx: compile

# =============================== Text Options ================================

text:

  - parserName: text
    fileNameFilter: '^.*\.(csv|txt|properties|csv|tsv|ini|conf|sh|log|out)$'
    maxLines: 500
    lineOffset: 0
    allowDigits: false
    minLength: 2
    maxLength: 50

  - parserName: json
    fileNameFilter: '^.*\.(json)$'
    parserFactoryClass: com.fasterxml.jackson.core.JsonFactory
    scanFieldValues: true

  - parserName: yaml
    fileNameFilter: '^.*\.(yml|yaml)$'
    parserFactoryClass: com.log10x.eng.scanner.text.TextYamlFactory
    scanFieldValues: true

  - parserName: xml
    fileNameFilter: '^.*\.(xml|xsd)$'
    parserFactoryClass: com.fasterxml.jackson.dataformat.xml.XmlFactory
    scanFieldValues: true

Options

Specify the options below to configure multiple Text file scanners:

Name	Description	Category
textParserName	Parser logical name	General
textFileNameFilter	Pattern to match for target input file name	General
textParserFactoryClass	Parser factory class	Parser
textParserFactoryArgs	Arguments for the 'textParserFactoryClass' constructor	Parser
textScanFieldValues	Controls whether to capture Jackson parser field values	Parser
textMaxLines	Max number of lines to scan	Text
textLineOffset	Line number from which to start scan	Text
textAllowDigits	Controls whether to capture tokens containing numeric characters as symbol tokens	Text
textMinLength	Min character length for a token to be considered a symbol value	Text
textMaxLength	Max character length for a token to be considered a symbol value	Text

General

`textParserName`

Parser logical name.

Type	Required	Category
String	✔	General

Defines a unique logical name for this parser (e.g., 'logs').

`textFileNameFilter`

Pattern to match for target input file name.

Type	Required	Category
String	✔	General

Defines a regex pattern that a file name must match for this scanner to apply to the file.
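
As a rough illustration of how such a filter selects a scanner, the sketch below tests file names against the default patterns from the configuration above. It is written in Python purely for illustration and is not the scanner's own code: `pick_parser` is a hypothetical helper, and returning the first matching entry is an assumption of this sketch rather than documented behavior.

```python
import re

# Default filters copied from the configuration above (parser name -> pattern).
DEFAULT_FILTERS = {
    "text": r"^.*\.(csv|txt|properties|csv|tsv|ini|conf|sh|log|out)$",
    "json": r"^.*\.(json)$",
    "yaml": r"^.*\.(yml|yaml)$",
    "xml": r"^.*\.(xml|xsd)$",
}

def pick_parser(file_name, filters=DEFAULT_FILTERS):
    """Return the name of the first parser whose filter matches, else None.

    First-match-wins ordering is an assumption made for this sketch.
    """
    for parser_name, pattern in filters.items():
        if re.search(pattern, file_name):
            return parser_name
    return None

print(pick_parser("service.log"))      # text
print(pick_parser("config/app.yaml"))  # yaml
print(pick_parser("archive.tar.gz"))   # None
```

A file whose name matches none of the filters, such as `archive.tar.gz` here, is not picked up by any of the default text scanners in this sketch.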

Parser

`textParserFactoryClass`

Parser factory class.

Type	Default	Category
String	""	Parser

Provides an optional fully qualified name of a class derived from JsonFactory.

If specified, the scanner instantiates the factory using a parameterless constructor and invokes its createParser method to generate a parser instance. The scanner uses the parser to read token values from the file.

If 'textParserFactoryArgs' is specified, this class must define a constructor that accepts a String[] to receive the optional config parameters.

`textParserFactoryArgs`

Arguments for the 'textParserFactoryClass' constructor.

Type	Default	Category
List	[]	Parser

Specifies arguments to pass to the parser factory instance constructor. This option only applies if 'textParserFactoryClass' is set.

`textScanFieldValues`

Controls whether to capture Jackson parser field values.

Type	Default	Category
Boolean	false	Parser

Controls whether a Jackson parser's VALUE_STRING tokens are also scanned for symbol values, or only its FIELD_NAME tokens.

This option only applies if 'textParserFactoryClass' is set.
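
The distinction can be sketched in Python by walking a parsed JSON document, where object keys stand in for Jackson's FIELD_NAME tokens and string values stand in for VALUE_STRING tokens. This is only an analogy and not the scanner's code: `collect_candidates` is a hypothetical helper, and the sketch assumes field names are always captured while string values are captured only when the flag is enabled.

```python
import json

def collect_candidates(node, scan_field_values=False):
    """Collect object keys, plus string values when scan_field_values is True."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found.append(key)  # analogous to a FIELD_NAME token
            found.extend(collect_candidates(value, scan_field_values))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_candidates(item, scan_field_values))
    elif isinstance(node, str) and scan_field_values:
        found.append(node)  # analogous to a VALUE_STRING token
    return found

doc = json.loads('{"level": "ERROR", "ctx": {"service": "checkout", "retries": 3}}')
print(collect_candidates(doc))
# ['level', 'ctx', 'service', 'retries']
print(collect_candidates(doc, scan_field_values=True))
# ['level', 'ERROR', 'ctx', 'service', 'checkout', 'retries']
```

Numeric values such as `3` are never collected in this sketch, since only string values are considered.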

Text

`textMaxLines`

Max number of lines to scan.

Type	Default	Category
Number	0	Text

Controls the maximum number of lines to scan for symbol values from the input file. This option is useful when scanning existing log files as 'templates' for parsing future logs from a similar input stream.

`textLineOffset`

Line number from which to start scan.

Type	Default	Category
Number	0	Text

Specifies the line number from which to start scanning for symbols. This option is useful when scanning a specific portion of a text file.
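
The combined effect of 'textLineOffset' and 'textMaxLines' can be pictured as selecting a window of lines. The Python sketch below is illustrative only: `select_lines` is a hypothetical helper, it assumes the offset counts lines to skip from the top of the file, and it does not model how a value of 0 for 'textMaxLines' is treated.

```python
from itertools import islice

def select_lines(lines, line_offset=0, max_lines=500):
    """Skip line_offset lines, then return at most max_lines lines.

    Treating the offset as a zero-based skip count is an assumption.
    """
    return list(islice(lines, line_offset, line_offset + max_lines))

sample = [f"line {i}" for i in range(10)]
print(select_lines(sample, line_offset=2, max_lines=3))
# ['line 2', 'line 3', 'line 4']
```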

`textAllowDigits`

Controls whether to capture tokens containing numeric characters as symbol tokens.

Type	Default	Category
Boolean	false	Text

Controls whether tokens that contain numeric characters (e.g., 0-9) are accepted as symbol tokens. Because alphanumeric combinations tend to have high cardinality (e.g., GUID, trace_id), adding them to symbol units is generally not advised unless they are specifically known to be constant, low-cardinality values.

`textMinLength`

Min character length for a token to be considered a symbol value.

Type	Default	Category
Number	0	Text

Sets the minimum character length a token must have to constitute a symbol value. Very short tokens (e.g., len < 3) have a high probability of being dynamic values with high cardinality and, as such, should not be captured as symbol values.

`textMaxLength`

Max character length for a token to be considered a symbol value.

Type	Default	Category
Number	0	Text

Sets the maximum character length a token may have to constitute a symbol value. Very long tokens (e.g., len > 100) have a high probability of being dynamic values with high cardinality and, as such, should not be captured as symbol values.
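
Taken together, the three Text options act as a filter on individual tokens. The Python sketch below illustrates that idea using the values from the default 'text' parser entry (allowDigits: false, minLength: 2, maxLength: 50). It is a minimal sketch and not the scanner's implementation: `is_symbol_candidate` is a hypothetical helper, splitting on whitespace is an assumed delimiter, and treating both length bounds as inclusive is also an assumption.

```python
import re

def is_symbol_candidate(token, allow_digits=False, min_length=2, max_length=50):
    """Apply the length and digit checks to a single token.

    Inclusive length bounds are an assumption of this sketch.
    """
    if not (min_length <= len(token) <= max_length):
        return False
    if not allow_digits and any(ch.isdigit() for ch in token):
        return False
    return True

line = "ERROR connection to db-7f3a9c timed out after 30 ms"
tokens = re.split(r"\s+", line.strip())  # whitespace delimiter is assumed
print([t for t in tokens if is_symbol_candidate(t)])
# ['ERROR', 'connection', 'to', 'timed', 'out', 'after', 'ms']
```

The identifier `db-7f3a9c` and the duration `30` are dropped because they contain digits, which matches the rationale for keeping high-cardinality values out of symbol units.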

This module is defined in text/module.yaml.