ANTLR Scanner

Extracts symbols from source files using ANTLR parsers.

The rules for each language control which symbols to extract from the AST and their context within the source file (e.g., class name, constant, enum). This enables generating symbol units from source code in virtually any programming language for which a grammar is defined.

Extensibility

Parsing source code files for a given language can be done by:

  • Compiling a .g4 grammar file with the ANTLR tool into a parser class file and specifying its fully qualified class name via the antlrParserClass argument.

  • Loading a .g4 grammar file directly via the antlrGrammarFile argument, assuming the target grammar does not require any language-specific code extensions.

The ANTLR scanner uses either pre-compiled or interpreted (.g4) parsers to read the contents of input files and traverse their AST.

To learn more see the ANTLR tutorial.
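The two options above can be sketched as a configuration for an additional language. This is a hypothetical example that reuses only keys found in the default configurations on this page; the language name, file extension, grammar paths, class names and rule name are placeholders rather than shipped defaults.

```yaml
# Hypothetical sketch: every value below is a placeholder, not a shipped default.
tenx: compile

antlr:

  lang: mylang                 # placeholder language name

  fileExt:
    - .mylang                  # placeholder file extension

  # Option 1: interpreted grammar, loaded directly from .g4 files
  grammarFile: grammar/MyLangParser.g4   # placeholder path
  lexerFile: grammar/MyLangLexer.g4      # placeholder path

  # Option 2: pre-compiled parser classes (use instead of Option 1)
  # parserClass: com.example.antlr.MyLangParser   # placeholder class name
  # lexerClass: com.example.antlr.MyLangLexer     # placeholder class name

  rootRule: compilationUnit    # placeholder: the entry rule of the grammar

  rule:
    - name: classDeclaration   # placeholder: a parser rule name from the grammar
      lang: mylang
      context: class
      recursive: false
      capture: allSymbols
```

Only one of the two options would normally be active at a time. The interpreted form avoids a build step, while the pre-compiled form is needed when the grammar relies on language-specific code extensions.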

Modules

  • ANTLR Languages


    Extract symbol values from source code in a variety of programming languages using ANTLR.

  • ANTLR Language Rules


    Rules for extracting symbols from a target ANTLR AST.

Configuration

To configure the ANTLR scanner module, edit these settings.

Below is the default configuration from: antlr/java.yaml.

# 🔟❎ 'compile' Java ANTLR symbol scanner configuration

# The 'java' configuration instructs the ANTLR scanner which symbol values (e.g. class/func names, code constants)
# to extract from Java source files.  

# IMPORTANT: while this scanner is enabled by default, the dedicated Java scanner
# takes precedence over it unless that dedicated scanner is disabled.
# To learn more see https://doc.log10x.com/compile/scanner/javaParser

# Set the 10x pipeline to 'compile'
tenx: compile

# ============================== ANTLR Options ================================

antlr:

  # The 'antlrLang' module is defined in: https://doc.log10x.com/compile/scanner/antlr/langs
  # These values define the ANTLR-generated parser class used
  # to tokenize and construct an AST for the target Java file.

  lang: java

  fileExt:
    - .java

  parserClass: com.log10x.antlr.generated.java.Java9Parser

  lexerClass: com.log10x.antlr.generated.java.Java9Lexer

  rootRule: compilationUnit

  reserved:
    - package
    - interface
    - class
    - enum

  # 'lineFilters' specifies a list of regex patterns for skipping input lines 
  lineFilters:
    - ^\s*(//.*|/\*.*\*/)$

  # The 'rule' list defines the nodes within a Java AST tree that are captured
  # and added to the output symbol unit.
  # To learn more see: https://doc.log10x.com/compile/scanner/antlr/rules/
  rule:
    - name: packageDeclaration
      lang: java
      context: package
      recursive: false
      capture: allSymbols

    - name: methodInvocation
      lang: java
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: enumDeclaration
      lang: java
      context: class
      recursive: false
      capture: allSymbols

    - name: annotation
      lang: java
      context: annotation_invoke
      recursive: true
      capture: literalsOnly

    - name: normalClassDeclaration
      lang: java
      context: class
      recursive: false
      capture: allSymbols

    - name: variableInitializer
      lang: java
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: enumConstant
      lang: java
      context: enum
      recursive: false
      capture: allSymbols

    - name: returnStatement
      lang: java
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: methodDeclaration
      lang: java
      context: method_decl
      recursive: false
      capture: literalsOnly

    - name: argumentList
      lang: java
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: methodInvocation_lfno_primary
      lang: java
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: methodDeclarator
      lang: java
      context: method_decl
      recursive: false
      capture: allSymbols

    - name: methodInvocation_lf_primary
      lang: java
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: normalInterfaceDeclaration
      lang: java
      context: class
      recursive: false
      capture: allSymbols

Below is the default configuration from: antlr/cpp.yaml.

# 🔟❎ 'compile' C++ ANTLR symbol scanner configuration

# The 'cpp' configuration instructs the ANTLR scanner which symbol values (e.g. class/func names, code constants)
# to extract from C++ source files.

# Set the 10x pipeline to 'compile'
tenx: compile

# ============================== ANTLR Options ================================

antlr:

  # The 'antlrLang' module is defined in: https://doc.log10x.com/compile/scanner/antlr/langs
  # These values define the ANTLR grammar .g4 file used
  # to tokenize and construct an AST for the target C++ file.

  lang: cpp

  fileExt:
    - .cpp
    - .cxx
    - .c++
    - .c
    - .h
    - .H
    - .hpp
    - .hxx
    - .h++
    - .cc
    - .hh

  grammarFile: grammar/CPP14Parser.g4

  lexerFile: grammar/CPP14Lexer.g4

  rootRule: translationUnit

  reserved:
    - '::'
    - '<<'
    - inline
    - namespace
    - <
    - class
    - '>'

  # 'lineFilters' specifies a list of regex patterns for skipping input lines 
  lineFilters:
    - ^\s*(//.*|/\*.*\*/)$
    - ^.*\bnew\(\);\b.*$

  # The 'rule' list defines the nodes within a cpp AST tree that are captured
  # and added to the output symbol unit.
  # To learn more see: https://doc.log10x.com/compile/scanner/antlr/rules/
  rule:
    - name: enumerator
      lang: cpp
      context: enum
      recursive: false
      capture: allSymbols

    - name: blockDeclaration
      lang: cpp
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: enumbase
      lang: cpp
      context: enum
      recursive: true
      capture: literalsOnly

    - name: classHeadName
      lang: cpp
      context: class
      recursive: true
      capture: allSymbols

    - name: enumSpecifier
      lang: cpp
      context: class
      recursive: false
      capture: literalsOnly
      condition: ^enumHead$
      tag: EnumHead

    - name: enumHead
      lang: cpp
      context: class
      recursive: false
      capture: allSymbols
      ifTag: EnumHead

    - name: functionDefinition
      lang: cpp
      context: method_decl
      recursive: false
      capture: allSymbols
      condition: ^functionDefinition$
      tag: FunctionDefinition

    - name: declaratorid
      lang: cpp
      context: method_decl
      recursive: true
      capture: allSymbols
      ifTag: FunctionDefinition

    - name: jumpStatement
      lang: cpp
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: functionBody
      lang: cpp
      context: method_invoke
      recursive: false
      capture: literalsOnly
    # condition: ^functionBody$
    # tag: FunctionBody

    - name: statement
      lang: cpp
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: classSpecifier
      lang: cpp
      context: class
      recursive: false
      capture: literalsOnly
      subRule: memberSpecification

    - name: staticAssertDeclaration
      lang: cpp
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: parameterDeclarationList
      lang: cpp
      context: method_invoke
      recursive: false
      capture: literalsOnly
    # condition: ^parameterDeclarationList$
    # tag: ParameterDeclarationList

    - name: enumeratorDefinition
      lang: cpp
      context: enum
      recursive: false
      capture: allSymbols

    - name: namespaceDefinition
      lang: cpp
      context: package
      recursive: false
      capture: allSymbols

    - name: constantExpression
      lang: cpp
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: templateArgumentList
      lang: cpp
      context: method_invoke
      recursive: false
      capture: literalsOnly
    # condition: ^templateArgumentList$
    # tag: TemplateArgumentList

Below is the default configuration from: antlr/scala.yaml.

# 🔟❎ 'compile' Scala symbol scanner configuration

# The 'scala' configuration instructs the ANTLR scanner which symbol values (e.g. class/func names, code constants)
# to extract from Scala source files.

# IMPORTANT: while this scanner is enabled by default, the dedicated Scala scanner
# takes precedence over it unless that dedicated scanner is disabled.
# To learn more see https://doc.log10x.com/compile/scanner/scalameta

# Set the 10x pipeline to 'compile'
tenx: compile

# ============================== ANTLR Options ================================

antlr:

  # The 'antlrLang' module is defined in: https://doc.log10x.com/compile/scanner/antlr/langs
  # These values define the ANTLR grammar .g4 file used
  # to tokenize and construct an AST for the target Scala file.

  lang: scala

  fileExt:
    - .scala
    - .sc

  grammarFile: grammar/scala.g4

  rootRule: compilationUnit

  reserved:
    - interface
    - class
    - enum
    - this

  # 'lineFilters' specifies a list of regex patterns for skipping input lines 
  lineFilters:
    - ^\s*(//.*|/\*.*\*/)$

  # The 'rule' list defines the nodes within a scala AST tree that are captured
  # and added to the output symbol unit.
  # To learn more see: https://doc.log10x.com/compile/scanner/antlr/rules/
  rule:
  - name: classParamClauses
    lang: scala
    context: class
    recursive: false
    capture: literalsOnly

  - name: patVarDef
    lang: scala
    context: var_assign
    recursive: true
    capture: literalsOnly

  - name: argumentExprs
    lang: scala
    context: method_invoke
    recursive: true
    capture: literalsOnly

  - name: annotation
    lang: scala
    context: annotation_invoke
    recursive: true
    capture: literalsOnly

  - name: objectDef
    lang: scala
    context: class
    recursive: true
    capture: allSymbols

  - name: paramClauses
    lang: scala
    context: class
    recursive: false
    capture: literalsOnly

  - name: classTemplateOpt
    lang: scala
    context: class
    recursive: false
    capture: literalsOnly

  - name: funDcl
    lang: scala
    context: method_decl
    recursive: false
    capture: allSymbols

  - name: classDef
    lang: scala
    context: class
    recursive: true
    capture: allSymbols

  - name: qualId
    lang: scala
    context: package
    recursive: true
    capture: allSymbols

  - name: traitDef
    lang: scala
    context: class
    recursive: true
    capture: allSymbols

  - name: funDef
    lang: scala
    context: method_decl
    recursive: true
    capture: allSymbols
    subRule: funSig

  - name: type_
    lang: scala
    context: method_decl
    recursive: false
    capture: literalsOnly

  - name: blockStat
    lang: scala
    context: var_assign
    recursive: true
    capture: literalsOnly

  - name: templateStat
    lang: scala
    context: method_invoke
    recursive: true
    capture: literalsOnly

Below is the default configuration from: antlr/go.yaml.

# 🔟❎ 'compile' Go ANTLR symbol scanner configuration

# The 'go' configuration instructs the ANTLR scanner which symbol values (e.g. class/func names, code constants)
# to extract from Go source files.

# Set the 10x pipeline to 'compile'
tenx: compile

# ============================== ANTLR Options ================================

antlr:

  # The 'antlrLang' module is defined in: https://doc.log10x.com/compile/scanner/antlr/langs
  # These values define the ANTLR-generated parser class used
  # to tokenize and construct an AST for the target Go file.
  lang: go

  fileExt:
    - .go

  parserClass: com.log10x.antlr.generated.golang.GoParser

  lexerClass: com.log10x.antlr.generated.golang.GoLexer

  rootRule: sourceFile

  reserved:
    - package
    - func

  # 'lineFilters' specifies a list of regex patterns for skipping input lines 
  lineFilters:
    - ^\s*(//.*|/\*.*\*/)$

  # The 'rule' list defines the nodes within a go AST tree that are captured
  # and added to the output symbol unit.
  # To learn more see: https://doc.log10x.com/compile/scanner/antlr/rules/
  rule:
    - name: constSpec
      lang: go
      context: enum
      recursive: false
      capture: allSymbols
      subRule: identifierList

    - name: returnStmt
      lang: go
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: assignment
      lang: go
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: expressionStmt
      lang: go
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: packageClause
      lang: go
      context: package
      recursive: false
      capture: allSymbols

    - name: arguments
      lang: go
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: typeSpec
      lang: go
      context: class
      recursive: false
      capture: allSymbols

    - name: functionDecl
      lang: go
      context: method_decl
      recursive: false
      capture: allSymbols

    - name: methodDecl
      lang: go
      context: method_decl
      recursive: false
      capture: allSymbols

Below is the default configuration from: antlr/javascript.yaml.

# 🔟❎ 'compile' JavaScript ANTLR symbol scanner configuration

# The 'javascript' configuration instructs the ANTLR scanner which symbol values (e.g. class/func names, code constants)
# to extract from JavaScript source files.  

# Set the 10x pipeline to 'compile'
tenx: compile

# ============================== ANTLR Options ================================

antlr:

  # The 'antlrLang' module is defined in: https://doc.log10x.com/compile/scanner/antlr/langs
  # These values define the ANTLR-generated parser class used
  # to tokenize and construct an AST for the target JavaScript file.

  lang: javascript

  fileExt:
  - .js

  parserClass: com.log10x.antlr.parsers.JavaScriptDecoratedParser

  lexerClass: com.log10x.antlr.generated.javascript.JavaScriptLexer

  rootRule: program2

  reserved:
    - function
    - constructor
    - class

  maxLineLength: 4096

  # 'lineFilters' specifies a list of regex patterns for skipping input lines 
  lineFilters:
    - ^\s*(//.*|/\*.*\*/)$

  # The 'rule' list defines the nodes within a js AST tree that are captured
  # and added to the output symbol unit.
  # To learn more see: https://doc.log10x.com/compile/scanner/antlr/rules/
  rule:
    - name: functionDeclaration
      lang: javascript
      context: method_decl
      recursive: false
      capture: allSymbols
      subRule: identifier

    - name: returnStatement
      lang: javascript
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: classDeclaration
      lang: javascript
      context: class
      recursive: false
      capture: allSymbols
      subRule: identifier

    - name: methodDefinition
      lang: javascript
      context: method_decl
      recursive: true
      capture: allSymbols
      subRule: identifier

    - name: formalParameterList
      lang: javascript
      context: method_decl
      recursive: false
      capture: literalsOnly

    - name: functionBody
      lang: javascript
      context: method_invoke
      recursive: false
      capture: literalsOnly

    - name: functionDecl
      lang: javascript
      context: method_decl
      recursive: false
      capture: allSymbols

    - name: anoymousFunctionDecl
      lang: javascript
      context: method_decl
      recursive: false
      capture: allSymbols

    - name: argumentsExpression
      lang: javascript
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: variableDeclaration
      lang: javascript
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: assignmentExpression
      lang: javascript
      context: var_assign
      recursive: true
      capture: literalsOnly

Below is the default configuration from: antlr/python.yaml.

# 🔟❎ 'compile' Python symbol scanner configuration

# The 'python' configuration instructs the ANTLR scanner which symbol values (e.g. class/func names, code constants)
# to extract from Python source files.  

# NOTE: while this scanner is enabled by default, the 'pythonAST' scanner, which uses
# the Python run-time's built-in AST parser, takes precedence over it
# unless a Python run-time is not installed or the 'pythonAST' scanner is disabled.
# To learn more see https://doc.log10x.com/compile/scanner/pythonAST

# Set the 10x pipeline to 'compile'
tenx: compile

# ============================== ANTLR Options ================================

antlr:

  # The 'antlrLang' module is defined in: https://doc.log10x.com/compile/scanner/antlr/langs
  # These values define the ANTLR-generated parser class used
  # to tokenize and construct an AST for the target Python file.

  lang: python

  fileExt:
    - .py

  parserClass: com.log10x.antlr.generated.python.PythonParser

  lexerClass: com.log10x.antlr.generated.python.PythonLexer

  rootRule: root

  reserved:
    - def
    - __str__
    - __init__
    - class

  # 'lineFilters' specifies a list of regex patterns for skipping input lines 
  lineFilters:
    - ^\s*(//.*|/\*.*\*/)$

  # The 'rule' list defines the nodes within a Python AST tree that are captured
  # and added to the output symbol unit.
  # To learn more see: https://doc.log10x.com/compile/scanner/antlr/rules/
  rule:
    - name: classdef
      lang: python
      context: class
      recursive: false
      capture: allSymbols
      subRule: name
      condition: (^Enum$|^enum$)
      tag: enumLiteral

    - name: testlist_star_expr
      lang: python
      context: enum
      recursive: true
      capture: allSymbolsIfMatchCond
      ifTag: enumLiteral

    - name: trailer
      lang: python
      context: method_invoke
      recursive: true
      capture: literalsOnly
      subRule: name

    - name: return_stmt
      lang: python
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: decorator
      lang: python
      context: annotation_invoke
      recursive: true
      capture: literalsOnly
      subRule: arglist

    - name: arglist
      lang: python
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: funcdef
      lang: python
      context: method_decl
      recursive: false
      capture: allSymbols
      subRule: name

    - name: assign_part
      lang: python
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: dictorsetmaker
      lang: python
      context: var_assign
      recursive: true
      capture: literalsOnly
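The default 'lineFilters' pattern in python.yaml above matches C-style comment lines (// and /* ... */). As a hedged sketch, assuming you also want the scanner to skip lines that hold only a Python '#' comment, the list in a local copy of python.yaml could read as follows. The second pattern is an illustrative assumption and is not part of the shipped defaults.

```yaml
antlr:

  lineFilters:
    # Default pattern from antlr/python.yaml (C-style comment lines)
    - ^\s*(//.*|/\*.*\*/)$

    # Assumed addition: skip lines containing only a Python '#' comment.
    # Single quotes keep the backslash and '#' literal in YAML.
    - '^\s*#.*$'
```

Quoting the added pattern is a defensive choice: an unquoted '#' that follows whitespace would start a YAML comment and silently truncate the value.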

Below is the default configuration from: antlr/csharp.yaml.

# 🔟❎ 'compile' C# ANTLR symbol scanner configuration

# The 'csharp' configuration instructs the ANTLR scanner which symbol values (e.g. class/func names, code constants)
# to extract from C# source files.

# Set the 10x pipeline to 'compile'
tenx: compile

# ============================== ANTLR Options ================================

antlr:

  # The 'antlrLang' module is defined in: https://doc.log10x.com/compile/scanner/antlr/langs
  # These values define the ANTLR-generated parser class used
  # to tokenize and construct an AST for the target C# file.

  lang: csharp

  fileExt:
    - .csx
    - .cs

  parserClass: com.log10x.antlr.generated.csharp.CSharpParser

  lexerClass: com.log10x.antlr.generated.csharp.CSharpLexer

  # 'charStreamClass' is set to remove language elements introduced after C# v6.0, the latest release supported by this grammar.
  # To learn more see https://github.com/antlr/grammars-v4/tree/master/csharp#readme
  charStreamClass: com.log10x.antlr.parsers.CSharpDowngrader

  rootRule: compilation_unit

  reserved:
    - namespace
    - interface
    - class
    - enum

  # 'lineFilters' specifies a list of regex patterns for skipping input lines 
  lineFilters:
    - ^\s*(//.*|/\*.*\*/)$

    # Remove unsupported C# v10 File scoped namespaces
    - ^namespace.*;$

    # Remove unsupported C# v9 Target-typed new operator
    # - "(?<!\\w)new\\s*\\(\\s*\\)\\s*;"

    - ".*new\\(.*"

  # The 'rule' list defines the nodes within a csharp AST tree that are captured
  # and added to the output symbol unit.
  # To learn more see: https://doc.log10x.com/compile/scanner/antlr/rules/
  rule:
    - name: interface_definition
      lang: csharp
      context: class
      recursive: false
      capture: allSymbols
      subRule: identifier

    - name: enum_member_declaration
      lang: csharp
      context: enum
      recursive: true
      capture: allSymbols

    - name: enum_definition
      lang: csharp
      context: class
      recursive: false
      capture: allSymbols
      subRule: identifier

    - name: namespace_body
      lang: csharp
      context: package
      recursive: true
      capture: literalsOnly

    - name: method_invocation
      lang: csharp
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: object_creation_expression
      lang: csharp
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: interpolated_regular_string_part
      lang: csharp
      context: method_invoke
      recursive: false
      capture: allSymbols
      subRule: interpolated_regular_string_part

    - name: namespace_declaration
      lang: csharp
      context: package
      recursive: true
      capture: allSymbols

    - name: statement_list
      lang: csharp
      context: method_invoke
      recursive: true
      capture: literalsOnly

    - name: constant_declaration
      lang: csharp
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: method_declaration
      lang: csharp
      context: method_decl
      recursive: true
      capture: allSymbols
      subRule: identifier

    - name: method_body
      lang: csharp
      context: method_decl
      recursive: false
      capture: literalsOnly

    #- name: method_member_name
    #  lang: csharp
    #  context: method_decl
    #  recursive: true
    #  capture: allSymbols

    - name: local_variable_initializer
      lang: csharp
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: returnStatement
      lang: csharp
      context: var_assign
      recursive: true
      capture: literalsOnly

    - name: class_definition
      lang: csharp
      context: class
      recursive: false
      capture: allSymbols
      subRule: identifier

    - name: attribute_argument
      lang: csharp
      context: annotation_invoke
      recursive: true
      capture: literalsOnly

Options

Specify the options below to configure the ANTLR scanner:

Name                    Description
antlrParseUnitTimeout   Source file parse timeout

antlrParseUnitTimeout

Source file parse timeout.

Type     Default
String   ""

Defines the maximum time allowed for parsing and scanning a source/binary input file before aborting. Default value: 60sec.
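As a minimal sketch, assuming the option maps to a 'parseUnitTimeout' key under 'antlr' in the same way the antlrParserClass argument appears as 'parserClass' in the configurations above, a longer timeout might be set as follows. Both the key placement and the '120sec' value format (modeled on the documented 60sec default) are assumptions, so check antlr/module.yaml before relying on them.

```yaml
tenx: compile

antlr:

  # Assumed key name, by analogy with antlrParserClass -> parserClass
  # Assumed value format, mirroring the documented "60sec" default
  parseUnitTimeout: 120sec
```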


This module is defined in antlr/module.yaml.