gavo.adql.grammar module

A parser for ADQL.

The grammar follows the official BNF grammar quite closely, except where pyparsing makes a different approach desirable. The symbol names mostly match those of the BNF, apart from the obvious mapping from underscores to camel case.

The grammar given in the spec has some rules that are nasty to parse by recursive descent without backtracking (which is what pyparsing does), so I need some reformulations. The more interesting of those include:

TableReference

Trouble is that table_reference is left-recursive in the following rules:

<table_reference> ::=
        <table_name> [ <correlation_specification> ]
      | <derived_table> <correlation_specification>
      | <joined_table>

<joined_table> ::=
        <qualified_join>
      | <left_paren> <joined_table> <right_paren>

<qualified_join> ::=
        <table_reference> [ NATURAL ] [ <join_type> ] JOIN
        <table_reference> [ <join_specification> ]

We fix this by adding rules:

<sub_join> ::= '(' <joined_table> ')'

<join_opener> ::=
        <table_name> [ <correlation_specification> ]
      | <derived_table> <correlation_specification>
      | <sub_join>

and then writing:

<qualified_join> ::=
        <join_opener> [ NATURAL ] [ <join_type> ] JOIN
        <table_reference> [ <join_specification> ]
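
To illustrate why the reformulation helps, here is a minimal hand-written recursive-descent sketch in plain Python. It is not the pyparsing grammar of this module: it assumes table names are bare identifiers and leaves out derived tables, correlation specifications, NATURAL, join types and join specifications. All function names are made up for the illustration. The point is that qualified_join now starts with join_opener, which always consumes a token before any recursion happens.

```python
import re

# Identifiers (including the JOIN keyword) and parentheses.
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[()])")


def tokenize(text):
    """Splits text into identifiers, keywords and parentheses."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected input at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def peek(tokens, pos):
    return tokens[pos] if pos < len(tokens) else None


# Each parser returns (tree, next_position) on success and None on failure.

def table_name(tokens, pos):
    token = peek(tokens, pos)
    if token is None or token in ("(", ")") or token.upper() == "JOIN":
        return None
    return token, pos + 1


def sub_join(tokens, pos):
    """<sub_join> ::= '(' <joined_table> ')'"""
    if peek(tokens, pos) != "(":
        return None
    inner = joined_table(tokens, pos + 1)
    if inner is None or peek(tokens, inner[1]) != ")":
        return None
    return inner[0], inner[1] + 1


def join_opener(tokens, pos):
    """<join_opener>: consumes a name or a '(' before anything recurses."""
    return table_name(tokens, pos) or sub_join(tokens, pos)


def qualified_join(tokens, pos):
    """<qualified_join> ::= <join_opener> JOIN <table_reference>"""
    opener = join_opener(tokens, pos)
    if opener is None or (peek(tokens, opener[1]) or "").upper() != "JOIN":
        return None
    right = table_reference(tokens, opener[1] + 1)
    if right is None:
        return None
    return ("JOIN", opener[0], right[0]), right[1]


def joined_table(tokens, pos):
    return qualified_join(tokens, pos) or sub_join(tokens, pos)


def table_reference(tokens, pos):
    # Try the longer alternative first, then fall back to a plain name.
    return joined_table(tokens, pos) or table_name(tokens, pos)


def parse(text):
    tokens = tokenize(text)
    result = table_reference(tokens, 0)
    if result is None or result[1] != len(tokens):
        raise ValueError("not a table reference: %r" % text)
    return result[0]
```

In this sketch, a chain like a JOIN b JOIN c nests to the right, because everything after the first JOIN is parsed as one table_reference; a parenthesised join on the left has to be written explicitly.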

statement

I can’t append StringEnd to querySpecification, since that symbol is also used in subqueries, but I need StringEnd somewhere to keep pyparsing from matching just a part of the input. Thus, the top-level production is “statement”.
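
The effect of the missing end anchor can be reproduced with Python's re module. This is only an analogy, not pyparsing code: the two patterns are hypothetical stand-ins for querySpecification and statement, and the toy query syntax is an assumption made for the example.

```python
import re

# Hypothetical stand-in for querySpecification: "SELECT <name> FROM <name>".
QUERY = r"SELECT\s+\w+\s+FROM\s+\w+"
# Stand-in for statement: the same pattern, anchored at the end of the input.
STATEMENT = QUERY + r"\s*\Z"
# The unanchored pattern stays reusable inside a larger one, e.g. a subquery.
SUBQUERY = r"\(\s*" + QUERY + r"\s*\)"

broken = "SELECT ra FROM stars WHERE"

# Without the anchor, a prefix of the broken input is silently accepted.
prefix_match = re.match(QUERY, broken)
# With the anchor, the trailing junk makes the match fail.
anchored_match = re.match(STATEMENT, broken)
# The unanchored pattern still works when embedded in parentheses.
subquery_match = re.match(SUBQUERY, "( SELECT ra FROM stars ) AS sub")
```

Had the anchor been part of QUERY itself, SUBQUERY could never match, since the closing parenthesis follows the inner query.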

trig_function, math_function, system_defined_function

I think it’s a bit funny to have the arity of functions in the syntax, but there you go. Anyway, I don’t want a separate symbol for each function name, since symbols are expensive; instead, I use a Regex (trig1ArgFunctionName). The only exception is ATAN2, since its arity differs from that of the rest of the lot.

Similarly, for math_function I group symbols by arity.

The system defined functions are also regrouped to keep the number of symbols reasonable.
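
As a rough illustration of grouping function names by arity, the following sketch uses one regular expression per arity. It is written with the standard re module rather than pyparsing's Regex, and the name lists are an assumption taken from the trig_function rule of ADQL, where ATAN2 is the one function taking two arguments. The helper trig_arity is hypothetical.

```python
import re

# One pattern per arity instead of one symbol per function name.
# The trailing \b keeps ATAN from matching the first four letters of ATAN2.
TRIG_1ARG = re.compile(r"(?i)(ACOS|ASIN|ATAN|COS|COT|SIN|TAN)\b")
TRIG_2ARG = re.compile(r"(?i)(ATAN2)\b")


def trig_arity(text):
    """Returns (NAME, arity) if text starts with a trig function name,
    None otherwise."""
    for pattern, arity in ((TRIG_1ARG, 1), (TRIG_2ARG, 2)):
        match = pattern.match(text)
        if match:
            return match.group(1).upper(), arity
    return None
```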

column_reference and below

Here the lack of backtracking hurts badly: once, say, a schema name has been matched together with a dot, that’s it, even if the dot should really have separated schema and table.

Hence, we don’t assign semantic labels in the grammar but leave that to whatever interprets the tokens.

The important rules here are:

<column_name> ::= <identifier>
<correlation_name> ::= <identifier>
<catalog_name> ::= <identifier>
<unqualified_schema name> ::= <identifier>
<schema_name> ::= [ <catalog_name> <period> ] <unqualified_schema name>
<table_name> ::= [ <schema_name> <period> ] <identifier>
<qualifier> ::= <table_name> | <correlation_name>
<column_reference> ::= [ <qualifier> <period> ] <column_name>

By substitution, one has:

<schema_name> ::= [ <identifier> <period> ] <identifier>

hence:

<table_name> ::= [[ <identifier> <period> ] <identifier> <period> ]
        <identifier>

hence:

<qualifier> ::= [[ <identifier> <period> ] <identifier> <period> ]
        <identifier>

(which matches both table_name and correlation_name) and thus:

<column_reference> ::= [[[ <identifier> <period> ] <identifier> <period> ]
        <identifier> <period> ] <identifier>

We need the table_name, qualifier, and column_reference productions.
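
What "leaving the labels to whatever interprets the tokens" can look like is sketched below. The function is hypothetical and not part of this module; it assumes regular identifiers only (a delimited identifier could itself contain a period) and labels the parts from the right, so the last part is always the column name.

```python
def label_column_reference(dotted):
    """Assigns semantic labels to the parts of a dotted column reference.

    The grammar only delivers up to four identifiers; which of them is the
    schema or the table is decided here, counting from the right.
    """
    parts = dotted.split(".")
    if not 1 <= len(parts) <= 4 or "" in parts:
        raise ValueError("not a column reference: %r" % dotted)
    labels = ("catalog", "schema", "table", "column")
    return dict(zip(labels[4 - len(parts):], parts))
```

Note that the part labelled "table" here may just as well be a correlation name; telling the two apart requires knowledge about the query that the grammar does not have.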

generalLiterals in unsignedLiterals

One point where I deviate from the published grammar is that I disallow generalLiterals in unsignedLiterals. Allowing them would let pyparsing match a string literal as a numericValueLiteral, which messes up string expressions. I’m not sure why generalLiterals are allowed in there anyway. If this bites at some point, we’ll face a major rewrite of the grammar (or we need to dump pyparsing).

To make the whole thing work, I added the generalLiteral to the characterPrimary production.

gavo.adql.grammar.Args(pyparseSymbol)

wraps pyparseSymbol such that matches get added to an args list on the parent node.

class gavo.adql.grammar.LongestMatch(exprs, savelist=False)

Bases: ParseExpression

pyparsing’s Or, except that ParseFatalExceptions are still propagated.

checkRecursion(parseElementList)

parseImpl(instring, loc, doActions=True)

class gavo.adql.grammar.RegularIdentifier(reservedWords)

Bases: Word

regular identifiers are all C-style identifiers except reserved words.

Filtering these in the parse action doesn’t always work properly for all versions of pyparsing, thus this special class.

reservedWords are assumed to be in upper case, but matching is case-insensitive.
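
The matching rule described here can be sketched in plain Python as follows. This is not the class's actual parseImpl; the function name, its return convention and the identifier pattern are assumptions made for the illustration.

```python
import re

# C-style identifier: a letter or underscore, then letters, digits, underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def match_regular_identifier(text, pos, reserved_words):
    """Returns (identifier, next_position), or None if there is no
    identifier at pos or the identifier is a reserved word.

    reserved_words must contain upper-case strings; the candidate is
    upper-cased before the lookup, which makes the check case-insensitive.
    """
    match = IDENTIFIER.match(text, pos)
    if match is None or match.group(0).upper() in reserved_words:
        return None
    return match.group(0), match.end()
```

Rejecting the reserved word while matching, rather than in a later parse action, means a failed match can simply fall through to the next alternative of the grammar.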

parseImpl(instring, loc, doActions=True)

gavo.adql.grammar.enableDebug(syms, debugNames=None)

gavo.adql.grammar.enableTree(syms)

gavo.adql.grammar.getADQLGrammar()

returns a pair of (symbols, root) for an ADQL grammar.

This is probably mainly useful for testing. In particular, you should not set names or parseActions on the returned objects unless you are testing.

gavo.adql.grammar.getADQLGrammarCopy(nodes)

returns a pair of (symbols, selectSymbol) for a grammar parsing ADQL.

You should only use this if you actually require a fresh copy of the ADQL grammar. Otherwise, use getADQLGrammar or a wrapper function defined by a client module.