Module grammar

A parser for ADQL.

The grammar follows the official BNF grammar quite closely, except where pyparsing makes a different approach desirable; the names should mostly match, except for the obvious mapping from underscores to camel case.

The grammar given in the spec has some nasty rules when you're parsing without backtracking and by recursive descent (which is what pyparsing does). I need some reformulations. The more interesting of those include:

TableReference

Trouble is that table_reference is left-recursive in the following rules:

<table_reference> ::=
        <table_name> [ <correlation_specification> ]
      | <derived_table> <correlation_specification>
      | <joined_table>

<joined_table> ::=
        <qualified_join>
      | <left_paren> <joined_table> <right_paren>

<qualified_join> ::=
        <table_reference> [ NATURAL ] [ <join_type> ] JOIN
        <table_reference> [ <join_specification> ]

We fix this by adding rules:

<sub_join> ::= '(' <joined_table> ')'

<join_opener> ::=
        <table_name> [ <correlation_specification> ]
      | <derived_table> <correlation_specification>
      | <sub_join>

and then writing:

<qualified_join> ::=
        <join_opener> [ NATURAL ] [ <join_type> ] JOIN
        <table_reference> [ <join_specification> ]
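
To illustrate why the rewritten rules are safe for recursive descent, here is a minimal hand-written sketch in plain Python (not pyparsing, and not the module's actual code). It assumes a heavily simplified grammar: correlation specifications, derived tables, NATURAL, join types and join specifications are left out, and all function names are hypothetical. The point is only that every alternative of join_opener consumes a token before any recursion happens.

```python
import re

TOKEN = re.compile(r"\s*(\(|\)|[A-Za-z_][A-Za-z0-9_]*)")


class ParseError(Exception):
    pass


def tokenize(text):
    """Split text into parentheses and identifier-like words."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ParseError("cannot tokenize at offset %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def expect(toks, i, literal):
    """Consume the given literal (case-insensitively) or fail."""
    if i < len(toks) and toks[i].upper() == literal:
        return i + 1
    raise ParseError("%r expected at token %d" % (literal, i))


def table_name(toks, i):
    if i < len(toks) and toks[i] not in ("(", ")") and toks[i].upper() != "JOIN":
        return toks[i], i + 1
    raise ParseError("table name expected at token %d" % i)


def sub_join(toks, i):
    i = expect(toks, i, "(")
    tree, i = joined_table(toks, i)
    return tree, expect(toks, i, ")")


def join_opener(toks, i):
    # Both alternatives consume a token first, so there is no left recursion.
    try:
        return table_name(toks, i)
    except ParseError:
        return sub_join(toks, i)


def qualified_join(toks, i):
    # A left-recursive version would call table_reference(toks, i) here,
    # at the same position, and never terminate.
    left, i = join_opener(toks, i)
    i = expect(toks, i, "JOIN")
    right, i = table_reference(toks, i)
    return ("JOIN", left, right), i


def joined_table(toks, i):
    try:
        return qualified_join(toks, i)
    except ParseError:
        return sub_join(toks, i)


def table_reference(toks, i):
    try:
        return joined_table(toks, i)
    except ParseError:
        return table_name(toks, i)


def parse(text):
    toks = tokenize(text)
    tree, i = table_reference(toks, 0)
    if i != len(toks):
        raise ParseError("unparsed input after token %d" % i)
    return tree
```

In this sketch, chained joins without parentheses nest to the right, because the right-hand side of qualified_join is a full table_reference; a parenthesized sub_join is needed to group joins on the left.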

statement

I can't have StringEnd appended to querySpecification since it's used in subqueries, but I need to have it to keep pyparsing from just matching parts of the input. Thus, the top-level production is for "statement".
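
The effect of the end-of-input anchor can be shown by analogy with Python's re module, which, like the behaviour described above, is satisfied with matching a prefix unless told otherwise. This is only a sketch: the pattern is a hypothetical stand-in for querySpecification, and the function names are made up for illustration.

```python
import re

# Hypothetical, drastically simplified stand-in for querySpecification.
QUERY_SPEC = r"SELECT\s+\w+\s+FROM\s+\w+"


def match_query_spec(text):
    """Match a query specification at the start of text, ignoring the rest."""
    m = re.match(QUERY_SPEC, text, re.IGNORECASE)
    return m.group(0) if m else None


def match_statement(text):
    """Match a query specification that must reach the end of the input."""
    m = re.match(QUERY_SPEC + r"\s*\Z", text, re.IGNORECASE)
    return m.group(0).rstrip() if m else None
```

The unanchored pattern remains usable inside larger expressions (as querySpecification is in subqueries), while only the top-level wrapper insists that nothing is left over.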

trig_function, math_function, system_defined_function

I think it's a bit funny to have the arity of functions in the syntax, but there you go. Anyway, I don't want a separate symbol for each function name, since symbols are expensive; instead, I match the names with a single Regex (trig1ArgFunctionName). The only exception is ATAN2, since it has a different arity from the rest of the lot.

Similarly, for math_function I group symbols by arity.

The system defined functions are also regrouped to keep the number of symbols reasonable.
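
Grouping function names by arity might look roughly like the following sketch, which uses plain re rather than pyparsing's Regex. The name lists are assumptions based on the ADQL function set (with ATAN2 as the two-argument trigonometric function), not a copy of the module's patterns; only trig1ArgFunctionName is a name taken from the text above.

```python
import re

# One pattern per arity group instead of one symbol per function name.
# The name lists are assumptions, not the module's actual patterns.
TRIG_1ARG = re.compile(r"ACOS|ASIN|ATAN|COS|COT|SIN|TAN", re.IGNORECASE)
TRIG_2ARG = re.compile(r"ATAN2", re.IGNORECASE)
MATH_1ARG = re.compile(
    r"ABS|CEILING|DEGREES|EXP|FLOOR|LOG10|LOG|RADIANS|SQRT", re.IGNORECASE)
MATH_2ARG = re.compile(r"MOD|POWER", re.IGNORECASE)

ARITY_GROUPS = [
    (TRIG_1ARG, 1),
    (TRIG_2ARG, 2),
    (MATH_1ARG, 1),
    (MATH_2ARG, 2),
]


def function_arity(name):
    """Return the number of arguments for a known function name, else None."""
    for pattern, arity in ARITY_GROUPS:
        # fullmatch keeps ATAN from claiming ATAN2 and LOG from claiming LOG10.
        if pattern.fullmatch(name):
            return arity
    return None
```

Inside a grammar the pattern is matched at a position within a longer input rather than against a complete name, so there the order of alternatives or a trailing word boundary matters in a way that fullmatch hides here.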

column_reference and below

Here the lack of backtracking hurts badly: once, say, a schema name has been matched together with a dot, that's it, even if the dot should really have separated schema and table.

Hence, we don't assign semantic labels in the grammar but leave that to whatever interprets the tokens.

The important rules here are:

<column_name> ::= <identifier>
<correlation_name> ::= <identifier>
<catalog_name> ::= <identifier>
<unqualified_schema name> ::= <identifier>
<schema_name> ::= [ <catalog_name> <period> ] <unqualified_schema name>
<table_name> ::= [ <schema_name> <period> ] <identifier>
<qualifier> ::= <table_name> | <correlation_name>
<column_reference> ::= [ <qualifier> <period> ] <column_name>

By substitution, one has:

<schema_name> ::= [ <identifier> <period> ] <identifier>

hence:

<table_name> ::= [[ <identifier> <period> ] <identifier> <period> ]
        <identifier>

hence:

<qualifier> ::= [[ <identifier> <period> ] <identifier> <period> ]
        <identifier>

(which matches both table_name and correlation_name) and thus:

<column_reference> ::= [[[ <identifier> <period> ] <identifier> <period> ]
        <identifier> <period> ] <identifier>

We need the table_name, qualifier, and column_reference productions.
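
Since the grammar only yields a flat sequence of identifiers, whatever interprets the tokens has to assign the roles afterwards. A minimal sketch of such an interpretation step for a column reference, assuming the identifiers have already been separated (the function name and the role labels are hypothetical):

```python
ROLES = ("catalog", "schema", "table", "column")


def label_column_reference(parts):
    """Assign roles to the identifiers of a dotted column reference.

    The last identifier is always the column; the ones before it are
    labelled from the right. Note that the identifier labelled "table"
    may equally be a correlation name, as the qualifier rule allows both.
    """
    if not 1 <= len(parts) <= len(ROLES):
        raise ValueError("a column reference has one to four identifiers")
    return dict(zip(ROLES[-len(parts):], parts))
```

Labelling from the right is what makes this work without backtracking: the number of identifiers is only known once the whole dotted name has been read.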

generalLiterals in unsignedLiterals

One point where I deviate from the published grammar is that I disallow generalLiterals in unsignedLiterals. Allowing them would let pyparsing match a string literal as a numericValueLiteral, which messes up string expressions. I'm not sure why generalLiterals are allowed in there anyway. If this bites at some point, we'll face a major rewrite of the grammar (or we'll need to dump pyparsing).

To make the whole thing work, I added the generalLiteral to the characterPrimary production.

Classes
  RegularIdentifier
regular identifiers are all C-style identifiers except reserved words.
  LongestMatch
pyparsing's Or, except that ParseFatalExceptions are still propagated.
Functions
  Args(pyparseSymbol)
wraps pyparseSymbol such that matches get added to an args list on the parent node.
  getADQLGrammarCopy()
returns a pair symbols, selectSymbol for a grammar parsing ADQL.
  enableDebug(syms, debugNames=None)
  enableTree(syms)
  getADQLGrammar()
returns a pair of (symbols, root) for an ADQL grammar.
Variables
  adqlReservedWords = set(['ABS', 'ACOS', 'AREA', 'ASIN', 'ATAN'...
  sqlReservedWords = set(['ABSOLUTE', 'ACTION', 'ADD', 'ALL', 'A...
  allReservedWords = set(['ABS', 'ABSOLUTE', 'ACOS', 'ACTION', '...
  userFunctionPrefix = '(gavo|ivo)'
  __package__ = 'gavo.adql'
Function Details

getADQLGrammarCopy()

returns a pair symbols, selectSymbol for a grammar parsing ADQL.

You should only use this if you actually require a fresh copy of the ADQL grammar. Otherwise, use getADQLGrammar or a wrapper function defined by a client module.

getADQLGrammar()

returns a pair of (symbols, root) for an ADQL grammar.

This is probably mainly useful for testing. In any case, you should not set names or parseActions on what is returned unless you are testing.


Variables Details

adqlReservedWords

Value:
set(['ABS',
     'ACOS',
     'AREA',
     'ASIN',
     'ATAN',
     'ATAN2',
     'BITWISE_AND',
     'BITWISE_NOT',
...

sqlReservedWords

Value:
set(['ABSOLUTE',
     'ACTION',
     'ADD',
     'ALL',
     'ALLOCATE',
     'ALTER',
     'AND',
     'ANY',
...

allReservedWords

Value:
set(['ABS',
     'ABSOLUTE',
     'ACOS',
     'ACTION',
     'ADD',
     'ALL',
     'ALLOCATE',
     'ALTER',
...