SSAP evolution 2012

Authors: Markus Demleitner, Petr Skoda
email:msdemlei@ari.uni-heidelberg.de
Version: 0.2

Abstract

In this note we propose some additions to the SSAP 1.1 IVOA recommendation with the aim of stimulating the standard development. All changes proposed here provide upward and downward compatibility (i.e., valid services remain valid, and clients and servers will continue to interoperate). The proposals include enabling queries with object name patterns, discovering object names within a service, and a specification of the getData operation.

Specifying the getData operation

Rationale

For many applications, server-side manipulations of spectra are highly desirable. This concerns, in particular, cutouts which drastically reduce the amount of data that must be transferred in many common use cases (e.g., analysis of line profiles).

To support this, clients must have a means to tell the server that they intend to retrieve only a part of the spectra offered, and they must be able to recognize services supporting server-side manipulations as well as discover which manipulations are supported.

The getData operation

The SSAP specification mentions a reserved getData operation without giving its semantics ([SSAP], sect. 5). This note defines its behaviour in order to support cutouts and other server-side manipulations.

Whenever this specification requires a response with a given status code, it is understood that this response is the final result of a request after processing redirects (30x status codes) and/or after requiring authentication (401).

An error response from the getData operation consists of a message in the format defined in [SSAP], sect. 8.10. In contrast to current SSAP practice, an error message SHOULD come with a 400 or 500 HTTP status code, depending on whether the server considers the error to be due to the input parameters (e.g., invalid generation parameters or an invalid combination of generation parameters) or due to a malfunction on its side. For consistency with SSAP version 1, a 200 response code is acceptable, too.
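The following Python sketch shows how a client might sort getData responses into success and error classes under these rules. It is an illustration only: the function name and the returned labels are hypothetical, and the check for an INFO element named QUERY_STATUS with value ERROR reflects our reading of the error format in [SSAP], sect. 8.10, so it should be treated as an assumption.

```python
import xml.etree.ElementTree as ET


def classify_getdata_response(status, body):
    """Classify the final getData response (after redirects).

    Returns one of the hypothetical labels "client_error",
    "server_error", "error_with_200", or "ok".
    """
    if 400 <= status < 500:
        return "client_error"
    if 500 <= status < 600:
        return "server_error"
    # SSAP 1.x style: an error document may arrive with status 200.
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "ok"  # not XML at all, e.g. a FITS spectrum
    for element in root.iter():
        # Drop the XML namespace, which differs between VOTable versions.
        local_name = element.tag.rsplit("}", 1)[-1]
        if (local_name == "INFO"
                and element.get("name") == "QUERY_STATUS"
                and element.get("value") == "ERROR"):
            return "error_with_200"
    return "ok"
```

A client would typically call this with the status code and body of the last response in a redirect chain, and inspect the error message only for the three error labels.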

Services supporting getData MUST accept a PUBDID query parameter to the getData operation and return a server-defined default representation (i.e., a spectrum in the service's preferred format) of the associated data set when queried with a PUBDID available on the system. Requests with any other PUBDID must yield a 404 HTTP status code. A getData request without PUBDID MUST raise an error.
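As an illustration, a client could assemble a getData request along the following lines. This is a minimal sketch: the function name and the example URLs are hypothetical, and selecting the operation with REQUEST=getData is an assumption made by analogy with the other SSAP operations.

```python
from urllib.parse import urlencode


def build_getdata_url(base_url, pubdid, **generation_params):
    """Return a getData URL for one data set.

    generation_params are passed through unchanged; whether a service
    accepts them is up to that service.
    """
    if not pubdid:
        # A getData request without PUBDID is an error, so refuse early.
        raise ValueError("getData requires a PUBDID")
    params = [("REQUEST", "getData"), ("PUBDID", pubdid)]
    params.extend(sorted(generation_params.items()))
    # SSAP base URLs may or may not already carry a query part.
    if base_url.endswith(("?", "&")):
        separator = ""
    elif "?" in base_url:
        separator = "&"
    else:
        separator = "?"
    return base_url + separator + urlencode(params)
```

For instance, `build_getdata_url("http://example.org/ssa?", "ivo://x/1")` percent-encodes the PUBDID and appends it after the REQUEST parameter.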

In addition, services MAY support additional SSA parameters or local generation parameters. In particular,

  • FORMAT -- allows the retrieval of a spectrum in some format convenient to the client.
  • BAND -- this allows cutouts; the band is specified as a (possibly half-open) interval of wavelengths in meters.
  • SPECRP -- this allows server-side resampling.
  • FLUXCALIB -- this may select, e.g., a continuum normalized version of a flux calibrated data set.

Even for SSAP defined parameters, no evaluation of SSAP metacharacters (comma, semicolon, slash) is supported, with the exception of BAND, where one single interval may be specified as in SSAP.
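A server-side parser for the restricted BAND syntax might look as follows. This is a sketch under assumptions: the function name is hypothetical, open ends are represented by infinities, and values without exactly one slash are rejected, which is our reading of "one single interval" rather than something the text spells out.

```python
import math


def parse_band(band):
    """Parse "lo/hi", "lo/" or "/hi" into a (lo, hi) tuple in meters.

    Open ends become -inf or +inf. Anything else, including lists
    with commas or qualifiers with semicolons, raises ValueError.
    """
    parts = band.split("/")
    if len(parts) != 2:
        raise ValueError("BAND must be a single interval: %r" % band)
    lo_text, hi_text = (part.strip() for part in parts)
    if not lo_text and not hi_text:
        raise ValueError("BAND interval has no limits: %r" % band)
    # float() rejects commas and semicolons, so no metacharacter
    # other than the single slash gets through.
    lo = float(lo_text) if lo_text else -math.inf
    hi = float(hi_text) if hi_text else math.inf
    if lo > hi:
        raise ValueError("BAND limits are reversed: %r" % band)
    return lo, hi
```

A service would map the resulting ValueError to an error response with a 400 status code.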

If a supported parameter has an unsupported value, the service MUST emit an error message. Unsupported parameter names SHOULD result in error messages. Requests resulting in an empty spectrum (e.g., a BAND specification outside of a data set's spectral coverage) MUST yield an error response.

Services that support getData MUST include a declaration to this effect in their VOTable responses to SSAP queryData requests. This is done using a TABLE element with name="generationParameters". It MUST declare all parameters legal in getData requests for the data sets described, even if they may only be pertinent to a subset of those.

Two sorts of parameters are supported per this specification:

  • enumerated parameters (e.g., FORMAT, FLUXCALIB). These MUST, in their VALUES child, enumerate all legal values for the parameter.
  • "continuous" parameters (e.g., BAND, SPECRP). These SHOULD, in their VALUES child, give minimal and maximal values sensible for the delivered data set.

The line between enumerated and "continuous" is, of course, a bit fuzzy when dealing with digital data. Server operators should apply common sense here with respect to user interfaces. The options of enumerated parameters should be suitable for popup menus, so even if your service is only capable of, say, 100 rebinning widths, enumerating all supported values would seem excessive, and giving minimal and maximal values should suffice, in particular if the service "rounds" to supported bin sizes. On the other hand, people cannot in general guess what strings a service supports in string-valued parameters, so enumerating those would seem necessary in almost all circumstances.

All such PARAM elements MUST have a DESCRIPTION child; if the parameters are part of the SSA data model, they MUST give the corresponding ssa utypes.

This table must come after the results table in the document so as not to confuse clients that do not support the getData operation. It SHOULD be part of the same RESOURCE as the results table.

Again, the presence of a PARAM in a generationParameters table entails no guarantee that a given data set can actually be transformed in the way defined. Clients must thus be prepared for error responses on getData.

Here is an example for the generationParameters table:

<TABLE name="generationParameters">
  <PARAM name="FORMAT" datatype="char" arraysize="*"
    value="application/x-votable+xml">
    <VALUES>
      <OPTION value="application/x-votable+xml"/>
      <OPTION value="text/plain"/>
      <OPTION value="application/fits"/>
    </VALUES>
  </PARAM>
  <PARAM name="BAND" datatype="float" unit="m">
    <VALUES>
      <MIN value="2e-7"/>
      <MAX value="8e-7"/>
    </VALUES>
  </PARAM>
</TABLE>
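
A client could read such a declaration from a queryData response roughly as follows. This is a sketch only: the function name and the layout of the returned dictionary are hypothetical, and MIN and MAX values are kept as strings rather than being converted.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip a "{namespace}" prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_generation_parameters(votable_text):
    """Map each declared getData parameter name to its VALUES content."""
    root = ET.fromstring(votable_text)
    declared = {}
    for table in root.iter():
        if (_local(table.tag) != "TABLE"
                or table.get("name") != "generationParameters"):
            continue
        for param in table:
            if _local(param.tag) != "PARAM":
                continue
            info = {"options": [], "min": None, "max": None}
            for values in param:
                if _local(values.tag) != "VALUES":
                    continue
                for child in values:
                    kind = _local(child.tag)
                    if kind == "OPTION":    # enumerated parameter
                        info["options"].append(child.get("value"))
                    elif kind == "MIN":     # "continuous" parameter
                        info["min"] = child.get("value")
                    elif kind == "MAX":
                        info["max"] = child.get("value")
            declared[param.get("name")] = info
    return declared
```

An empty result means the service does not declare getData support in this response; a non-empty "options" list is what a user interface would offer in a popup menu.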

Enabling Queries With Patterns

Rationale

While SSAP queries using ICRS positions satisfy many use cases, in some cases queries by object names are either more convenient or completely unavoidable. Examples include solar system objects, components of multiple star systems, or exoplanets. However, nomenclature for such objects is not always well-defined. To facilitate exhaustive searches, a mechanism is required to allow a "fuzzy" specification of the value of the TARGETNAME SSAP parameter.

Having said that, we recommend that names of objects present in Simbad should be chosen to at least resolve in Simbad, and if at all possible, Simbad's preferred name should be chosen, excepting bright star common names.

The WILDTARGET and WILDTARGETCASE Protocol Parameters

Queries using fuzzy name matching use two new SSAP parameters, WILDTARGET and WILDTARGETCASE. Services supporting them MUST declare this in their Metadata response. Clients SHOULD only use them on services declaring support for them (by SSAP rules, a service will return all spectra on a wildcard query if it does not support those parameters, which is almost certainly confusing to the user).

No PQL metacharacters (comma, semicolon, slash) are interpreted in WILDTARGET*; services supporting WILDTARGET* at all MUST support multiple occurrences of WILDTARGET* in requests.

The difference between WILDTARGET and WILDTARGETCASE is that the former uses case-insensitive matching, whereas WILDTARGETCASE does not do any case normalization. All implementing SSAP services MUST support case normalization within the ASCII character set. SSAP services containing names outside of ASCII should perform case normalization as performed in the source language of the respective characters when confronted with case-insensitive queries.

The syntax of WILDTARGET patterns follows POSIX shell patterns and is defined as follows:

  • Except for the metacharacters, all characters match themselves
  • * is a metacharacter matching zero or more arbitrary characters.
  • ? is a metacharacter matching exactly one arbitrary character.
  • [char. seq] is a metacharacter sequence matching any character in char. seq.
  • [!char. seq] is a metacharacter sequence matching any character but one in char. seq.
  • \ is a metacharacter escaping the next character, i.e., the next character is matched literally even if it is a metacharacter.
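
The rules above can be sketched as a translation into Python regular expressions. The function name is hypothetical and several choices are assumptions not covered by the text: characters inside brackets are taken literally (no ranges such as a-z), an unterminated or empty bracket expression is matched as a literal "[", and a trailing backslash matches itself.

```python
import re


def wildtarget_to_regex(pattern, case_sensitive=False):
    """Compile a WILDTARGET pattern; use case_sensitive=True for WILDTARGETCASE."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Backslash: the next character matches itself.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")
        elif ch == "?":
            out.append(".")
        elif ch == "[":
            negate = pattern.startswith("!", i + 1)
            start = i + 2 if negate else i + 1
            end = pattern.find("]", start)
            if end > start:
                members = "".join(re.escape(c) for c in pattern[start:end])
                out.append("[" + ("^" if negate else "") + members + "]")
                i = end + 1
                continue
            out.append(re.escape(ch))  # no usable "]": literal "["
        else:
            out.append(re.escape(ch))
        i += 1
    flags = re.DOTALL if case_sensitive else re.DOTALL | re.IGNORECASE
    return re.compile("".join(out), flags)
```

A target name matches if `fullmatch` succeeds on the whole name, e.g. `wildtarget_to_regex("alf Cen [AB]").fullmatch("Alf Cen B")` yields a match, while the same pattern compiled with `case_sensitive=True` does not match that name.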

RE Syntax Alternatives

The authors are not entirely sure whether the adoption of POSIX shell patterns is the best choice. Here is the deliberation that made us choose it, presented as a discussion of the alternatives:

  1. POSIX shell patterns

Plus:

  • Fairly simple, may just suffice
  • Quite a few people are familiar with them

Minus:

  • Somewhat limited
  • Translation layer for the database necessary for most DBMSes

  2. SQL patterns

Metacharacters are % and _, escaped using a backslash.

Plus:

  • It's SQL standard, thus implementation effort is almost zero
  • It's consistent with ADQL, which astronomers should be learning anyway

Minus:

  • Even more limited than (1)
  • Few people are currently familiar with them

  3. DOS patterns

Metacharacters are * and ?

Plus:

  • Very little implementation effort (basically, string replacement on the pattern)
  • More people are familiar with them than with (2)

Minus:

  • Quoting is not defined (DOS doesn't allow metacharacters in file names)
  • Even more limited than (1)

  4. Some subset of Perl Regular Expressions

Case-sensitive by default, supporting at least ., *, [] as metacharacters (maybe more?); use, e.g., (?i) to switch to case-insensitive.

Plus:

  • Fairly many people know them
  • Powerful
  • Widespread support in programming languages, DB systems, etc
  • WILDTARGETCASE would not be necessary

Minus:

  • Complex, may be confusing to many astronomers
  • Need to define an appropriate subset

  5. "Google-Like"

Order of words doesn't matter; optional stemming or similar "fuzzy" matches should be allowed.

Plus:

  • Familiar to probably everyone
  • Lots of leeway to try and be "smart"

Minus:

  • Tokenization is complex with our nomenclature
  • Not very powerful, may not be useful with complex names
  • Most likely different servers would return different results for the same pattern
  • Significant implementation effort

Finding Out What Object Names the Archive Has

Rationale

As discussed in Enabling Queries With Patterns, support for searching for targets by name is a rather common requirement. Given the wide variety of nomenclatures in use in astronomy and their inconsistent application in input data that cannot always be repaired by service operators, allowing users to discover what nomenclatures are in use, how consistently they are applied, and what objects can be located by TARGETNAME queries is a useful addition to SSAP's target name capabilities.

Of course, a client could request all records and extract the target names from the result to obtain such a list. For usual services, however, the response to a query for target names exclusively will be orders of magnitude smaller than a queryData response since less data is transferred for each row and only a single row is transferred for potentially many rows.

The getTargetNames operation

SSAP services SHOULD support a REQUEST=getTargetNames operation. It works analogously to the queryData operation, including support for all query parameters queryData has, except that only one row will be returned for every distinct object name that would be present in an equivalent response to queryData. Each row contains exactly one column with a utype of ssa:Target.Name.

Services not supporting getTargetNames will return an error message as per the SSAP specification.

Survey-type services may contain many thousands to many millions of object names. As with queryData, such services should overflow at MAXREC object names as defined in [SSAP]. Note again that getTargetNames evaluates the same parameters as queryData; thus, the problem can be overcome by, e.g., constraining the area queried in many scenarios.
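
The relation between the two operations can be illustrated with a small sketch. The function name and the "target_name" key are hypothetical, and sorting the names is our choice for reproducible output, not something required here.

```python
def target_names_response(querydata_rows, maxrec):
    """Derive a getTargetNames result from equivalent queryData rows.

    Returns (names, overflow): one entry per distinct target name,
    cut off at maxrec, plus a flag telling whether names were dropped.
    """
    names = sorted({row["target_name"] for row in querydata_rows})
    overflow = len(names) > maxrec
    return names[:maxrec], overflow
```

A real service would more likely push the de-duplication into its database query instead of materializing the full queryData result first.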

Changes

From Version 0.1

  • getData without PUBDID no longer returns a generic generationParameters table; instead, it is an error.

[SSAP] Tody, D. (ed.): Simple Spectral Access Protocol, V. 1.1