19. Query with Upload

For each ObsTAP service, we query against our object list:

if not svc.upload_methods:
  return

result = vohelper.run_sync_resilient(svc,
  """SELECT TOP 2000 oc.obs_publisher_did, oc.access_url
      FROM ivoa.obscore AS oc
      JOIN TAP_UPLOAD.pois AS mine
      ON 1=CONTAINS(
        POINT('ICRS', oc.s_ra, oc.s_dec),
        CIRCLE('ICRS', mine.ra, mine.dec, 0.01))
      WHERE oc.dataproduct_type='spectrum'
      """,
  uploads={"pois": pois})

What’s going on here? Right after constructing the service, we check whether it supports table uploads at all; not all TAP services do. TAPService objects have a few other attributes that let you figure out things about a service. These include, in particular, resource limits (the maximum upload size, the limit to which maxrec can be raised, and so on).
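
With pyvo, you can inspect some of these directly; for instance (the attribute names are current pyvo, the printed values of course depend on the service):

import pyvo

svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
print(svc.upload_methods)  # empty if the service does not support uploads
print(svc.maxrec)          # the default output limit, in effect an implied TOP
print(svc.hardlimit)       # the highest value to which maxrec can be raised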

Here, it’s enough to know there’s any upload method at all, because the standard says that inline upload must be supported if there’s any upload support.

To actually perform the upload, pass a dictionary to the uploads keyword argument of run_sync and friends. The keys there are simple names (starting with a letter and letters or numbers after that), the values can be various things, but you’ll probably get by passing either a string (which is interpreted as a URL to fetch a VOTable from) or an astropy table.
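
For instance, either of these will do (query stands for any ADQL referencing TAP_UPLOAD.pois, and the URL is a placeholder; it has to point to a VOTable the service can retrieve):

# pass an astropy table directly...
result = svc.run_sync(query, uploads={"pois": pois})

# ...or a URL from which the service fetches a VOTable itself
result = svc.run_sync(query, uploads={"pois": "http://example.org/pois.vot"})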

You can upload multiple tables using different keys; for each key, a table TAP_UPLOAD.key becomes available – in the example above, that’s TAP_UPLOAD.pois. You will almost always join the uploaded table with a table on the service, and thus it’s almost always a good idea to use ADQL’s AS construct to give abbreviated names to tables. The name mine is typically a good choice if you only have one upload.

Note that even if you do not strictly have to qualify column names with their source table names in a query with a join, you will regret not doing so in queries you later reuse (trap: you’ll always reuse the one you least expect to). Just because the table uploaded here has no column s_ra does not mean the table in your next program won’t have one; if it does and you’ve not used the oc. prefix, your query becomes ambiguous and your function will fail.

Instead of the common run_sync, this uses vohelper.run_sync_resilient, which catches all kinds of exceptions and other trouble. As said above, when you do all-VO queries, expect at least one service to fail completely and another to give results that look like they come from a fuzzer.
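
vohelper comes with the course material; in case you do not have it, here is a rough sketch of what such a helper might look like (an assumption about its behaviour, not its actual source):

def run_sync_resilient(svc, *args, **kwargs):
  """Run svc.run_sync, swallowing the sort of trouble individual
  VO services routinely cause; returns None on failure."""
  try:
    return svc.run_sync(*args, **kwargs)
  except KeyboardInterrupt:
    # let the user break out of a hanging service
    raise
  except Exception as ex:
    print("Skipping {}: {}".format(svc.baseurl, ex))
    return None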

The actual obscore query does a classical ADQL 2.0 crossmatch, because we are querying lots of services, many of which will not have been updated to more recent standards even by the time you read this. Also, spectra are essentially point-like objects, so you probably do not want to write the alternative that looks more attractive at first glance:

1=CONTAINS(POINT('', mine.ra, mine.dec), s_region)

That alternative could indeed be preferable if you are looking for images or other artefacts with a reasonable coverage. Note, however, that proper s_region support is not mandatory, whereas essentially all data providers get the center RA and Dec for their datasets roughly right. The bottom line is: if you can get by with just positions (rather than s_region) in your obscore queries, do it.

The code in get_spectra.py is actually a bit more general in that it doesn’t hardcode the column names in the uploaded table but instead discovers them using UCDs. So, as long as your tables are properly annotated, the function there will just work for global spectra discovery (or, if you change the query, really any other global ObsCore discovery on sets of positions).
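
The details are in get_spectra.py; roughly, the idea is something like this (a sketch assuming the table came from a VOTable, so the UCDs sit in the columns’ meta dictionaries):

def column_name_for_ucd(table, ucd):
  # return the name of the first column annotated with ucd
  for name in table.colnames:
    if table[name].meta.get("ucd") == ucd:
      return name
  raise KeyError("No column with UCD {}".format(ucd))

ra_name = column_name_for_ucd(pois, "pos.eq.ra;meta.main")
dec_name = column_name_for_ucd(pois, "pos.eq.dec;meta.main")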

Problems

(1)

Can you figure out the default output limit (i.e., in effect an implied TOP) for the TAP service at http://dc.g-vo.org/tap? How far can you raise it?

Can you write a program that figures it out for all TAP services out there that talk about tgas?

(2)

(This is more of an async exercise, but you need the upload material to solve it.)

One particularly cool part about async is that you can keep your results publicly available on the remote server for a while. That, in turn, you can use to do cross-service joins without having to download intermediate tables.

As said above, you can use URLs in a query’s upload argument. To try this out, review the TGAS and RAVE example above. Make the RAVE query asynchronous. Watch the resulting job, and when it is done, get the URI of the result table from the job’s result_uri attribute. Push that into an upload such that the following query does a join between the two datasets:

SELECT *
FROM
  tgas.main AS tg
  JOIN TAP_UPLOAD.rave AS mine
  ON (1=CONTAINS(
    POINT('', tg.ra, tg.dec),
    CIRCLE('', mine.raj2000, mine.dej2000, 1/3600.)))
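
If you want something to check your solution against, here is a sketch of the mechanics (the RAVE service URL and the names rave_query and join_query are assumptions; the actual access details are in the earlier TGAS and RAVE example):

import pyvo

rave_svc = pyvo.dal.TAPService("https://example.org/rave/tap")  # placeholder URL
job = rave_svc.submit_job(rave_query)  # rave_query: the RAVE ADQL from above
job.run()
job.wait()            # blocks until COMPLETED, ERROR, or ABORTED
job.raise_if_error()

tgas_svc = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
result = tgas_svc.run_sync(join_query,  # join_query: the query just shown
  uploads={"rave": job.result_uri})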

Obviously, this is much more logical than the first version, since there’s just one constraint on the magnitudes now (the one on the H-band in rave). When you send the resulting table to Aladin, you’ll see more matches in TGAS than you had when you were comparing the two catalog cuts manually.


Markus Demleitner, Hendrik Heinl