32. TAP Uploads: The right way

TAP uploads are powerful, but they do have limits. In general, you cannot upload billion-row tables and expect the services to play along.

To make things fast and save the server’s resources, you should only upload enough to select the relevant data. So, avoid:

first_result = svc1.run_sync(...).to_table()
second_result = svc2.run_sync(
  "SELECT * FROM local.t JOIN TAP_UPLOAD.up as b USING (foo, bar)",
  uploads={"up": first_result})

– this will upload all of first_result only to download it right back again; transferring data you already have, and ingesting it into the remote database in between, is just a waste of resources.

Instead, if you want to join on first_result’s columns foo and bar, make a new local table containing just those columns plus a unique local identifier (if no such identifier exists, add a record number, as sketched after the example), somewhat like this:

from astropy import table

first_result = svc1.run_sync(...).to_table()
# upload only the join columns plus a unique identifier, not the full table
remote_match = svc2.run_sync(
  "SELECT * FROM local.t JOIN TAP_UPLOAD.up as b USING (foo, bar)",
  uploads={"up": table.Table([
    first_result["main_id"],
    first_result["foo"],
    first_result["bar"]])}).to_table()
# re-attach the remote columns to the full local table via the identifier
second_result = table.join(
  first_result,
  remote_match,
  keys="main_id")
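
If first_result has no column that uniquely identifies its rows, add a plain record number before building the upload table. A minimal sketch (the column name recno is just an example):

import numpy as np

# attach a running record number to serve as the unique local identifier
first_result["recno"] = np.arange(len(first_result))

You would then upload recno instead of main_id and join on it afterwards.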

In practice, you might still run into resource limits when doing the upload join. In that case, you should first re-think what you’re doing. If there’s really no way around it, you can split the upload into chunks and run the remote query once per chunk. The attached file shows that technique together with the upload column selection; a minimal sketch follows.
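
Here is a minimal sketch of that chunking technique, assuming svc2 from above and an upload_table that already contains only the identifier and the join columns; the chunk size is an arbitrary example value and should stay below the service’s upload limit:

from astropy import table

matches = []
chunk_size = 20000  # arbitrary example value
for start in range(0, len(upload_table), chunk_size):
  chunk = upload_table[start:start + chunk_size]
  # run the upload join once per chunk of the local table
  matches.append(svc2.run_sync(
    "SELECT * FROM local.t JOIN TAP_UPLOAD.up as b USING (foo, bar)",
    uploads={"up": chunk}).to_table())
remote_match = table.vstack(matches)

After that, join remote_match back to the full local table on the identifier as before.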

Files


Markus Demleitner, Hendrik Heinl