Articles from Software

  • A New Constraint Class in PyVO's Registry API: UAT

    A scan of a book page: lots of astronomy-relevant topics ranging from "Cronometrie" to "Kosmologie, Relativitätstheorie".  Overlaid a title page stating "Astronomischer Jahresbericht.  Die Literatur des Jahres 1967".

    This was how they did what I am talking about here almost 60 years ago: a page of the table of contents of the “Astronomischer Jahresbericht” for 1967, the last volume before it was turned into the English-language Astronomy and Astrophysics Abstracts, which were the main tool for literature work in astronomy until the ADS came along in the late 1990ies.

    I have recently created a pull request against pyVO to furnish the library with a new constraint to search for data and services: Search by a concept drawn from the Unified Astronomy Thesaurus UAT. This is not entirely different from the classical search by subject keywords that was what everyone did before we had the ADS, which is what I am trying to illustrate above. But it has some twists that, I would argue, still make it valuable even in the age of full-text indexes.

    To make my argument, let me first set the stage.

    Thesauri and the UAT

    (Disclaimer: I am currently a member of the UAT steering committee and therefore cannot claim neutrality. However, I would not claim neutrality otherwise, either: the UAT is not perfect, but it's already great)

    Librarians (and I am one at heart) love thesauri. Or taxonomies. Or perhaps even ontologies. What may sound like things out of a Harry Potter novel are actually ways to organise a part of the world (a “domain”) into “concepts”. If you are suitably minded, you can think of a “concept“ as a subset of the domain; “suitably minded“ here means that you consider the world as a large set of things and a domain a subset of this world. The IVOA Vocabularies specification contains some additional philosophical background on this way of thinking in sect. 5.2.4.

    On this other hand, if you are not suitably minded, a “concept” is not much different from a topic.

    There are differences in how each of thesaurus, taxonomy, and ontology does that organising (and people don't always agree on the differences). Ontologies, for instance, let you link concepts in every way, as in “a (bicycle) (is steered) (using) a (handle bar) (made of) ((steel) or (aluminum))“; every parenthesised phrase would be a node (which is a better term in ontologies than “concept”) in a suitably general ontology, and connecting these nodes creates a fine-graned representation of knowledge about the world.

    That is potentially extremely powerful, but also almost too hard for humans. Check out WordNet for how far one can take ontologies if very many very smart people spend very many years.

    Thesauri, on the other hand, are not as powerful, but they are simpler and within reach for mere humans: there, concepts are simply organised into something like a tree, perhaps (and that is what many people would call a taxonomy) using is-a relationships: A human is a primate is a mammal is a vertebrate is an animal. The UAT actually is using somewhat vaguer notions called “narrower” and “wider”. This lets you state useful if somewhat loose relationships like “asteroid-rotation is narrower than asteroid-dynamics”. For experts: The UAT is using a formalism called SKOS; but don't worry if you can't seem to care.

    The UAT is standing on the shoulders of giants: Before it, there has been the IAU thesaurus in 1993, and an astronomy thesaurus was also produced under the auspices of the IVOA. And then there were (and to some extent still are) the numerous keyword schemes designed by journal publishers that would also count as some sort of taxonomy or astronomy.

    “Numerous” is not good when people have to assign keywords to their journal articles: If A&A use something drastically or only subtly different from ApJ, and MNRAS still something else, people submitting to multiple journals will quite likely lose their patience and diligence with the keywords. For reasons I will discuss in a second, that is a shame.

    Therefore, at least the big American journals have now all switched to using UAT keywords, and I sincerely hope that their international counterparts will follow their example where that has not already happened.

    Why Keywords?

    Of course, you can argue that when you can do full-text searches, why would you even bother with controlled keyword lists? Against that, I would first argue that it is extremely useful to have a clear idea of what a thing is called: For example, is it delta Cephei stars, Cepheids, δ Cep stars or still something else? Full text search would need to be rather smart to be able to sort out terminological turmoil of this kind for you.

    And then you would still not know if W Virginis stars (or should you say “Type II Cepheids”? You see how useful proper terminology is) are included in whatever your author called Cepheids (or whatever they called it). Defining concepts as precisely as possible thus is already great.

    The keyword system becomes even more useful when the hiearchy we see in the Cepheid example becomes visible to computers. If a computer knows that there is some relationship between W Virgins stars and classical Cepheids, it can, for instance, expand or refine your queries (“give me data for all kinds of Cepheids”) as necessary. To give you an idea of how this looks in practice, here is how SemBaReBro displays the Cepheid area in the UAT:

    Arrows between texts like "Type II Cepheid variable stars", "Cepheid variable stars", and "Young disk Cepheid variable stars"

    In that image, only concepts associated with resources in the Registry have a spiffy IVOA logo; that so few VO resources claim to deal with Cepheids tells you that our data providers can probably improve their annotations quite a bit. But that is for another day; the hope is that as more people search using UAT concepts, the data providers will see a larger benefit in choosing them wisely[1].

    By the way, if you are a regular around here, you will have seen images like that before; I have talked about Sembarebro in 2021 already, and that post contains more reasons for having and maintaining vocabularies.

    Oh, and for the definitions of the concepts, you can (in general; in the UAT, there are still a few concepts without definitions) dereference the concept URI, which in the VO is always of the form <vocabulary uri>#<term identifier>, where the vocabulary URI starts with http://www.ivoa.net/rdf, after which there is the vocabulary name.

    Thus, if you point your web browser to https://www.ivoa.net/rdf/uat#cepheid-variable-stars[2], you will learn that a Cepheid is:

    A class of luminous, yellow supergiants that are pulsating variables and whose period of variation is a function of their luminosity. These stars expand and contract at extremely regular periods, in the range 1-50 days [...]

    The UAT constraint

    Remember? This was supposed to be a blog post about a new search constraint in pyVO. Well, after all the preliminaries I can finally reveal that once pyVO PR #649 is merged, you can search by UAT concepts:

    >>> from pyvo import registry
    >>> print(registry.search(registry.UAT("variable-stars")))
    <DALResultsTable length=2010>
                  ivoid               ...
                                      ...
                  object              ...
    --------------------------------- ...
             ivo://cds.vizier/b/corot ...
              ivo://cds.vizier/b/gcvs ...
               ivo://cds.vizier/b/vsx ...
              ivo://cds.vizier/i/280b ...
               ivo://cds.vizier/i/345 ...
               ivo://cds.vizier/i/350 ...
                                  ... ...
                ivo://cds.vizier/v/97 ...
             ivo://cds.vizier/vii/293 ...
       ivo://org.gavo.dc/apass/q/cone ...
    ivo://org.gavo.dc/bgds/l/meanphot ...
         ivo://org.gavo.dc/bgds/l/ssa ...
         ivo://org.gavo.dc/bgds/q/sia ...
    

    In case you have never used pyVO's Registry API before, you may want to skim my post on that topic before continuing.

    Since the default keyword search also queries RegTAP's res_subject table (which is what this constraint is based on), this is perhaps not too exciting. At least there is a built-in protection against typos:

    >>> print(registry.search(registry.UAT("varialbe-stars")))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/msdemlei/gavo/src/pyvo/pyvo/registry/rtcons.py", line 713, in __init__
        raise dalq.DALQueryError(
    pyvo.dal.exceptions.DALQueryError: varialbe-stars does not identify an IVOA uat concept (see http://www.ivoa.net/rdf/uat).
    

    It becomes more exciting when you start exploiting the intrinsic hierarchy; the constraint constructor supports optional keyword arguments expand_up and expand_down, giving the number of levels of parent and child concepts to include. For instance, to discover resources talking about any sort of supernova, you would say:

    >>> print(registry.search(registry.UAT("supernovae", expand_down=10)))
    <DALResultsTable length=593>
                     ivoid                   ...
                                             ...
                     object                  ...
    ---------------------------------------- ...
                       ivo://cds.vizier/b/sn ...
                     ivo://cds.vizier/ii/159 ...
                     ivo://cds.vizier/ii/189 ...
                     ivo://cds.vizier/ii/205 ...
                    ivo://cds.vizier/ii/214a ...
                     ivo://cds.vizier/ii/218 ...
                                         ... ...
               ivo://cds.vizier/j/pasp/122/1 ...
           ivo://cds.vizier/j/pasp/131/a4002 ...
               ivo://cds.vizier/j/pazh/30/37 ...
              ivo://cds.vizier/j/pazh/37/837 ...
    ivo://edu.gavo.org/eurovo/aida_snconfirm ...
                    ivo://mast.stsci/candels ...
    

    There is no overwhelming magic in this, as you can see when you tell pyVO to show you the query it actually runs:

    >>> print(registry.get_RegTAP_query(registry.UAT("supernovae", expand_down=10)))
    SELECT
      [crazy stuff elided]
    WHERE
    (ivoid IN (SELECT DISTINCT ivoid FROM rr.res_subject WHERE res_subject in (
      'core-collapse-supernovae', 'hypernovae', 'supernovae',
      'type-ia-supernovae', 'type-ib-supernovae', 'type-ic-supernovae',
      'type-ii-supernovae')))
    GROUP BY [whatever]
    

    Incidentally, some services have an ADQL extension (a “user defined function“ or UDF) that lets you do these kinds of things on the server side; that is particularly nice when you do not have the power of Python at your fingertips, as for instance interactively in TOPCAT. This UDF is:

    gavo_vocmatch(vocname STRING, term STRING, matchagainst STRING) -> INTEGER
    

    (documentation at the GAVO data centre). There are technical differences, some of which I try to explain in amoment. But if you run something like:

    SELECT ivoid FROM rr.res_subject
    WHERE 1=gavo_vocmatch('uat', 'supernovae', res_subject)
    

    on the TAP service at http://dc.g-vo.org/tap, you will get what you would get with registry.UAT("supernovae", expand_down=1). That UDF also works with other vocabularies. I particularly like the combination of product-type, obscore, and gavo_vocmatch.

    If you wonder why gavo_vocmatch does not go on expanding towards narrower concepts as far as it can go: That is because what pyVO does is semantically somewhat questionable.

    You see, SKOS' notions of what is wider and narrower are not transitive. This means that just because A is wider than B and B is wider than C it is not certain that A is wider than C. In the UAT, this sometimes leads to odd results when you follow a branch of concepts toward narrower concepts, mostly because narrower sometimes means part-of (“Meronymy”) and sometimes is-a (“Hyponymy“). Here is an example discovered by my colleague Adrian Lucy:

    interstellar-medium wider nebulae wider emission-nebulae wider planetary-nebulae wider planetary-nebulae-nuclei

    Certainly, nobody would argue that that the central stars of planetary nebulae somehow are a sort of or are part of the interstellar medium, although each individual relationship in that chain makes sense as such.

    Since SKOS relationships are not transitive, gavo_vocmatch, being a general tool, has to stop at one level of expansion. By the way, it will not do that for the other flavours of IVOA vocabularies, which have other (transitive) notions of narrower-ness. With the UAT constraint, I have fewer scruples, in particular since the expansion depth is under user control.

    Implementation

    Talking about technicalities, let me use this opportunity to invite you to contribute your own Registry constraints to pyVO. They are not particularly hard to write if you know both ADQL and Python. You will find several examples – between trivial and service-sensing complex in pyvo.registry.rtcons. The code for UAT looks like this (documentation removed for clarity[3]):

    class UAT(SubqueriedConstraint):
        _keyword = "uat"
        _subquery_table = "rr.res_subject"
        _condition = "res_subject in {query_terms}"
        _uat = None
    
        @classmethod
        def _expand(cls, term, level, direction):
            result = {term}
            new_concepts = cls._uat[term][direction]
            if level:
                for concept in new_concepts:
                    result |= cls._expand(concept, level-1, direction)
            return result
    
        def __init__(self, uat_keyword, *, expand_up=0, expand_down=0):
            if self.__class__._uat is None:
                self.__class__._uat = vocabularies.get_vocabulary("uat")["terms"]
    
            if uat_keyword not in self._uat:
                raise dalq.DALQueryError(
                    f"{uat_keyword} does not identify an IVOA uat"
                    " concept (see http://www.ivoa.net/rdf/uat).")
    
            query_terms = {uat_keyword}
            if expand_up:
                query_terms |= self._expand(uat_keyword, expand_up, "wider")
            if expand_down:
                query_terms |= self._expand(uat_keyword, expand_down, "narrower")
    
            self._fillers = {"query_terms": query_terms}
    

    Let me briefly describe what is going on here. First, we inherit from the base class SubqueriedConstraint. This is a class that takes care that your constraints are nicely encapsulated in a subquery, which generally is what you want in pyVO. Calmly adding natural joins as recommended by the RegTAP specification is a dangerous thing for pyVO because as soon as a resource matches your constraint more than once (think “columns with a given UCD”), the RegistryResult lists in pyVO will turn funny.

    To make a concrete SubqueriedConstraint, you have to fill out:

    • the table it will operate on, which is in the _subquery_table class attribute;
    • an expression suitable for a WHERE clause in the _condition attribute, which is a template for str.format. This is often computed in the constructor, but here it is just a constant expression and thus works fine as a class attribute;
    • a mapping _fillers mapping the substitutions in the _condition string template to Python values. PyVO's RegTAP machinery will worry about making SQL literals out of these, so feel free to just dump Python values in there. See the make_SQL_literal for what kinds of types it understands and expand it as necessary.

    There is an extra class attribute called _keyword. This is used by the pyvo.regtap machinery to let users say, for instance, registry.search(uat="foo.bar") instead of registry.search(registry.UAT("foo.bar")). This is a fairly popular shortcut when your constraints can be expressed as simple strings, but in the case of the UAT constraint you would be missing out on all the interesting functionality (viz., the query expansion that is only available through optional arguments to its constructor).

    This particular class has some extra logic. For one, we cache a copy of the UAT terms on first use at the class level. That is not critical for performance because caching already happens at the level of get_vocabulary; but it is convenient when we want query expansion in a class method, which in turn to me feels right because the expansion does not depend on the instance. If you don't grok the __class__ magic, don't worry. It's a nerd thing.

    More interesting is what happens in the _expand class method. This takes the term to expand, the number of levels to go, and whether to go up or down in the concept trees (which are of the computer science sort, i.e., with the root at the top) in the direction argument, which can be wider or narrower, following the names of properties in Desise, the format we get our vocabulary in. To learn more about Desise, see section 3.2 of Vocabularies in the VO 2.

    At each level, the method now collects the wider or narrower terms, and if there are still levels to include, calls itself on each new term, just with level reduced by one. I consider this a particularly natural application of recursion. Finally. everything coming back is merged into a set, which then is the return value.

    And that's really it. Come on: write your own RegTAP constraints, and also have fun with vocabularies. As you see here, it's really not that magic.

    [1]Also, just so you don't leave with the impression I don't believe in AI tech at all, something like SciX's KAILAS might also help improving Registry subject keywords.
    [2]Yes, in a little sleight of hand, I've switched the URI scheme to https here. That's not really right, because the term URIs are supposed to be opaque, but some browsers currently forget the fragment identifiers when the IVOA web server redirects them to https, and so https is safer for this demonstration. This is a good example of why the web would be a better place if http had been evolved to support transparent, client-controlled encryption (rather than inventing https).
    [3]I've always wanted to write this.
  • DaCHS 2.11: Persistent TAP Uploads

    The DaCHS logo, a badger's head and the text "VO Data Publishing"

    The traditional autumn release of GAVO's server package DaCHS is somewhat late this year, but not so late that could not still claim it comes after the interop. So, here it is: DaCHS 2.11 and the traditional what's new post.

    But first, while I may have DaCHS operators' attention: If you have always wondered why things in DaCHS are as they are, you will probably enjoy the article Declarative Data Publication with DaCHS, which one day will be in the proceedings of ADASS XXXIV (and before that probably on arXiv). You can read it in a pre-preprint version already now at https://docs.g-vo.org/I301.pdf, and feedback is most welcome.

    Persistent TAP Uploads

    The potentially most important new feature of DaCHS 2.11 (in my opinion) will not be news to regular readers of this blog: Persistent TAP Uploads.

    At this point, no client supports this, and presumably when clients do support it, it will look somewhat different, but if you like the bleeding edge and have users that don't mind an occasional curl or requests call, you would be more than welcome to help try the persistent uploads. As an operator, it should be sufficient to type:

    dachs imp //tap_user
    

    To make this more useful, you probably want to hand out proper credentials (make them with dachs adm adduser) to people who want to play with this, and point the interested users to the demo jupyter notebook.

    I am of course grateful for any feedback, in particular on how people find ways to use these features to give operators a headache. For instance, I really would like to avoid writing a quota system. But I strongly suspect will have to…

    On-loaded Execute-s

    DaCHS has a built-in cron-type mechanism, the execute Element. So far, you could tell it to run jobs every x seconds or at certain times of the day. That is fine for what this was made for: updates of “living” data. For instance, the RegTAP RD (which is what's behind the Registry service you are probably using if you are reading this) has something like this:

    <execute title="harvest RofR" every="40000">
      <job><code>
          execDef.spawnPython("bin/harvestRofR.py")
      </code></job>
    </execute>
    

    This will pull in new publishing registries from the Registry of Registries, though that is tangential; the main thing is that some code will run every 40 kiloseconds (or about 12 hours).

    Against using plain cron, the advantage is that DaCHS knows context (for instance, the RD's resdir is not necessary in the example call), that you can sync with DaCHS' own facilities, and most of all that everything is in once place and can be moved together. By the way, it is surprisingly simple to run a RegTAP service of your own if you already run DaCHS. Feel free to inquire if you are interested.

    In DaCHS 2.11, I extended this facility to include “events” in the life of an RD. The use case seems rather remote from living data: Sometimes you have code you want to share between, say, a datalink service and some ingestion code. This is too resource-bound for keeping it in the local namespace, and that would again violate RD locality on top.

    So, the functions somehow need to sit on the RD, and something needs to stick them there. To do that, I recommended a rather hacky technique with a LOOP with codeItems in the respective howDoI section. But that was clearly rather odious – and fragile on top because the RD you manipulated was just being parsed (but scroll down in the howDoI and you will still see it).

    Now, you can instead tell DaCHS to run your code when the RD has finished loading and everything should be in place. In a recent example I used this to have common functions to fetch photometric points. In an abridged version:

    <execute on="loaded" title="define functions"><job>
      <setup imports="h5py, numpy"/>
      <code>
      def get_photpoints(field, quadrant, quadrant_id):
        """returns the photometry points for the specified time series
        from the HDF5 as a numpy array.
    
        [...]
        """
        dest_path = "data/ROME-FIELD-{:02d}_quad{:d}_photometry.hdf5".format(
          field, quadrant)
        srchdf = h5py.File(rd.getAbsPath(dest_path))
        _, arr = next(iter(srchdf.items()))
    
        photpoints = arr[quadrant_id-1]
        photpoints = numpy.array(photpoints)
        photpoints[photpoints==0] = numpy.nan
        photpoints[photpoints==-9999.99] = numpy.nan
    
        return photpoints
    
    
      def get_photpoints_for_rome_id(rome_id):
        """as get_photpoints, but taking an integer rome_id.
        """
        field = rome_id//10000000
        quadrant = (rome_id//1000000)%10
        quadrant_id = (rome_id%1000000)
        base.ui.notifyInfo(f"{field} {quadrant} {quadrant_id}")
        return get_photpoints(field, quadrant, quadrant_id)
    
      rd.get_photpoints = get_photpoints
      rd.get_photpoints_for_rome_id = get_photpoints_for_rome_id
    </code></job></execute>
    

    (full version; if this is asking you to log in, tell your browser not to wantonly switch to https). What is done here in detail again is not terribly relevant: it's the usual messing around with identifiers and paths and more or less broken null values that is a data publisher's everyday lot. The important thing is that with the last two statements, you will see these functions whereever you see the RD, which in RD-near Python code is just about everywhere.

    dachs start taptable

    Since 2018, DaCHS has supported kickstarting the authoring of RDs, which is, I claim, the fun part of a data publisher's tasks, through a set of templates mildly customised by the dachs start command. Nobody should start a data publication with an empty editor window any more. Just pass the sort of data you would like to publish and start answering sensible questions. Well, “sort of data” within reason:

    $ dachs start list
    epntap -- Solar system data via EPN-TAP 2.0
    siap -- Image collections via SIAP2 and TAP
    scs -- Catalogs via SCS and TAP
    ssap+datalink -- Spectra via SSAP and TAP, going through datalink
    taptable -- Any sort of data via a plain TAP table
    

    There is a new entry in this list in 2.11: taptable. In both my own work and watching other DaCHS operators, I have noticed that my advice “if you want to TAP-publish any old material, just take the SCS template and remove everything that has scs in it” was not a good one. It is not as simple as that. I hope taptable fits better.

    A plan for 2.12 would be to make the ssap+datalink template less of a nightmare. So far, you basically have to fill out the whole thing before you can start experimenting, and that is not right. Being able to work incrementally is a big morale booster.

    VOTable 1.5

    VOTable 1.5 (at this point still a proposed recommendation) is a rather minor, cleanup-type update to the VO's main table format. Still, DaCHS has to say it is what it is if we want to be able to declare refposition in COOSYS (which we do). Operators should not notice much of this, but it is good to be aware of the change in case there are overeager VOTable parsers out there or in case you have played with DaCHS MIVOT generator; in 2.10, you could ask it to do its spiel by requesting the format application/x-votable+xml;version=1.5. In 2.11, it's application/x-votable+xml;version=1.6. If you have no idea what I was just saying, relax. If this becomes important, you will meet it somewhere else.

    Minor Changes

    That's almost it for the more noteworthy news; as usual, there are a plethora of minor improvements, bug fixes and the like. Let me briefly mention a few of these:

    • The ADQL form interface's registry record now includes the site name. In case you are in this list, please say dachs pub //adql after upgrading.
    • More visible legal info, temporal, and spatial coverage in table and service infos; one more reason to regularly run dachs limits!
    • VOUnit's % is now known to DaCHS (it should have been since about 2.9)
    • More vocabulary validation for VOResource generation; so, dachs pub might now complain to you when it previously did not. It is now right and was wrong before.
    • If you annotate a column as meta.bib.bibcode, it will be rendered as ADS links
    • The RD info links to resrecs (non-DaCHS resources, essentially), too.

    Upgrade As Convenient

    As usual, if you have the GAVO repository enabled, the upgrade will happen as part of your normal Debian apt upgrade. Still, if you have not done so recently, have a quick look at upgrading in the tutorial. If, on the other hand, you use the Debian-distributed DaCHS package and you do not need any of the new features, you can let things sit and enjoy the new features after your next dist-upgrade.

  • What's new in DaCHS 2.10

    A part of the IVOA product-type vocabulary, and the DaCHS logo with a 2.10 behind it.

    About twice a year, I release a new version of our VO server package DaCHS; in keeping with tradition, this post summarises some of the more notable changes of the most recent release, DaCHS 2.10.

    productTypeServed

    The next version of VODataService will probably have a new element for service descriptions: productTypeServed. This allows operators to declare what sort of files will come out of a service: images, time series, spectra, or some of the more exotic stuff found in the IVOA product-type vocabulary (you can of course give multiple of these). More on where this is supposed to go is found my Interop talk on this. DaCHS 2.10 now lets you declare what to put there using a productTypeServed meta item.

    For SIA and SSAP services, there is usually no need to give it, as RegTAP services will infer the right value from the service type. But if you serve, say, time series from SSAP, you can override the inference by saying something like:

    <meta name="productTypeServed">timeseries</meta>
    

    Where this really is important is in obscore, because you can serve any sort of product through a single obscore table. While you could manually declare what you serve by overriding obscore-extraevents in your userconfig RD, this may be brittle and will almost certainly get out of date. Instead, you can run dachs limits //obscore (and you should do that occasionally anyway if you have an obscore table). DaCHS will then feed the meta from what is in your table.

    A related change is that where a piece of metadata is supposed to be drawn from a vocabulary, dachs val will now complain if you use some other identifier. As of DaCHS 2.10 the only metadata item controlled in this way is productTypeServed, though.

    Registering Obscore Tables

    Speaking about Obscore: I have long been unhappy about the way we register Obscore tables. Until now, they rode piggyback in the registry record of the TAP services they were queriable through. That was marignally acceptable as long as we did not have much VOResource metadata specific to the Obscore table. In the meantime, we have coverage in space, time, and spectrum, and there are several meaningful relationships that may be different for the obscore table than for the TAP service. And since 2019, we have the Discovering Data Collections Note that gives a sensible way to write dedicated registry records for obscore tables.

    With the global dataset discovery (discussed here in February) that should come with pyVO 1.6 (and of course the productTypeServed thing just discussed), there even is a fairly pressing operational reason for having these dedicated obscore records. There is a draft of a longer treatment on the background on github (pre-built here) that I will probably upload into the IVOA document repository once the global discovery code has been merged. Incidentally, reviews of that draft before publication are most welcome.

    But what this really means: If you have an obscore table, please run dachs pub //obscore after upgrading (and don't forget to run dachs limits //obscore after you do notable changes to your obscore table).

    Ranking

    Arguably the biggest single usability problem of the VO is <drumroll> sorting! Indeed, it is safe to assume that when someone types “Gaia DR3“ into any sort of search mask, they would like to find some way to query Gaia's gaia_source table (and then perhaps all kinds of other things, but that should reasonably be sorted below even mirrors of gaia_source. Regrettably, something like that is really hard to work out across the Registry outside of these very special cases.

    Within a data centre, however, you can sensibly give an order to things. For DaCHS, that in particular concerns the order of tables in TAP clients and the order of the various entries on the root page. For instance, a recent TOPCAT will show the table browser on the GAVO data centre like this:

    Screenshot of a hierachical display, top-level entries are, in that order, ivoa, tap_schema, bgds, califadr3; ivoa is opened and shows obscore and obs_radio, califadr3 is opened and shows cubes first, then fluxpos tables and finally flux tables.

    The idea is that obscore and TAP metadata are way up, followed by some data collections with (presumably) high scientific value for which we are the primary site; within the califadr3 schema, the tables are again sorted by relevance, as most people will be interested in the cubes first, the somewhat funky fluxpos tables second, and in the entirely nerdy flux tables last.

    You can arrange this by assigning schema-rank metadata at the top level of an RD, and table-rank metadata to individual tables. In both cases, missing ranks default to 10'000, and the lower a rank, the higher up a schema or table will be shown. For instance, dfbsspec/q (if you wonder what that might be: see Byurakan to L2) has:

    <resource schema="dfbsspec">
      <meta name="schema-rank">100</meta>
        ...
        <table id="spectra" onDisk="True" adql="True">
          <meta name="table-rank">1</meta>
    

    This will put dfbsspec fairly high up on the root page, and the spectra table above all others in the RD (which have the implicit table rank of 10'000).

    Note that to make DaCHS notice your rank, you need to dachs pub the modified RDs so the ranks end up in DaCHS' dc.resources table; since the Registry does not much care for these ranks, this is a classic use case for the -k option that preserves the registry timestamp of the resource and will thus prevent a re-publication of the registry record (which wouldn't be a disaster either, but let's be good citizens). Ideally, you assign schema ranks to all the resources you care about in one go and then just say:

    dachs pub -k ALL
    

    The Obscore Radio Extension

    While the details are still being discussed, there will be a radio extension to Obscore, and DaCHS 2.10 contains a prototype implementation for the current state of the specification (or my reading of it). Technically, it comprises a few columns useful for, in particular, interferometry data. If you have such data, take a look at https://github.com/ivoa-std/ObsCoreExtensionForRadioData.git and then consider trying what DaCHS has to offer so far; now is the time to intervene if something in the standard is not quite the way it should be (from your perspective).

    The documentation for what to do in DaCHS is a bit scarce yet – in particular, there is no tutorial chapter on obs-radio, nor will there be until the extension has converged a bit more –, but if you know DaCHS' obscore support, you will be immediately at home with the //obs-radio#publish mixin, and you can see it in (very limited) action in the emi/q RD.

    The FITS Media Type

    I have for a long time recommended to use a media type of image/fits for FITS “images” and application/fits for FITS (binary) tables. This was in gross violation of standards: I had freely invented image/fits, and you are not supposed to invent media types without then registering them with the IANA.

    To be honest, the invention was not mine (only). There are applications out there flinging around image/fits types, too, but never mind: It's still bad practice, and DaCHS 2.10 tries to rectify it by first using application/fits even where defaults have been image/fits before, and actually retroactively changing image/fits to application/fits in the database where it can figure out that a column contains a media type.

    It is accepting image/fits as an alias for application/fits in SIAP's FORMAT parameter, and so I hope nothing will break. You may have to adapt a few regression tests, though.

    On the Way To pathlib.Path

    For quite a while, Python has had the pathlib module, which is actually quite nice; for instance, it lets you write dir / name rather than os.path.join(dir, name). I would like to slowly migrate towards Path-s in DaCHS, and thus when you ask DaCHS' configuration system for paths (something like base.getConfig("inputsDir")), you will now get such Path-s.

    Most operator code, however, is still isolated from that change; in particular, the sourceToken you see in grammars mostly remains a string, and I do not expect that to change for the forseeable future. This is mainly because the usual string operations many people to do remove extensions and the like (self.sourceToken[:-5]) will fail rather messily with Path-s:

    >>> n = pathlib.Path("/a/b/c.fits")
    >>> n[:-5]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'PosixPath' object is not subscriptable
    

    So, if you don't call getConfig in any of your DaCHS-facing code, you are probably safe. If you do and get exceptions like this, you know where they come from. The solution, stringification, is rather straightforward:

    >>> str(n)[:-5]
    '/a/b/c'
    

    Partly as a consequence of this, there were slight changes in the way processors work. I hope I have not damaged anyone's code, but if you do custom previews and you overrode classify, you will have to fix your code, as that now takes an accref together with the path to be created.

    Odds And Ends

    As usual, there are many minor improvements and additions in DaCHS. Let me mention security.txt support. This complies to RFC 9116 and is supposed to give folks discovering a vulnerability a halfway reliable way to figure out who to complain to. If you try http://<your-hostname>/.well-known/security.txt, you will see exactly what is in https://dc.g-vo.org/.well-known/security.txt. If this is in conflict with some bone-headed security rules your institution may have, you can replace security.txt in DaCHS' central template directory (most likely /usr/lib/python3/dist-packages/gavo/resources/templates/); but in that case please complain, and we will make this less of a hassle to change or turn off.

    You can no longer use dachs serve start and dachs serve stop on systemd boxes (i.e., almost all modern Linux boxes as configured by default). That is because systemd really likes to manage daemons itself, and it gets cross when DaCHS tries to do it itself.

    Also, it used to be possible to fetch datasets using /getproduct?key=some/accref. This was a remainder of some ancient design mistake, and DaCHS has not produced such links for twelve years. I have now removed DaCHS' ability to fetch accrefs from key parameters (the accrefs have been in the path forever, as in /getproduct/some/accref). I consider it unlikely that someone is bitten by this change, but I personally had to fix two ancient regression tests.

    If you use embedded grammars and so far did not like the error messages because they always said “unknown location“, there is help: just set self.location to some string you want to see when something is wrong with your source. For illustration, when your source token is the name of a text file you process line by line, you would write:

    <iterator><code>
      with open(self.sourceToken) as f:
        for line_no, line in enumerate(f):
          self.location = f"{self.sourceToken}, {line_no}"
          # not do whatever you need to do on line
    </code></iterator>
    

    When regression-testing datalink endpoints, self.datalinkBySemantics may come in handy. This returns a mapping from concept identifiers to lists of matching rows (which often is just one). I have caught myself re-implementing what it does in the tests itself once too often.

    Finally, and also datalink-related, when using the //soda#fromStandardPubDID descriptor generator, you sometimes want to add just an extra attribute or two, and defining a new descriptor generator class for that seems too much work. Well, you can now define a function addExtras(descriptor) in the setup element and mangle the descriptor in whatever way you like.

    For instance, I recently wanted to enrich the descriptor with a few items from the underlying database table, and hence I wrote:

    <descriptorGenerator procDef="//soda#fromStandardPubDID">
      <bind name="accrefPrefix">"dasch/q/"</bind>
      <bind name="contentQualifier">"image"</bind>
      <setup>
        <code>
          def addExtras(descriptor):
            descriptor.suppressAutoLinks = True
            with base.getTableConn() as conn:
              descriptor.extMeta = next(conn.queryToDicts(
                "SELECT * FROM dasch.plates"
                " WHERE obs_publisher_did = %(did)s",
                {"did": descriptor.pubDID}))
        </code>
      </setup>
    </descriptorGenerator>
    

    Upgrade As Convenient

    That's it for the notable changes in DaCHS 2.10. As usual, if you have the GAVO repository enabled, the upgrade will happen as part of your normal Debian apt upgrade. Still, if you have not done so recently, have a quick look at upgrading in the tutorial. If, on the other hand, you use the Debian-distributed DaCHS package and you do not need any of the new features, you can let things sit and enjoy the new features after your next dist-upgrade.

    Oh, by the way: If you are still on buster (or some other distribution that still has astropy 4): A few (from my perspective minor) things will be broken; astropy is evolving too fast, but in general, I am trying to hack around the changes to make DaCHS work at least with the astropys in oldstable, stable, and unstable. However, in cases when a failure seems to be more of an annoyance to, I am resigning. If any of the broken things do bother you, do let me know, but also consider installing a backport of astropy 5 or higher – or, better, to dist-upgrade to bookworm. Sorry about that.

  • Watch Sphinx Doctests

    No astronomy at all here; please move on if tooling for improving tooling bores you.

    While giving a lecture on pyVO, I am churning out quite a few pull requests against pyVO at the moment. I am also normally also fairly religious about running unit tests before doing a commit. But then PyVO unit tests became really, really slow a while ago when pytesting of the examples in the documentation was turned on, and so I started relying on the github continuous integration, which feels fairly wasteful – and also makes all kinds of minor idiocies public that I would have caught locally with a test suite that finishes within a minute or so.

    Regrettably, tooling for inspecting how doctests with sphinx and pytest run is not really great: All the code from one documentation file translates into a single test, and when that runs for five minutes, it's anyone's guess where the time is spent. After a bit of poking and asking around, it seemed to me that there indeed is no “doctest profiler” (if you will), at least not for pytest-executable doctests embedded in sphinx-processable ReStructuredText.

    Well, I thought, let's write a quick one. Originally, I had wanted to use the docutils parser for robustness, but once I tried to pull in the sphinx extensions and got lost in their modules I decided a simple, RE-based parser has to be enough.

    And here it is, my my quick-and-dirty doctest profiler: watch-doctests.py. Just put it into your path, make it executable, and you can do something like this:

    pyvo/docs/dal > watch-doctests.py index.rst | head -30
    ---0.00---------------
    
    import pyvo as vo
    ---0.94---------------
    
    service = vo.dal.SIAService("http://dc.zah.uni-heidelberg.de/lswscans/res/positions/siap/siap.xml")
    ---0.94---------------
    
    print(service.description)
    Scans of plates kept at Landessternwarte Heidelberg-Königstuhl. They
    were obtained at location, at the German-Spanish Astronomical Center
    (Calar Alto Observatory), Spain, and at La Silla, Chile. The plates
    cover a time span between 1880 and 1999.
    
    Specifically, HDAP is essentially complete for the plates taken with
    the Bruce telescope, the Walz reflector, and Wolf's Doppelastrograph
    at both the original location in Heidelberg and its later home on
    Königstuhl.
    ---1.02---------------
    
    import pyvo as vo
    ---1.02---------------
    
    from astropy.coordinates import SkyCoord
    ---1.02---------------
    
    from astropy.units import Quantity
    

    – so, you pass in the ReStructuredText with the embedded sphinx/pytest doctests, and then the thing extracts every line to be executed in the doctests (it ignores the outputs, so it will not actually check any assertions), prints the runtime so far in a separator and then runs the code through Python as usual: note that no automatic repr() of any non-None results – that the REPL does – happens. This is for profiling, not for test development.

    The quick hack helped me speed up the dal and registry doctests by sizeable factors, for instance because I am now avoiding downloads of large datasets, and I am using faster queries where I can.

    So, that's nice. But unless someone asks, I will distribute the code here only and in this ad-hoc fashion (probably with a link in the pyVO hackers' docs). I still believe there must be something a lot less hacky that does about the same thing somewhere out there…

  • Global Dataset Discovery in PyVO

    A Tkinter user interface with inputs for Space, Spectrum, and Time, a checkbox marked "inclusive", and buttons Run, Stop, Broadcast, Save, and Quit.

    Admittedly somewhat old-style: As part of teaching global dataset discovery to pyVO, I have also come up with a Tkinter GUI for it. See A UI for more on this.

    One of the more exciting promises of the Virtual Observatory was global dataset discovery: You say “Give me all spectra of object X that there are“, and the computer relates that request to all the services that might have applicable data. Once the results come in, they are merged into some uniformly browsable form.

    In the early VO, there were a few applications that let you do this; I fondly remember VODesktop. As the VO grew and diversified, however, this became harder and harder, partly because there were more and more services, partly because there were more protocols through which to publish data. Thus, for all I can see, there is, at this point, no software that can actually query all services plausibly serving, say, images or spectra in the VO.

    I have to say that writing such a thing is not for the faint-hearted, either. I probably wouldn't have tackled it myself unless the pyVO maintainers had made it an effective precondition for cleaning up the pyVO Servicetype constraint.

    But they did, and hence as a model I finally wrote some code to do all-VO image searches using all of SIA1, SIA2, and obscore, i.e., the two major versions of the Simple Image Access Protocol plus Obscore tables published through TAP services. I actually have already reported in Tucson on some preparatory work I did last summer and named a few problems:

    • There are too many services to query on a regular basis, but filtering them would require them to declare their coverage; far too many still don't.
    • With the current way of registering obscore tables, there is no way to know their coverage.
    • One dataset may be availble through up to three protocols on a single host.
    • SIA1 does not even let you constrain time and spectrum.

    Some of these problems I can work around, others I can try to fix. Read on to find out how I fared so far.

    The pyVO API

    Currently, the development happens in pyVO PR #470. While it is still a PR, let me point you to temporary pyVO docs on the proposed pyvo.discover module – of course, all of this is for review and probably not in the shape it will remain in[1].

    Followup (2024-11-28)

    With the recent release of pyVO 1.6, what is described here is actually available in the release (or by checking out the main branch of the repository).

    To quote from there, the basic usage would be something like:

    from pyvo import discover
    from astropy import units as u
    from astropy import time
    
    datasets, log = discover.images_globally(
      space=(339.49, 3.1, 0.1),
      spectrum=650*u.nm,
      time=(time.Time('1995-01-01'), time.Time('1995-12-31')))
    

    At this point, only a cone is supported as a space constraint, and only a single point in spectrum. It would certainly be desirable to be more flexible with the space constraint, but given the capabilities of the various protocols, that is hard to do. Actually, even with the plain cone Obscore (i.e., ironically, the most powerful of the discovery protocols covered here) currently results in an implementation that makes me unhappy: ugly, slow, and wrong. This is requires a longer discussion; see Appendix: Optionality Considered Harmful.

    datasets at this point is a list of, conceputally, Obscore records. Technically, the list contains instances of a custom class ImageFound, which have attributes named after the Obscore columns. In case you have doubts about the Semantics of any column, the Obscore specification is there to help. And yes, you can argue we should create a single astropy table from that list. You are probably right.

    PyVO adds an extra column over the mandatory obscore set, origin_service. This contains the IVOA identifier (IVOID) of the service at which the dataset was found. You have probably seen IVOIDs before: they are URIs with a scheme of ivo:. What you may not know: these things actually resolve, specifically to registry resource records. You can do this resolution in a web browser: Just prepend https://dc.g-vo.org/I/ to an IVOID and paste the result into the address bar. For instance, my Obscore table has the IVOID ivo://org.gavo.dc/__system__/obscore/obscore; the link below the IVOID leads you to an information page, which happens to be the resource's Registry record formatted with a bit of XSLT. A somewhat more readable but less informative rendering is available when you prepend https://dc.g-vo.org/LP/ (“landing page”).

    The second value returned from discover.images_globally is a list of strings with information on how the global discovery progressed. For now, this is not intended to be machine-readable. Humans can figure out which resources were skipped because other services already cover their data, which services yielded how many records, and which services failed, for instance:

    Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Obscore GAVO Data Center Obscore Table: 2 records
    SIA2 The VO @ ASTRON SIAP Version 2 Service: 0 records
    SIA2 ivo://au.csiro/casda/sia2 skipped: ReadTimeout: HTTPSConnectionPool(host=&apos;casda.csiro.au&apos;, port=443): Read timed out. (read timeout=20)
    SIA2 CADC Image Search (SIA): 0 records
    SIA2 European HST Archive SIAP service: 0 records
    ...
    

    (On the skipping, see Relationships below). I consider this crucial provenance, as that lets you assess later what you may have missed. When you save the results, be sure to save these, too.

    A feature that will presumably (see Inclusivity for the reasons for this expectation) be important at least for a few years is that you can pass the result of a Registry query, and pyVO will try to find services suitable for image discovery on that set of resources.

    A relatively straightforward use case for that is global obscore discovery. This would look like this:

    from pyvo import discover
    from pyvo import registry
    from astropy import units as u
    from astropy import time
    
    def say(discoverer, s):
            print(s)
    
    datasets, log = discover.images_globally(
      space=(274.6880, -13.7920, 1),
      time=(time.Time('1995-01-01'), time.Time('1995-12-31')),
      services=registry.search(registry.Datamodel("obscore")),
      watcher=say)
    

    The watcher thing lets you, well, watch the progress of the discovery; it receives an instance of the discoverer -- this is so you can abort a discoverer's activities from within some UI -- and the human-readable string to display or process in some other way.

    A UI

    To get an idea whether this API might one day work for the average astronomer, I have written a Tkinter-based GUI to global image discovery as it is now: tkdiscover (only available from github at this point). This is what a session with it might look like:

    Lots of TOPCAT windows with various graphs and tables, an x-ray image of the sky with overplotted points, and a play gray window offering the specification of space, spectrum, and time constraints.

    The actual UI is in the top right: A plain window in which you can configure a global discovery query by straightfoward serialisations of discover.images_globally's arguments:

    • Space (currently, a cone in RA, Dec, and search radius, separated by whitespace of commas)
    • Spectrum (currently, a single point as a wavelength in metres)
    • Time (currently, either a single point in time – which probably is rarely useful – or an interval, to be entered as civil DALI dates
    • Inclusivity.

    When you run this, this basically calls discover.images_globally and lets you know how it is progressing. You can click Broadcast (which sends the current result to all VOTable clients on the SAMP bus) or Save at any time and inspect how discovery is progressing. I predict you will want to do that, because querying dozens of services will take time.

    There is also a Stop button that aborts the dataset search (you will still have the records already found). Note that the Stop button will not interrupt running network operations, because the network library underneath pyVO, requests, is not designed for being interrupted. Hence, be patient when you hit stop; this may take as long as the configured timeout (currently is 20 seconds) if the service hangs or has to do a lot of work. You can see that tkdiscover has noticed your stop request because the service counter will show a leading zero.

    Service counter? Oh, that's what is at the bottom right of the window. Once service discovery is done, that contains three numbers: The number of services to query, the number of services queried already, and the number of services that failed.

    The table contains the obscore records described above, and the log lines are in the discovery_log INFO. I will give you that this is extremely unreadable in particular in TOPCAT, which normalises the line separators to plain whitespace. Perhaps some other representation of these log lines would be preferable: A PARAM with a char[][] (but VOTable still is terrible with arrays of variable-length strings)? Or a separate table with char[*] entries?

    Inclusivity

    I have promised above I'd explain the “Inclusive” part in both the pyVO API and the Tk UI. Well, this is a bit of a sad story.

    All-VO-queries take time. Thus, in pyVO we try to only query services that we expect serve data of interest. How do we arrive at expectations like that? Well, quite a few records in the Registry by now declare their coverage in space and time (cf. my 2018 post for details).

    The trouble is: Most still don't. The checkmark at inclusive decides whether or not to query these “undecidable” services. Which makes a huge difference in runtime and effort. With the pre-configured constraints in the current prototype (X-Ray images a degree around 274.6880, -13.7920 from the year 1995), we currently discover three services (of which only one actually needs to be queried) when inclusive is off. When it is on, pyVO will query a whopping 323 services (today).

    The inclusivity crisis is particularly bad with Obscore tables because of their broken registration pattern; I can say that so bluntly because I am the author of the standard at fault, TAPRegExt. I am preparing a note with a longer explanation and proposals for fixing matters – <cough> follow me on github –, but in all brevity: Obscore data is discovered using something like a flag on TAP services. That is bad because the TAP services usually have entriely different metadata from their Obscore table; think, in particular, of the physical coverage that is relevant here.

    It will be quite a bit of effort to get the data providers to do the Registry work required to improve this situation. Until that is done, you will miss Obscore tables when you don't check inclusive (or override automatic resource selection as above) – and if you do check inclusive, your discovery runs will take something like a quarter of an hour.

    Relationships

    In general, the sheer number of services to query is the Achilles' heel in the whole plan. There is nothing wrong with having a machine query 20 services, but querying 200 is starting to become an effort.

    With multi-data collection services like Obscore (or collective SIA2 services), getting down to a few dozen services globally for a well-constrained search is actually not unrealistic; once all resources properly declare their coverage, it is not very likely that more than 20 institutions worldwide will have data in a credibly small region of space, time, and spectrum. If all these run collective services and properly declare the datasets to be served by them, that's our 20-services global query right there.

    However, pyVO has to know when data contained in a resource is actually queriable by a collective service. Fortunately, this problem has already been addressed in the 2019 endorsed note on Discovering Data Collections Within Services: Basically, the individual resource declares an IsServedBy relationship to the collective service. PyVO global discovery already looks at these. That is how it could figure out these two things in the sample log given above:

    Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    

    But of course the individual services have to declare these relationships. Surprisingly many already do, as you can observe yourself when you run:

    select ivoid, related_id from
    rr.relationship
    natural join rr.capability
    where
    standard_id like 'ivo://ivoa.net/std/sia%'
    and relationship_type='isservedby'
    

    on your favourite RegTAP endpoint (if you have no preferences, use mine: http://dc.g-vo.org/tap). If you have collective services and run individual SIA services, too, please run that query, see if you are in there, and if not, please declare the necessary relationships. In case you are unsure as to what to do, feel free to contact me.

    Future Directions

    At this point, this is a rather rough prototype that needs a lot of fleshing out. I am posting this in part to invite the more adventurous to try (and break) global discovery and develop further ideas.

    Some extensions I am already envisaging include:

    • Write a similar module for spectra based on SSAP and Obscore. That would then probably also work for time series and similar 1D data.

    • Do all the Registry work I was just talking about.

    • Allow interval-valued spectral constraints. That's pretty straightforward; if you are looking for some place to contribute code, this is what I'd point you to.

    • Track overflow conditions. That should also be simple, probably just a matter of perusing the pyVO docs or source code and then conditionally produce a log entry.

    • Make an obscore s_region out of the SIA1 WCS information. This should also be easy – perhaps someone already has code for that that's tested around the poles and across the stitching line? Contributions are welcome.

    • Allow more complex geometries to define the spatial region of interest. To keep SIA1 viable in that scenario it would be conceivable to compute a bounding box for SIA1 POS/SIZE and do “exact” matching locally on the coarser SIA1 result.

    • Enable multi-position or multi-interval constraints. This pretty certainly would exclude SIA1, and, realistically, I'd probably only enable Obscore services with TAP uploads with this. With those constraints, it would be rather straightforward.

    • Add SODA support: It would be cool if my ImageFound had a way to say “retrieve data for my RoI only”. This would use SODA and datalink to do server-side cutouts where available and do the cut-out locally otherwise. If this sounds like rocket science: No, the standards for that are actually in place, and pyVO also has the necessary support code. But still the plumbing is somewhat tricky, partly also because pyVO's datalink API still is a bit clunky.

    • Going async? Right now, we civilly query one service after the other, waiting for each result before proceeding to the next service. This is rather in line with how pyVO is written so far.

      However, on the network side for many years asynchronous programming has been a very successful paradigm – for instance, our DaCHS package has been based on an async framework from the start, and Python itself has growing in-language support for async, too.

      Async allows you to you fire off a network request and forget about it until the results come back (yes, it's the principle of async TAP, too). That would let people run many queries in parallel, which in turn would result in dramatically reduced waiting times, while we can rather easily ensure that a single client will not overflow any server. Still, it would be handing a fairly powerful tool into possibly unexperienced hands… Well: for now there is no need to decide on this, as pyVO would need rather substantial upgrades to support async.

    Appendix: Optionality Considered Harmful

    The trouble with obscore and cones is a good illustration of the traps of attempting to fix problems by adding optional features. I currently translate the cone constraint on Obscore using:

    "(distance(s_ra, s_dec, {}, {}) < {}".format(
      self.center[0], self.center[1], self.radius)
    +" or 1=intersects(circle({}, {}, {}), s_region))".format(
      self.center[0], self.center[1], self.radius))
    

    which is all of ugly, presumably slow, and wrong.

    To appreciate what is going on, you need to know that Obscore has two ways to define the spatial coverage of an observation. You can give its “center” (s_ra, s_dec) and something like a rough radius (s_fov), or you can give some sort of geometry (e.g., a polygon: s_region). When the standard was written, the authors wanted to enable Obscore services even on databases that do not know about spherical geometry, and hence s_region is considered rather optional. In consequence, it is missing in many services. And even the s_ra, s_dec, s_fov combo is not mandatory non-null, so you are perfectly entitled to only give s_region.

    That is why there are the two conditions or-ed together (ugly) in the code fragment above. 1=intersects(circle(.), s_region) is the correct part; this is basically how the cone is interpreted in SIA1, too. But because s_region may be NULL even when s_ra and s_dec are given, we also need to do a test based on the center position and the field of view. That rather likely makes things slower, possibly quite a bit.

    Even worse, the distance-based condition actually is wrong. What I really ought to take into account is s_fov and then do something like distance(.) < {self.radius}+s_fov, that is, the dataset position need only be closer than the cone radius plus the dataset's FoV (“intersects”). But that would again produce a lot of false negatives because s_fov may be NULL, too, and often is, after which the whole condition would be false.

    On top of that, it is virtually impossible that such an expression would be evaluated using an index, and hence with this code in place, we would likely be seqscanning the entire obscore table almost every time – which really hurts when you have about 85 Million records in your Obscore table (as I do).

    The standard could immediately have sanitised all this by saying: when you have s_ra and s_dec, you must also give a non-empty s_fov and s_region. This is a classic case for where a MUST would have been necessary to produce something that is usable without jumping through hoops. See my post on Requirements and Validators on this blog for a longer exposition on this whole matter.

    I'm not sure if there is a better solution than the current “if the operators didn't bother with s_region, the dataset's FoV will be ignored“. If you have good ideas, by all means let me know.

    Followup (2024-11-28)

    I've given a talk at the Malta interop giving another view on this matter: VO at the limit.

    [1]

    If you want to try this (in particular without clobbering your “normal” pyVO), do something like this:

    virtualenv --system-site-packages global-datasets
    . global-datasets/bin/activate
    cd global-datasets
    git clone https://github.com/msdemlei/pyvo
    cd pyvo
    git checkout global-datasets
    pip install .
    

Page 1 / 5 »