Posts with the Tag Semantics:

  • A New Constraint Class in PyVO's Registry API: UAT

    [Image: a scan of a book page with astronomy-relevant topics ranging from “Cronometrie” to “Kosmologie, Relativitätstheorie”, overlaid with a title page stating “Astronomischer Jahresbericht. Die Literatur des Jahres 1967”.]

    This was how they did what I am talking about here almost 60 years ago: a page from the table of contents of the “Astronomischer Jahresbericht” for 1967, the last volume before it was turned into the English-language Astronomy and Astrophysics Abstracts, which were the main tool for literature work in astronomy until the ADS came along in the late 1990s.

    I have recently created a pull request against pyVO to furnish the library with a new constraint to search for data and services: search by a concept drawn from the Unified Astronomy Thesaurus (UAT). This is not entirely different from the classical search by subject keywords that everyone did before we had the ADS – which is what I am trying to illustrate above. But it has some twists that, I would argue, still make it valuable even in the age of full-text indexes.

    To make my argument, let me first set the stage.

    Thesauri and the UAT

    (Disclaimer: I am currently a member of the UAT steering committee and therefore cannot claim neutrality. However, I would not claim neutrality otherwise, either: the UAT is not perfect, but it's already great.)

    Librarians (and I am one at heart) love thesauri. Or taxonomies. Or perhaps even ontologies. What may sound like things out of a Harry Potter novel are actually ways to organise a part of the world (a “domain”) into “concepts”. If you are suitably minded, you can think of a “concept” as a subset of the domain; “suitably minded” here means that you consider the world as a large set of things and a domain a subset of this world. The IVOA Vocabularies specification contains some additional philosophical background on this way of thinking in sect. 5.2.4.

    On the other hand, if you are not suitably minded, a “concept” is not much different from a topic.

    There are differences in how each of thesaurus, taxonomy, and ontology does that organising (and people don't always agree on the differences). Ontologies, for instance, let you link concepts in every way, as in “a (bicycle) (is steered) (using) a (handle bar) (made of) ((steel) or (aluminum))”; every parenthesised phrase would be a node (which is a better term in ontologies than “concept”) in a suitably general ontology, and connecting these nodes creates a fine-grained representation of knowledge about the world.

    That is potentially extremely powerful, but also almost too hard for humans. Check out WordNet for how far one can take ontologies if very many very smart people spend very many years.

    Thesauri, on the other hand, are not as powerful, but they are simpler and within reach for mere humans: there, concepts are simply organised into something like a tree, perhaps (and that is what many people would call a taxonomy) using is-a relationships: A human is a primate is a mammal is a vertebrate is an animal. The UAT actually uses the somewhat vaguer notions of “narrower” and “wider”. This lets you state useful if somewhat loose relationships like “asteroid-rotation is narrower than asteroid-dynamics”. For experts: the UAT uses a formalism called SKOS; but don't worry if you can't seem to care.
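
    In code, you can picture such a thesaurus as little more than a mapping from each concept to its wider and narrower neighbours. A toy sketch in Python, using just the two concepts from the example above:

    toy_thesaurus = {
      "asteroid-dynamics": {
        "wider": [], "narrower": ["asteroid-rotation"]},
      "asteroid-rotation": {
        "wider": ["asteroid-dynamics"], "narrower": []},
    }

    # “asteroid-rotation is narrower than asteroid-dynamics”:
    assert "asteroid-rotation" in toy_thesaurus[
      "asteroid-dynamics"]["narrower"]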

    The UAT is standing on the shoulders of giants: before it, there was the IAU thesaurus in 1993, and an astronomy thesaurus was also produced under the auspices of the IVOA. And then there were (and to some extent still are) the numerous keyword schemes designed by journal publishers that would also count as some sort of taxonomy of astronomy.

    “Numerous” is not good when people have to assign keywords to their journal articles: If A&A use something drastically or only subtly different from ApJ, and MNRAS still something else, people submitting to multiple journals will quite likely lose their patience and diligence with the keywords. For reasons I will discuss in a second, that is a shame.

    Therefore, at least the big American journals have now all switched to using UAT keywords, and I sincerely hope that their international counterparts will follow their example where that has not already happened.

    Why Keywords?

    Of course, you can argue that when you can do full-text searches, why would you even bother with controlled keyword lists? Against that, I would first argue that it is extremely useful to have a clear idea of what a thing is called: For example, is it delta Cephei stars, Cepheids, δ Cep stars or still something else? Full text search would need to be rather smart to be able to sort out terminological turmoil of this kind for you.

    And then you would still not know if W Virginis stars (or should you say “Type II Cepheids”? You see how useful proper terminology is) are included in whatever your author called Cepheids (or whatever they called it). Defining concepts as precisely as possible thus is already great.

    The keyword system becomes even more useful when the hierarchy we see in the Cepheid example becomes visible to computers. If a computer knows that there is some relationship between W Virginis stars and classical Cepheids, it can, for instance, expand or refine your queries (“give me data for all kinds of Cepheids”) as necessary. To give you an idea of how this looks in practice, here is how SemBaReBro displays the Cepheid area in the UAT:

    [Image: arrows between terms like “Type II Cepheid variable stars”, “Cepheid variable stars”, and “Young disk Cepheid variable stars”.]

    In that image, only concepts associated with resources in the Registry have a spiffy IVOA logo; that so few VO resources claim to deal with Cepheids tells you that our data providers can probably improve their annotations quite a bit. But that is for another day; the hope is that as more people search using UAT concepts, the data providers will see a larger benefit in choosing them wisely[1].

    By the way, if you are a regular around here, you will have seen images like that before; I have talked about Sembarebro in 2021 already, and that post contains more reasons for having and maintaining vocabularies.

    Oh, and for the definitions of the concepts, you can (in general; in the UAT, there are still a few concepts without definitions) dereference the concept URI, which in the VO is always of the form <vocabulary uri>#<term identifier>, where the vocabulary URI starts with http://www.ivoa.net/rdf, after which there is the vocabulary name.

    Thus, if you point your web browser to https://www.ivoa.net/rdf/uat#cepheid-variable-stars[2], you will learn that a Cepheid is:

    A class of luminous, yellow supergiants that are pulsating variables and whose period of variation is a function of their luminosity. These stars expand and contract at extremely regular periods, in the range 1-50 days [...]
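
    You do not need the browser, either: pyVO has a little helper that fetches IVOA vocabularies in their machine-readable serialisation, with the definitions coming along with the terms. A sketch (I am taking the key names from Vocabularies in the VO 2, and you need a reasonably recent pyVO for the vocabularies module):

    >>> from pyvo.utils import vocabularies
    >>> uat = vocabularies.get_vocabulary("uat")
    >>> # each term has label, description, wider, and narrower keys
    >>> print(uat["terms"]["cepheid-variable-stars"]["description"])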

    The UAT constraint

    Remember? This was supposed to be a blog post about a new search constraint in pyVO. Well, after all the preliminaries I can finally reveal that once pyVO PR #649 is merged, you can search by UAT concepts:

    >>> from pyvo import registry
    >>> print(registry.search(registry.UAT("variable-stars")))
    <DALResultsTable length=2010>
                  ivoid               ...
                                      ...
                  object              ...
    --------------------------------- ...
             ivo://cds.vizier/b/corot ...
              ivo://cds.vizier/b/gcvs ...
               ivo://cds.vizier/b/vsx ...
              ivo://cds.vizier/i/280b ...
               ivo://cds.vizier/i/345 ...
               ivo://cds.vizier/i/350 ...
                                  ... ...
                ivo://cds.vizier/v/97 ...
             ivo://cds.vizier/vii/293 ...
       ivo://org.gavo.dc/apass/q/cone ...
    ivo://org.gavo.dc/bgds/l/meanphot ...
         ivo://org.gavo.dc/bgds/l/ssa ...
         ivo://org.gavo.dc/bgds/q/sia ...
    

    In case you have never used pyVO's Registry API before, you may want to skim my post on that topic before continuing.

    Since the default keyword search also queries RegTAP's res_subject table (which is what this constraint is based on), this is perhaps not too exciting. At least there is a built-in protection against typos:

    >>> print(registry.search(registry.UAT("varialbe-stars")))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/msdemlei/gavo/src/pyvo/pyvo/registry/rtcons.py", line 713, in __init__
        raise dalq.DALQueryError(
    pyvo.dal.exceptions.DALQueryError: varialbe-stars does not identify an IVOA uat concept (see http://www.ivoa.net/rdf/uat).
    

    It becomes more exciting when you start exploiting the intrinsic hierarchy; the constraint constructor supports optional keyword arguments expand_up and expand_down, giving the number of levels of parent and child concepts to include. For instance, to discover resources talking about any sort of supernova, you would say:

    >>> print(registry.search(registry.UAT("supernovae", expand_down=10)))
    <DALResultsTable length=593>
                     ivoid                   ...
                                             ...
                     object                  ...
    ---------------------------------------- ...
                       ivo://cds.vizier/b/sn ...
                     ivo://cds.vizier/ii/159 ...
                     ivo://cds.vizier/ii/189 ...
                     ivo://cds.vizier/ii/205 ...
                    ivo://cds.vizier/ii/214a ...
                     ivo://cds.vizier/ii/218 ...
                                         ... ...
               ivo://cds.vizier/j/pasp/122/1 ...
           ivo://cds.vizier/j/pasp/131/a4002 ...
               ivo://cds.vizier/j/pazh/30/37 ...
              ivo://cds.vizier/j/pazh/37/837 ...
    ivo://edu.gavo.org/eurovo/aida_snconfirm ...
                    ivo://mast.stsci/candels ...
    

    There is no overwhelming magic in this, as you can see when you tell pyVO to show you the query it actually runs:

    >>> print(registry.get_RegTAP_query(registry.UAT("supernovae", expand_down=10)))
    SELECT
      [crazy stuff elided]
    WHERE
    (ivoid IN (SELECT DISTINCT ivoid FROM rr.res_subject WHERE res_subject in (
      'core-collapse-supernovae', 'hypernovae', 'supernovae',
      'type-ia-supernovae', 'type-ib-supernovae', 'type-ic-supernovae',
      'type-ii-supernovae')))
    GROUP BY [whatever]
    

    Incidentally, some services have an ADQL extension (a “user defined function” or UDF) that lets you do these kinds of things on the server side; that is particularly nice when you do not have the power of Python at your fingertips, as for instance interactively in TOPCAT. This UDF is:

    gavo_vocmatch(vocname STRING, term STRING, matchagainst STRING) -> INTEGER
    

    (documentation at the GAVO data centre). There are technical differences, some of which I try to explain in a moment. But if you run something like:

    SELECT ivoid FROM rr.res_subject
    WHERE 1=gavo_vocmatch('uat', 'supernovae', res_subject)
    

    on the TAP service at http://dc.g-vo.org/tap, you will get what you would get with registry.UAT("supernovae", expand_down=1). That UDF also works with other vocabularies. I particularly like the combination of product-type, obscore, and gavo_vocmatch.
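
    For instance, here is a sketch of the combination just mentioned; I am assuming that spectrum is a concept in the product-type vocabulary and that ivoa.obscore with its standard columns is present on the service (the ADQL inside the string is, of course, what you would paste into TOPCAT's TAP window):

    import pyvo

    tap = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    # match obscore's dataproduct_type against the product-type
    # vocabulary, including concepts one level narrower than spectrum
    print(tap.search("""
      SELECT TOP 5 obs_title, dataproduct_type
      FROM ivoa.obscore
      WHERE 1=gavo_vocmatch('product-type', 'spectrum',
        dataproduct_type)"""))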

    If you wonder why gavo_vocmatch does not go on expanding towards narrower concepts as far as it can go: That is because what pyVO does is semantically somewhat questionable.

    You see, SKOS' notions of what is wider and narrower are not transitive. This means that just because A is wider than B and B is wider than C it is not certain that A is wider than C. In the UAT, this sometimes leads to odd results when you follow a branch of concepts toward narrower concepts, mostly because narrower sometimes means part-of (“Meronymy”) and sometimes is-a (“Hyponymy”). Here is an example discovered by my colleague Adrian Lucy:

    interstellar-medium wider nebulae wider emission-nebulae wider planetary-nebulae wider planetary-nebulae-nuclei

    Certainly, nobody would argue that the central stars of planetary nebulae somehow are a sort of, or are part of, the interstellar medium, although each individual relationship in that chain makes sense as such.

    Since SKOS relationships are not transitive, gavo_vocmatch, being a general tool, has to stop at one level of expansion. By the way, it will not do that for the other flavours of IVOA vocabularies, which have other (transitive) notions of narrower-ness. With the UAT constraint, I have fewer scruples, in particular since the expansion depth is under user control.

    Implementation

    Talking about technicalities, let me use this opportunity to invite you to contribute your own Registry constraints to pyVO. They are not particularly hard to write if you know both ADQL and Python. You will find several examples – from trivial to service-sensing complex – in pyvo.registry.rtcons. The code for UAT looks like this (documentation removed for clarity[3]):

    class UAT(SubqueriedConstraint):
        _keyword = "uat"
        _subquery_table = "rr.res_subject"
        _condition = "res_subject in {query_terms}"
        _uat = None
    
        @classmethod
        def _expand(cls, term, level, direction):
            result = {term}
            new_concepts = cls._uat[term][direction]
            if level:
                for concept in new_concepts:
                    result |= cls._expand(concept, level-1, direction)
            return result
    
        def __init__(self, uat_keyword, *, expand_up=0, expand_down=0):
            if self.__class__._uat is None:
                self.__class__._uat = vocabularies.get_vocabulary("uat")["terms"]
    
            if uat_keyword not in self._uat:
                raise dalq.DALQueryError(
                    f"{uat_keyword} does not identify an IVOA uat"
                    " concept (see http://www.ivoa.net/rdf/uat).")
    
            query_terms = {uat_keyword}
            if expand_up:
                query_terms |= self._expand(uat_keyword, expand_up, "wider")
            if expand_down:
                query_terms |= self._expand(uat_keyword, expand_down, "narrower")
    
            self._fillers = {"query_terms": query_terms}
    

    Let me briefly describe what is going on here. First, we inherit from the base class SubqueriedConstraint. This is a class that takes care that your constraints are nicely encapsulated in a subquery, which generally is what you want in pyVO. Calmly adding natural joins as recommended by the RegTAP specification is a dangerous thing for pyVO because as soon as a resource matches your constraint more than once (think “columns with a given UCD”), the RegistryResult lists in pyVO will turn funny.

    To make a concrete SubqueriedConstraint, you have to fill out:

    • the table it will operate on, which is in the _subquery_table class attribute;
    • an expression suitable for a WHERE clause in the _condition attribute, which is a template for str.format. This is often computed in the constructor, but here it is just a constant expression and thus works fine as a class attribute;
    • a mapping _fillers mapping the substitutions in the _condition string template to Python values. PyVO's RegTAP machinery will worry about making SQL literals out of these, so feel free to just dump Python values in there. See make_SQL_literal for what kinds of types it understands and expand it as necessary.

    There is an extra class attribute called _keyword. This is used by the pyvo.regtap machinery to let users say, for instance, registry.search(uat="foo.bar") instead of registry.search(registry.UAT("foo.bar")). This is a fairly popular shortcut when your constraints can be expressed as simple strings, but in the case of the UAT constraint you would be missing out on all the interesting functionality (viz., the query expansion that is only available through optional arguments to its constructor).
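
    To make the recipe concrete, here is what a minimal constraint of this kind might look like. This is a hypothetical sketch rather than anything that exists in pyVO; the class name and keyword are my inventions:

    from pyvo.registry import rtcons

    class ExactSubject(rtcons.SubqueriedConstraint):
        # hypothetical: match one subject keyword literally, without
        # any vocabulary validation or query expansion
        _keyword = "exact_subject"
        _subquery_table = "rr.res_subject"
        _condition = "res_subject = {subject}"

        def __init__(self, subject):
            self._fillers = {"subject": subject}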

    This particular class has some extra logic. For one, we cache a copy of the UAT terms on first use at the class level. That is not critical for performance because caching already happens at the level of get_vocabulary; but it is convenient when we want query expansion in a class method, which in turn to me feels right because the expansion does not depend on the instance. If you don't grok the __class__ magic, don't worry. It's a nerd thing.

    More interesting is what happens in the _expand class method. This takes the term to expand, the number of levels to go, and whether to go up or down in the concept trees (which are of the computer science sort, i.e., with the root at the top) in the direction argument, which can be wider or narrower, following the names of properties in Desise, the format we get our vocabulary in. To learn more about Desise, see section 3.2 of Vocabularies in the VO 2.

    At each level, the method now collects the wider or narrower terms, and if there are still levels to include, calls itself on each new term, just with level reduced by one. I consider this a particularly natural application of recursion. Finally, everything coming back is merged into a set, which then is the return value.

    And that's really it. Come on: write your own RegTAP constraints, and also have fun with vocabularies. As you see here, it's really not that magic.

    [1]Also, just so you don't leave with the impression I don't believe in AI tech at all, something like SciX's KAILAS might also help improve Registry subject keywords.
    [2]Yes, in a little sleight of hand, I've switched the URI scheme to https here. That's not really right, because the term URIs are supposed to be opaque, but some browsers currently forget the fragment identifiers when the IVOA web server redirects them to https, and so https is safer for this demonstration. This is a good example of why the web would be a better place if http had been evolved to support transparent, client-controlled encryption (rather than inventing https).
    [3]I've always wanted to write this.
  • DaCHS 2.5: Check your UCDs

    [Image: the DaCHS logo on top of a map of UCDs]

    In the background of the DaCHS 2.5 release picture: UCDs grabbed from the Registry. The factual background: DaCHS 2.5 will now moan at you when you invent or mistype UCDs.

    This afternoon, I have released DaCHS 2.5. As usual, I will discuss the more important changes in a blog post – this one.

    A change many of you will not like too much is that DaCHS now validates UCDs you give it, and it will warn you when you do not follow the UCD rules. This may seem like nit-picking, but as blind discovery is on the verge of becoming usable in the VO, making sure these strings actually are what they should be is becoming operationally important: If I want to find resources that give errors for their photometry, I have to know whether it's stat.error;phot.mag.b or phot.mag.b;stat.error, or else I will miss half the resources out there.
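
    If you would like to check a UCD yourself from Python, astropy's UCD parser gives you a quick (if not necessarily DaCHS-identical) verdict; this is just an illustration, not what DaCHS runs internally:

    >>> from astropy.io.votable import ucd
    >>> ucd.check_ucd("stat.error;phot.mag;em.opt.B",
    ...   check_controlled_vocabulary=True)
    True
    >>> ucd.check_ucd("stat.oops;phot.mag",
    ...   check_controlled_vocabulary=True)
    False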

    So, I'm sorry if DaCHS starts complaining about half of your RDs after you update, but it's for a good cause. And don't feel bad about the complaints: DaCHS complained about close to half of my RDs after I had put in that feature.

    By the way, this comes as part of a larger effort on the side of the Operations IG to improve the validity of UCDs and units in the VO, an effort that has unearthed bugs in the SSAP and SLAP specifications in that they require UCDs forbidden by the UCD standard. DaCHS 2.5 still follows SSAP and SLAP, and hence external tools like stilts will protest because of bad UCDs even if DaCHS is happy. Errata for the specifications are being worked on, and once they are accepted, DaCHS and stilts will finally agree on UCD validity, or so I hope.

    Code-wise, a much more intrusive change was that asynchronous services (in particular, async TAP) now use the same formalism for parsing parameters as their synchronous counterparts. It may seem odd that that hasn't been the case up to now, but there were good reasons for that; for instance, with async, people can post incomplete parameter sets that would be rejected by normal sync processing.

    Unless you are running User UWS services, you should not notice anything. If you do run User UWS services, please contact me before upgrading. I would like to work with you on how these should look in the future.

    Another change that might break your services is that DaCHS now actually complies with VOUnits, which has always forbidden whitespace of all kinds in unit strings. DaCHS, on the other hand, has foolishly encouraged putting whitespace between scale factors and pure units, as in 1e-10 m. That's not interoperable, and hence DaCHS now rejects such units. This may lead to hidden failures when dachs val doesn't notice something is a unit, and things only break during execution. I'm aware of one place where that's relevant: spectral cutout services that need to know the spectral unit. If you're running those, make double sure that the spectralUnit in the SSAP mixin does not contain any whitespace. It's 0.1nm according to VOUnits, not 0.1 nm.

    An update that should silently make your services more compliant is that DaCHS' representation of EPN-TAP is updated to what is currently under IVOA review. After you upgrade, DaCHS will try to update your EPN tables' metadata, which in turn should make stilts taplint a lot happier. It will also make DaCHS pass on the new, IVOA table utype to the Registry, which is how people should in the future find EPN-TAP data.

    DaCHS now also contains some code that may help you import data from HDF5 files. For one, there is the HDF5 grammar, which rather directly pulls data from HDF5s written by astropy or vaex. But, really: HDF5 is a rather low-level format not particularly well suited for relational data, and it is virtually impossible to write generic code for doing something sensible with it. The two flavours DaCHS supports have very little in common, and it is therefore almost certain that if you have HDF5s coming from somewhere else, hdf5Grammar will not understand them. Still, let us know what you've got, we may be able to put support for it in.

    The HDF5 grammar is written in Python, and thus imports perhaps a few thousand rows per second. For gigarow-sized data collections, that's nowhere near fast enough, and hence for vaex-written HDF5s, there is booster support. As before, if you have bulk data in HDF5 that you want to put into a database and that was not written by vaex, let us know and we'll see what we can do.

    A surprisingly minor change enabled DaCHS to deal with materialised views, database views that are turned into actual tables by postgres. See the corresponding section in the tutorial for how you can use them. We do not have any materialised views in our Heidelberg data center yet. So, if you use them and notice something is clunky, your feedback is particularly appreciated.

    There are many smaller changes and improvements; let me mention what the changelog euphemistically calls “better systemd integration”, which really means that so far systemctl restart dachs simply didn't do anything at all. Apologies. And shame on everyone who was bewildered but failed to report this to dachs-support.

    Also, you can use float arrays in boosters now, and DaCHS' ADQL has just learned about COALESCE. That's a SQL feature that lets you deal sensibly with NULLs in some cases: COALESCE(arg1, arg2, ...) will return the first non-NULL argument it encounters. That may sound like a slightly exotic function. Until you need it, at which point you wonder how ADQL could reach its ripe age without COALESCE.
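
    To see why you would want it, here is a sketch (the table and column names are made up for illustration): for stars lacking a proper motion, this returns zero rather than NULL, which you can then safely feed into arithmetic:

    import pyvo

    tap = pyvo.dal.TAPService("http://dc.g-vo.org/tap")
    # mystars.data, star_id, and pmra are hypothetical stand-ins
    print(tap.search("""
      SELECT star_id, COALESCE(pmra, 0) AS pmra_or_zero
      FROM mystars.data"""))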

    Finally, let me mention something that is not part of the release, though it is DaCHS-related and is new since the last release: I have cleaned up the access log processing machinery we have used in Heidelberg for the past 15 years or so, and I have packaged it up for general consumption. It is, of course, a DaCHS RD that you can just check out and use in your own DaCHS installation if you have to keep access logs and want to do that with at least some basic respect for your users' rights. See http://docs.g-vo.org/DaCHS/tutorial.html#access-logs for details.

  • The 2021 Southern Spring Interop

    [Image: a Venn diagram of product types that just doesn't work.]

    A contribution for the “things that didn't work out” (“Arbeiten, die zu keiner Lösung geführt haben”) section in our reports to BMBF: an attempt to systematise product types at the last Interop. I've made a new proposal at this Interop, and there is reason to hope it will fare better.

    Last night, the second IVOA Interop conference of 2021 came to an end; I'm calling it “southern spring” because notionally, it happened in Cape Town, back to back with this year's ADASS. In reality, it was again an online event, and so, in keeping with the tradition established in pandemic times, the closing event was around midnight CET. I cannot say I will miss these late-night events, although I would not go as far as some people at the conference who quipped they'd prefer the airport security checks to having to sit through another zoom marathon.

    My contributions at this Interop again had a clear focus on semantics, for instance with my public confession that my attempt to systematise “product types” at the last Interop was entirely misguided; trying to force concepts like “time series”, “spectrum” or “image” into a tree does not lead to anything that actually works for what this is intended to do, that is, helping people find the sort of data they are after for a particular purpose, or helping clients route data products to other clients better suited to process them. I will now try a restart using SKOS, a plan that was met with a lot more agreement than that previous attempt. Some entertainment on the side was provided by the realisation that a “time-image cube” is normally called a movie. The next time I take on moving pictures, I will find out what people say when I claim to investigate a time cube.

    Another talk that took up a topic from the last Interop's Semantics session was about making an IVOA vocabulary of object types based on the work done within the CDS over the last 40 years or so. This certainly is just the beginning of a longer effort, not the least because the current concepts severely fall short in the area of the solar system. But it's a start, and there's plenty of time to elaborate this before it will go through a review, presumably with the next version of Obscore.

    Also semantics-related, but over in the session of the Operations interest group, Mark Taylor reported on his activities to evaluate the standards adherence of semantics information in published tables. This activity is what had triggered me to make DaCHS validate UCDs assigned to columns in summer, something that I expect will result in quite a few diagnostics when DaCHS operators upgrade to DaCHS 2.5 (expected for November). But that's fine: making it more likely that computers will actually recognise, say, an error in proper motion for what it is is undoubtedly a good thing. I'm therefore glad that there are almost a million “good” UCDs out there and a lot fewer somehow “bad” ones. I had expected much worse after my realisation that my own annotations left a lot to be desired in summer. By now, the only bad UCDs I'm still pushing out are the ones mandated by SSAP and SLAP. The contradictions between those standards and UCD are going to be addressed with Errata in the coming months.

    My talk in the third Apps session on Thursday afternoon still had some relationship with Semantics; it was a quick show and tell on the enhancements to WIRR I had reported on here in July, and it in particular showcased obtaining UCD constraints by full-text searching the rr.table_column table in my RegTAP service and the selection through UAT concepts. Satisfyingly in some way, it was these topics that people took up in the discussion after the talk. Less satisfyingly, people playing with the thing afterwards turned up something that has the alarming taste of a bug in the new MOC operations in pgsphere. Ouch.

    This segues into the realm of Registry, where there was no actual session but a rather well-attended side meeting in the gathertown instance we could take over from ADASS (that, incidentally, was substantially better attended than during the previous meetings). There, I mainly presented (and explained) my proposed changes to pyVO's registry interface currently living in a private branch in my fork on github. I will write a bit more on that around the time I will turn that into a PR.

    Another outcome of this was that there was some interest to turn the note on documents in the Registry – which is what feeds VOTT – into either an endorsed note or perhaps a Recommendation of the Registry WG.

    My fourth “proper” (in the rather twisted sense of: in a zoom session) talk was an attempt to finally do something about the problems pointed out in my caproles note lamenting that our current service registration patterns are fundamentally flawed. It proposed some ways to get VOSI availability fixed, and the outcome was that we probably will drop what we currently require in that field, not the least because these requirements are cheerily ignored by 98% of the resources in the Registry.

    Those were again three fairly long days, usually starting with sessions around 7:00 CET and ending with sessions around midnight. Which is clearly not healthy. But on the other hand, it somehow does convey a physical sense of the global nature of the Virtual Observatory, on which people in many, many time zones work. And that, I have to say, still is something I do appreciate.

  • GAVO at the Northern Spring Interop 2021

    As usual in May, the people making the Virtual Observatory happen meet for their Interoperability Conference, better known as the Interop – where “meet” still has to be taken with a generous helping of salt (more on this near the end of this post). As has become customary on this blog, let me briefly discuss contributions with a significant involvement of GAVO.

    A major thing from my perspective actually happened in the run-up: The IVOA executive committee (“Exec”) approved Version 2.0 of Vocabularies in the VO, a standard saying how hierarchical word lists (“vocabularies”) can be managed, disseminated, and consumed within the VO. Developing the main ideas from sufficiently restricting RDF to coming up with desise (which makes complicated things possible with surprisingly little code), and trying things out on our growing number of vocabularies took up quite a bit of my standards time in the last 20 months or so – and I'm fairly happy with the outcome, which I celebrated with a brief talk on programming with IVOA semantics during Wednesday morning's semantics session.

    In that session I gave a second, more discussion-oriented, talk, probing how to formalise data product types – which is surprisingly involved, even with the relatively straightforward use case “figure out a programme to handle the data”: What's a spectrum? Well, something that maps a spectral coordinate to... hm. Is it still a spectrum if there are multiple sorts of values (perhaps flux, magnitude, and polarisation)? If we allow, in effect, tuples, why not whole images, which would make spectral cubes spectra – but of course few client programmes that deal with spectra do anything useful with cubes, so clearly such a definition would kill our use case. And what about slit spectra, mapping a spatial coordinate to spectra?

    All this of course is reminiscent of the classical problems of semantics: An elephant is a big animal with a trunk. But when an elephant loses its trunk in an accident: does it stop being an elephant? So, much of the art here is finding the sweet spot of usability between strict and formal semantics (that will never fit the real world) and just tossing around loosely defined strings (that will simply not be machine-readable). After the session, I came up with the 2021-05-26 draft of product-type. If you read this a few years down the road, it might be interesting to compare with what product-type is today. I'm curious myself.

    Later on Wednesday CET, I did a shameless plug for my Datalink-transforming XSLT (apologies for a github link, but I'm fishing for PRs here; if you use DaCHS, you'll get the updated stuff with version 2.4, due soon). The core of this dates back to the dawn of datalink, but with a new graphical cutout code and in particular vocabulary-based tree-ification of the result rows, I figured it's time to remind the operators of datalink services it's still out there for them to take up. Perhaps more than from the slides, you can see what I am after here by just trying the Datalink examples I've collected for this talk and comparing document source, the appearance without Javascript (pure XSLT) and the appearance with Javascript (I'm a bit ashamed I'm relying so heavily on it, but much of this really can only be done client-side).

    Quite a bit after midnight my time (still Thursday UTC), Mark Taylor talked about Software Identification, something I've been working on with him recently. It is one of the things that are short and trivial but that, when unregulated, just don't work; in this case it's servers and clients saying what they are when they speak HTTP. I stumbled into the problem while trying to locate severely outdated DaCHS installations – so, in a way, I put effort into the Note Mark was talking about (and which I have just uploaded to the IVOA Document Repository) as a sort of penance.

    While I was already asleep when Mark gave his talk, I was back at the Interop Friday morning CEST, when Hendrik Heinl talked about the LOFAR TAP service (which, I'm proud to say, runs on top of DaCHS); this was mainly live operations in TOPCAT (which is why there's no exciting slides), but Hendrik used a pyVO script doing cutouts in an (optical) mosaic of the Fornax cluster built on top of – and that's the main point – Datalink and SODA. Working this out with Hendrik made me realise the documentation of Datalink in pyVO really needs… love. Or, better, work.

    Later on Friday, there was the Registry session, where I gave brief (and somewhat cramped) talks on advanced column metadata (which is intended to one day let you query the registry for things like “roughly complete to 18 mag” or “having objects out to redshift 4”) and how to put VODataService 1.2 coverage into RegTAP – I expect you'll read more on both topics on this blog as they mature to a level at which this can leave the Registry nerd circles.

    And now, about 10 pm on Friday, the meeting is slowly winding down; beyond all the talks (which were, regrettably for a free software spirit like me, on zoom), the real bonus was that there was a gather.town attached to the conference. Now, that's a closed, proprietary, non-self-hostable platform, too, and so I have all reason to grumble. But: for the first time since February 2020 it felt like a conference, with the most useful action happening outside of the lecture halls, from trying to reach consensus on VEP-006 to teaching DaCHS datalink service declaration to learning about working with visibilities coming from VLBI (where it's even more difficult than it is with the big antenna arrays). So… this one time I've made my peace with proprietary platforms.

    A propos of “say no to platforms” (in this case, slack): due to the recent troubles with freenode, last week, in addition to the Interop, saw the GAVO IRC channel move to libera.chat (where it's still #gavo). So, for instant messaging us now that the Interop is (in effect) over: come there.

  • Semantics, Cross-Discipline Discovery, and Down-To-Earth Code

    [Image: a boxes-and-arrows view of the UAT]

    A tiny piece of the Unified Astronomy Thesaurus as viewed by Sembarebro – the IVOA logos sit on terms that have VO resources on them.

    Sometimes people ask me (in particular when I'm wearing my hat as the current chair of the IVOA Semantics working group) “well, what's this semantics thing good for?” There are many answers, but here's one that nicely meshes with my pet subject data discovery: You want hierarchical, agreed-upon word lists to bridge discipline gaps.

    This story starts with B2FIND, a cross-disciplinary metadata aggregator for science data run within the framework of the European Open Science Cloud (EOSC). GAVO (or, more precisely, Heidelberg University's Astronomy) is involved in the EOSC via the ESCAPE project, and so I have had the pleasure of interacting with B2FIND for a while now. In particular, they are harvesting the metadata records of the Virtual Observatory Registry from us.

    This of course requires a bit of mapping, because the VO's metadata formats (VOResource, VODataService, and several extensions; see 2014A&C.....7..101D to learn more) are far too fine-grained for the wider scientific public. Not even our good friends from high-energy physics would appreciate being served links to, say, TAP endpoints (yet!). So, on our end we're mapping to the Datacite metadata kernel, which from VOResource is just a piece of XSL away (plus some perhaps debatable conventions).

    But there's more to this mapping, such as vocabularies of subject keywords. You might argue that in the age of rapid full text searches, keywords are dead. I would beg to disagree. For example, with good, hierarchical keyword systems you can, among many other useful things, offer topical browsing of metadata repositories. While it might not quite qualify as “useful” yet, the SemBaReBro registry browser I've hacked together late last year would be an example for such facilities – and might become part of our WIRR Registry searching tool one day.

    On the topic of subject keywords VOResource says that resources in the VO should be using the Unified Astronomy Thesaurus, specifically in its IVOA incarnation (not quite true yet, but true enough by blog standards). While few do, I've done a mapping of existing keywords in the VO to UAT concepts, which is what's behind SemBaReBro. So: most VO resources now have UAT concepts.

    However, these include concepts like AM Canum Venaticorum Stars, which outside of rather specialised circles of astronomers few people will ever have heard about (which, don't get me wrong, I personally regret – they're funky star systems). Hence, B2FIND does not bother with those.

    When we discussed the subject mapping for B2FIND, we thought using the UAT's top-level concepts might be a good start. However, at that point no VO resources at all actually used these, and, indeed, within astronomy that generally wouldn't make a lot of sense, because they are too unspecific to help much within the discipline. I postponed and then forgot about the problem – when the keywords of the resources weren't even from the UAT, solving the granularity mismatch just wasn't humanly possible.

    That was the state of affairs until last Tuesday, when I had a mumble session with B2FIND folks and the topic came up again. And now, thanks partly to the new desise format proposed in the current Vocabularies in the VO 2 draft, things fell nicely into place: Hey, I have UAT concepts, and mapping these to the top-level terms isn't hard either any more.

    So, B2FIND gets the toplevel keywords they've been expecting all the time starting today. Yes: This isn't a panacea suddenly solving all the problems of cross-discipline data discovery, not the least because it's harder than one might think to imagine how such a thing would look in practice. But given the complexities involved I was positively surprised how easy this particular part of the equation was.

    From here on, there's a bit of tech babble I intend to re-use in the RFC of Vocabularies in the VO 2; don't feel bad if you skip it.

    The first step was to make the mapping from UAT terms to the toplevel terms. The interesting part of the source I'm linking to here is:

    def get_roots_for(term, uat_terms):
      roots, seen = set(), set()

      def follow(t):
        if t in seen:
          # defend against cycles in the concept graph
          return
        seen.add(t)

        wider = uat_terms[t]["wider"]
        if not wider:
          if t not in ROOT_TERMS:
            raise Exception(
              f"{t} is detached from the top-level terms")
          roots.add(t)
        else:
          for w in wider:
            follow(w)

      follow(term)
      return roots
    

    There, uat_terms is essentially just a json-decode of what you get from the vocabulary URI if you ask for desise (see the draft spec linked to above for the technicalities). That's really it, and it even defends against cycles in the concept graph (which are legal by SKOS but shouldn't happen in the UAT) and detached terms (i.e., ones that are not rooted in the top-level terms). For what it does, I claim that's remarkably compact code.
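
    In case you would like to reproduce that without any VO library: the desise serialisation is available from the vocabulary URI through plain HTTP content negotiation. A sketch, with the media type taken from the Vocabularies in the VO 2 draft:

    import requests

    resp = requests.get("https://www.ivoa.net/rdf/uat",
      headers={"Accept": "application/x-desise+json"})
    uat_terms = resp.json()["terms"]
    print(uat_terms["cepheid-variable-stars"]["wider"])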

    Once I had that, I needed to get the UAT-mapped subject keywords for the records I'm serving to datacite and fiddle the corresponding roots back in. That's technically a bit more involved because I am producing the datacite records on the fly from the XML representation for VOResource records that I keep in the database, and there's a bit of namespace magic involved (full code). Plus, the UAT-mapped keywords are only kept in the database, not in the metadata records.

    Still, the core operation here is relatively straightforward. Consider:

    def addUATToplevels(dataciteTree):
      # dataciteTree is an (lxml) ElementTree for the
      # result of the XSL transformation.  That's all
      # I have, and thus I first have to fiddle out
      # the identifier we are talking about
      ivoid =  dataciteTree.xpath(
          "//d:alternateIdentifier["
          "@alternateIdentifierType='ivoid']",
          namespaces={"d": DATACITE_NS}
        )[0].text.lower()
      # The .lower() is necessary because ivoids
      # unfortunately are case-insensitive, and RegTAP
      # normalises them to lowercase to retain sanity.
    
      # Now pull the UAT-mapped subject keywords from
      # our RegTAP extension (getTableConn is
      # DaCHS-internal API, but there's no magic in
      # there, it's just connection pooling with
      # guarantees against connections  idle in
      # transaction).
      with base.getTableConn() as conn:
        subjects = set(r[0] for r in
          conn.query("SELECT uat_concept"
            " FROM rr.subject_uat"
            " WHERE ivoid=%(ivoid)s", locals()))
    
      # This is the mapping itself: we do
      # roots-subjects to avoid adding
      # root terms that are already in
      # the record itself.  UAT_TOPLEVELS is the result
      # of the root finding discussed above.
      newRoots = set()
      for term in subjects:
        root = UAT_TOPLEVELS[term]
        newRoots |= (root-subjects)
    
      # And finally fiddle in any new root terms found
      # into the datacite tree
      if newRoots:
        subjects = dataciteTree.xpath(
          "//d:subjects",
          namespaces={"d": DATACITE_NS})[0]
        for root in newRoots:
          newSubject = etree.SubElement(subjects,
            f"{{{DATACITE_NS}}}subject")
          newSubject.text = root
    

    Apart from the technicalities I'd again say that's pretty satisfying code.

    And these two pieces of code are really all I had to do to map between the vocabularies of different granularities – which I claim will probably be the norm as metadata flows between disciplines.

    It's great to see the pieces of a fairly complex puzzle fall into place like that.
