Posts with the Tag Data Discovery:

  • Global Dataset Discovery in PyVO

    A Tkinter user interface with inputs for Space, Spectrum, and Time, a checkbox marked "inclusive", and buttons Run, Stop, Broadcast, Save, and Quit.

    Admittedly somewhat old-style: As part of teaching global dataset discovery to pyVO, I have also come up with a Tkinter GUI for it. See A UI for more on this.

    One of the more exciting promises of the Virtual Observatory was global dataset discovery: You say “Give me all spectra of object X that there are”, and the computer relates that request to all the services that might have applicable data. Once the results come in, they are merged into some uniformly browsable form.

    In the early VO, there were a few applications that let you do this; I fondly remember VODesktop. As the VO grew and diversified, however, this became harder and harder, partly because there were more and more services, partly because there were more protocols through which to publish data. Thus, as far as I can see, there is, at this point, no software that can actually query all services plausibly serving, say, images or spectra in the VO.

    I have to say that writing such a thing is not for the faint-hearted, either. I probably wouldn't have tackled it myself had the pyVO maintainers not made it an effective precondition for cleaning up the pyVO Servicetype constraint.

    But they did, and hence as a model I finally wrote some code to do all-VO image searches using all of SIA1, SIA2, and obscore, i.e., the two major versions of the Simple Image Access Protocol plus Obscore tables published through TAP services. I actually have already reported in Tucson on some preparatory work I did last summer and named a few problems:

    • There are too many services to query on a regular basis, but filtering them would require them to declare their coverage; far too many still don't.
    • With the current way of registering obscore tables, there is no way to know their coverage.
    • One dataset may be available through up to three protocols on a single host.
    • SIA1 does not even let you constrain time and spectrum.

    Some of these problems I can work around, others I can try to fix. Read on to find out how I fared so far.

    The pyVO API

    Currently, the development happens in pyVO PR #470. While it is still a PR, let me point you to temporary pyVO docs on the proposed pyvo.discover module – of course, all of this is for review and probably not in the shape it will remain in[1].

    To quote from there, the basic usage would be something like:

    from pyvo import discover
    from astropy import units as u
    from astropy import time
    
    datasets, log = discover.images_globally(
      space=(339.49, 3.1, 0.1),
      spectrum=650*u.nm,
      time=(time.Time('1995-01-01'), time.Time('1995-12-31')))
    

    At this point, only a cone is supported as a space constraint, and only a single point in spectrum. It would certainly be desirable to be more flexible with the space constraint, but given the capabilities of the various protocols, that is hard to do. Actually, even with a plain cone, Obscore (i.e., ironically, the most powerful of the discovery protocols covered here) currently results in an implementation that makes me unhappy: ugly, slow, and wrong. This requires a longer discussion; see Appendix: Optionality Considered Harmful.

    datasets at this point is a list of, conceptually, Obscore records. Technically, the list contains instances of a custom class ImageFound, which have attributes named after the Obscore columns. In case you have doubts about the semantics of any column, the Obscore specification is there to help. And yes, you can argue we should create a single astropy table from that list. You are probably right.
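
    Should you want that, a minimal sketch might look like this; the columns picked here are just a few of the mandatory Obscore ones, and the whole thing assumes the attributes behave as plain values:

    from astropy.table import Table

    # build an astropy table from a few Obscore columns of the found datasets
    table = Table(rows=[
        (im.obs_collection, im.s_ra, im.s_dec, im.access_url)
        for im in datasets],
      names=("obs_collection", "s_ra", "s_dec", "access_url"))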

    PyVO adds an extra column over the mandatory obscore set, origin_service. This contains the IVOA identifier (IVOID) of the service at which the dataset was found. You have probably seen IVOIDs before: they are URIs with a scheme of ivo:. What you may not know: these things actually resolve, specifically to registry resource records. You can do this resolution in a web browser: Just prepend https://dc.g-vo.org/I/ to an IVOID and paste the result into the address bar. For instance, my Obscore table has the IVOID ivo://org.gavo.dc/__system__/obscore/obscore; the URL resulting from that prepending leads you to an information page, which happens to be the resource's Registry record formatted with a bit of XSLT. A somewhat more readable but less informative rendering is available when you prepend https://dc.g-vo.org/LP/ (“landing page”) instead.
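
    In code, that resolution is plain string concatenation; for instance:

    ivoid = "ivo://org.gavo.dc/__system__/obscore/obscore"
    print("https://dc.g-vo.org/I/" + ivoid)   # information page (Registry record)
    print("https://dc.g-vo.org/LP/" + ivoid)  # landing page rendering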

    The second value returned from discover.images_globally is a list of strings with information on how the global discovery progressed. For now, this is not intended to be machine-readable. Humans can figure out which resources were skipped because other services already cover their data, which services yielded how many records, and which services failed, for instance:

    Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Obscore GAVO Data Center Obscore Table: 2 records
    SIA2 The VO @ ASTRON SIAP Version 2 Service: 0 records
    SIA2 ivo://au.csiro/casda/sia2 skipped: ReadTimeout: HTTPSConnectionPool(host='casda.csiro.au', port=443): Read timed out. (read timeout=20)
    SIA2 CADC Image Search (SIA): 0 records
    SIA2 European HST Archive SIAP service: 0 records
    ...
    

    (On the skipping, see Relationships below). I consider this crucial provenance, as that lets you assess later what you may have missed. When you save the results, be sure to save these, too.
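
    Since the log is just a list of strings, keeping it around is a one-liner; for instance:

    # save the discovery provenance next to your results
    with open("discovery-log.txt", "w", encoding="utf-8") as f:
      f.write("\n".join(log))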

    A feature that will presumably (see Inclusivity for the reasons for this expectation) be important at least for a few years is that you can pass the result of a Registry query, and pyVO will try to find services suitable for image discovery on that set of resources.

    A relatively straightforward use case for that is global obscore discovery, which would look like this:

    from pyvo import discover
    from pyvo import registry
    from astropy import units as u
    from astropy import time
    
    def say(discoverer, s):
      print(s)
    
    datasets, log = discover.images_globally(
      space=(274.6880, -13.7920, 1),
      time=(time.Time('1995-01-01'), time.Time('1995-12-31')),
      services=registry.search(registry.Datamodel("obscore")),
      watcher=say)
    

    The watcher thing lets you, well, watch the progress of the discovery; it receives an instance of the discoverer -- this is so you can abort a discoverer's activities from within some UI -- and the human-readable string to display or process in some other way.
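
    For instance, a watcher that timestamps the messages could look like this (a sketch; the discoverer argument is simply ignored here, whereas a UI would keep it around to abort the run):

    import datetime

    def timestamping_watcher(discoverer, s):
      # a UI would use discoverer to abort the run;
      # we just decorate the message
      print(datetime.datetime.now().isoformat(), s)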

    A UI

    To get an idea whether this API might one day work for the average astronomer, I have written a Tkinter-based GUI to global image discovery as it is now: tkdiscover (only available from github at this point). This is what a session with it might look like:

    Lots of TOPCAT windows with various graphs and tables, an x-ray image of the sky with overplotted points, and a plain gray window offering the specification of space, spectrum, and time constraints.

    The actual UI is in the top right: A plain window in which you can configure a global discovery query by straightforward serialisations of discover.images_globally's arguments:

    • Space (currently, a cone in RA, Dec, and search radius, separated by whitespace or commas)
    • Spectrum (currently, a single point as a wavelength in metres)
    • Time (currently, either a single point in time – which probably is rarely useful – or an interval, to be entered as civil DALI dates)
    • Inclusivity.

    When you run this, it basically calls discover.images_globally and lets you know how it is progressing. You can click Broadcast (which sends the current result to all VOTable clients on the SAMP bus) or Save at any time and inspect how discovery is progressing. I predict you will want to do that, because querying dozens of services will take time.

    There is also a Stop button that aborts the dataset search (you will still have the records already found). Note that the Stop button will not interrupt running network operations, because the network library underneath pyVO, requests, is not designed for being interrupted. Hence, be patient when you hit stop; this may take as long as the configured timeout (currently 20 seconds) if the service hangs or has to do a lot of work. You can see that tkdiscover has noticed your stop request because the service counter will show a leading zero.

    Service counter? Oh, that's what is at the bottom right of the window. Once service discovery is done, that contains three numbers: The number of services to query, the number of services queried already, and the number of services that failed.

    The table contains the obscore records described above, and the log lines are in the discovery_log INFO. I will give you that this is extremely unreadable in particular in TOPCAT, which normalises the line separators to plain whitespace. Perhaps some other representation of these log lines would be preferable: A PARAM with a char[][] (but VOTable still is terrible with arrays of variable-length strings)? Or a separate table with char[*] entries?

    Inclusivity

    I have promised above I'd explain the “Inclusive” part in both the pyVO API and the Tk UI. Well, this is a bit of a sad story.

    All-VO-queries take time. Thus, in pyVO we try to only query services that we expect to serve data of interest. How do we arrive at expectations like that? Well, quite a few records in the Registry by now declare their coverage in space and time (cf. my 2018 post for details).

    The trouble is: Most still don't. The checkmark at inclusive decides whether or not to query these “undecidable” services, which makes a huge difference in runtime and effort. With the pre-configured constraints in the current prototype (X-ray images within a degree of 274.6880, -13.7920, from the year 1995), we currently discover three services (of which only one actually needs to be queried) when inclusive is off. When it is on, pyVO will query a whopping 323 services (today).
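
    In the API, this presumably surfaces as a simple keyword argument to discover.images_globally; the name inclusive here is an assumption of mine, so check the pyvo.discover docs linked above:

    # keyword name assumed, not gospel
    datasets, log = discover.images_globally(
      space=(274.6880, -13.7920, 1),
      inclusive=True)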

    The inclusivity crisis is particularly bad with Obscore tables because of their broken registration pattern; I can say that so bluntly because I am the author of the standard at fault, TAPRegExt. I am preparing a note with a longer explanation and proposals for fixing matters – <cough> follow me on github –, but in all brevity: Obscore data is discovered using something like a flag on TAP services. That is bad because the TAP services usually have entirely different metadata from their Obscore table; think, in particular, of the physical coverage that is relevant here.

    It will be quite a bit of effort to get the data providers to do the Registry work required to improve this situation. Until that is done, you will miss Obscore tables when you don't check inclusive (or override automatic resource selection as above) – and if you do check inclusive, your discovery runs will take something like a quarter of an hour.

    Relationships

    In general, the sheer number of services to query is the Achilles' heel in the whole plan. There is nothing wrong with having a machine query 20 services, but querying 200 is starting to become an effort.

    With multi-data collection services like Obscore (or collective SIA2 services), getting down to a few dozen services globally for a well-constrained search is actually not unrealistic; once all resources properly declare their coverage, it is not very likely that more than 20 institutions worldwide will have data in a credibly small region of space, time, and spectrum. If all these run collective services and properly declare the datasets to be served by them, that's our 20-services global query right there.

    However, pyVO has to know when data contained in a resource is actually queryable by a collective service. Fortunately, this problem has already been addressed in the 2019 endorsed note on Discovering Data Collections Within Services: Basically, the individual resource declares an IsServedBy relationship to the collective service. PyVO global discovery already looks at these. That is how it could figure out these two things in the sample log given above:

    Skipping ivo://org.gavo.dc/lswscans/res/positions/siap because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    Skipping ivo://org.gavo.dc/rosat/q/im because it is served by ivo://org.gavo.dc/__system__/obscore/obscore
    

    But of course the individual services have to declare these relationships. Surprisingly many already do, as you can observe yourself when you run:

    select ivoid, related_id
    from rr.relationship
    natural join rr.capability
    where
      standard_id like 'ivo://ivoa.net/std/sia%'
      and relationship_type='isservedby'
    

    on your favourite RegTAP endpoint (if you have no preferences, use mine: http://dc.g-vo.org/tap). If you have collective services and run individual SIA services, too, please run that query, see if you are in there, and if not, please declare the necessary relationships. In case you are unsure as to what to do, feel free to contact me.
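
    If you would rather stay in Python, the same query works through pyVO's TAP interface; a quick sketch:

    from pyvo.dal import TAPService

    svc = TAPService("http://dc.g-vo.org/tap")
    for row in svc.run_sync("""
        select ivoid, related_id
        from rr.relationship
        natural join rr.capability
        where standard_id like 'ivo://ivoa.net/std/sia%'
          and relationship_type='isservedby'"""):
      print(row["ivoid"], row["related_id"])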

    Future Directions

    At this point, this is a rather rough prototype that needs a lot of fleshing out. I am posting this in part to invite the more adventurous to try (and break) global discovery and develop further ideas.

    Some extensions I am already envisaging include:

    • Write a similar module for spectra based on SSAP and Obscore. That would then probably also work for time series and similar 1D data.

    • Do all the Registry work I was just talking about.

    • Allow interval-valued spectral constraints. That's pretty straightforward; if you are looking for some place to contribute code, this is what I'd point you to.

    • Track overflow conditions. That should also be simple, probably just a matter of perusing the pyVO docs or source code and then conditionally producing a log entry.

    • Make an obscore s_region out of the SIA1 WCS information. This should also be easy – perhaps someone already has code for that that's tested around the poles and across the stitching line? Contributions are welcome.

    • Allow more complex geometries to define the spatial region of interest. To keep SIA1 viable in that scenario it would be conceivable to compute a bounding box for SIA1 POS/SIZE and do “exact” matching locally on the coarser SIA1 result.

    • Enable multi-position or multi-interval constraints. This pretty certainly would exclude SIA1, and, realistically, I'd probably only enable Obscore services with TAP uploads with this. With those constraints, it would be rather straightforward.

    • Add SODA support: It would be cool if my ImageFound had a way to say “retrieve data for my RoI only”. This would use SODA and datalink to do server-side cutouts where available and do the cut-out locally otherwise. If this sounds like rocket science: No, the standards for that are actually in place, and pyVO also has the necessary support code. But still the plumbing is somewhat tricky, partly also because pyVO's datalink API still is a bit clunky.

    • Going async? Right now, we civilly query one service after the other, waiting for each result before proceeding to the next service. This is rather in line with how pyVO is written so far.

      However, on the network side, asynchronous programming has been a very successful paradigm for many years – for instance, our DaCHS package has been based on an async framework from the start, and Python itself has growing in-language support for async, too.

      Async allows you to fire off a network request and forget about it until the results come back (yes, it's the principle of async TAP, too). That would let people run many queries in parallel, which in turn would result in dramatically reduced waiting times, while we can rather easily ensure that a single client will not overflow any server. Still, it would be handing a fairly powerful tool into possibly inexperienced hands… Well: for now there is no need to decide on this, as pyVO would need rather substantial upgrades to support async.

    Appendix: Optionality Considered Harmful

    The trouble with obscore and cones is a good illustration of the traps of attempting to fix problems by adding optional features. I currently translate the cone constraint on Obscore using:

    "(distance(s_ra, s_dec, {}, {}) < {}".format(
      self.center[0], self.center[1], self.radius)
    +" or 1=intersects(circle({}, {}, {}), s_region))".format(
      self.center[0], self.center[1], self.radius))
    

    which is all of ugly, presumably slow, and wrong.

    To appreciate what is going on, you need to know that Obscore has two ways to define the spatial coverage of an observation. You can give its “center” (s_ra, s_dec) and something like a rough radius (s_fov), or you can give some sort of geometry (e.g., a polygon: s_region). When the standard was written, the authors wanted to enable Obscore services even on databases that do not know about spherical geometry, and hence s_region is considered rather optional. In consequence, it is missing in many services. And even the s_ra, s_dec, s_fov combo is not mandatorily non-NULL, so you are perfectly entitled to only give s_region.

    That is why there are the two conditions or-ed together (ugly) in the code fragment above. 1=intersects(circle(.), s_region) is the correct part; this is basically how the cone is interpreted in SIA1, too. But because s_region may be NULL even when s_ra and s_dec are given, we also need to do a test based on the center position and the field of view. That rather likely makes things slower, possibly quite a bit.

    Even worse, the distance-based condition actually is wrong. What I really ought to take into account is s_fov and then do something like distance(.) < {self.radius}+s_fov, that is, the dataset position need only be closer than the cone radius plus the dataset's FoV (“intersects”). But that would again produce a lot of false negatives because s_fov may be NULL, too, and often is, after which the whole condition would be false.
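
    In the style of the fragment above, that fov-aware variant would read something like this – again defeated by NULL s_fov, which is why it is not what the code does:

    # a sketch of the fov-aware condition discussed above
    ("1=intersects(circle({}, {}, {}), s_region)".format(
      self.center[0], self.center[1], self.radius)
    +" or distance(s_ra, s_dec, {}, {}) < {}+s_fov".format(
      self.center[0], self.center[1], self.radius))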

    On top of that, it is virtually impossible that such an expression would be evaluated using an index, and hence with this code in place, we would likely be seqscanning the entire obscore table almost every time – which really hurts when you have about 85 million records in your Obscore table (as I do).

    The standard could immediately have sanitised all this by saying: when you have s_ra and s_dec, you must also give a non-empty s_fov and s_region. This is a classic case for where a MUST would have been necessary to produce something that is usable without jumping through hoops. See my post on Requirements and Validators on this blog for a longer exposition on this whole matter.

    I'm not sure if there is a better solution than the current “if the operators didn't bother with s_region, the dataset's FoV will be ignored”. If you have good ideas, by all means let me know.

    [1]

    If you want to try this (in particular without clobbering your “normal” pyVO), do something like this:

    virtualenv --system-site-packages global-datasets
    . global-datasets/bin/activate
    cd global-datasets
    git clone https://github.com/msdemlei/pyvo
    cd pyvo
    git checkout global-datasets
    pip install .
    
  • News From the VO Via ActivityPub

    Screenshot of a browser showing the Mastodon rendering of GAVO's ActivityPub feed

    If you ask us: Get a proper client to join the Fediverse. But as shown here, in a pinch a web browser will do, too.

    When Twitter was still fairly young, we had an account there that would tweet out when new data collections appeared in the VO. Even back then, I was rather doubtful whether using a proprietary platform to disseminate open data is a good idea, but as long as the content was also available through standard protocols (RSS in this case), I thought it might be worth a try. Well: It never really took off, and after Twitter broke the whole thing a couple of times by incompatible API changes, I finally let it go ca. 2017.

    Given the recent mass exodus from the smouldering remains of Twitter into the open and standard Fediverse, I thought reviving our little missives there might actually be a worthwhile effort. Specifically, joining Mastodon – which speaks the ActivityPub protocol and hence is part of the Fediverse – has become really straightforward.

    So, if the VO Fresh RSS Feed is not for you (perhaps because you do not have an RSS aggregator, which would be a shame), maybe following our new Mastodon account @gavo@botsin.space would be for you?

    Followup (2024-11-21)

    In late 2024, botsin.space shut down, and we moved our operations to @gavo@astrodon.social; so, please point your fediverse clients there.

    Oh, and yes, I'll give you that the previews the Mastodon web client produces for VizieR resources are not overly pretty yet (curse Javascript templating!), but then if I were you, I'd disable URL previews anyway; really, they are little more than a privacy annoyance.

    This post also doubles as identity verification, so:

    Visit Our Mastodon Page.
  • Towards Data Discovery in pyVO

    When I struggled with ways to properly integrate TAP services – which may have hundreds or thousands of different resources in one service – into the VO Registry without breaking what we already had, I realised that there are really two fundamentally different modes of using the VO Registry. In the abstract of Discovering Data Collections I wrote:

    the Registry must support both VO-wide discovery of services by type ("service enumeration") and discovery by data collection ("data discovery").

    To illustrate the difference in a non-TAP case, suppose I have archived images of lensed quasars from Telescopes A, B, and C. All these image collections are resources in their own right and should be separately findable when people look for “resources with data from Telescope A” or perhaps “images obtained between 2011-01-01 and 2011-12-31”.

    However, when a machine wants to find all images at a certain position, publishing the three resources through three different services would mean that that machine has to do three requests where one would work just as well. That is very relevant when you think about how the VO will evolve: At this point there are 342 SIAP services in the VO, and when you read this, that number may have grown further. Adding one service per collection will simply not scale when we want to keep the possibility of all-VO searches. Since I claim that is a very desirable thing, we need to enable collective services covering multiple subordinate resources.
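
    If you want the current number, a RegTAP query will give it to you; for instance, counting SIAP capabilities with pyVO (note this also counts auxiliary capabilities, so the figure will not exactly match the one above):

    from pyvo.dal import TAPService

    print(TAPService("http://dc.g-vo.org/tap").run_sync(
      "select count(*) as n from rr.capability"
      " where standard_id like 'ivo://ivoa.net/std/sia%'"
      ).to_table())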

    So, while in the first (“data discovery”) case one wants to query (or at least discover) the three resources separately, in the second case they should be ignored, and only a collective “images of lensed quasars” service should be queried.

    The technical solution to this requirement was creating “auxiliary capabilities” as described in the endorsed note on discovering data collections cited above. But these of course need client support; VO clients up to now by and large do service enumeration, as that has been what we started with in the VO Registry. Client support would, roughly, mean that clients would present their users with data collections, and then offer the various ways to access them.

    There are quite a number of technicalities involved in why that's not terribly straightforward for the “big” clients like TOPCAT and Aladin (though Aladin's discovery tree already comes rather close).

    Now that quite a number of people use pyVO interactively in jupyter notebooks, extending pyVO's registry interface to do data discovery in addition to the conventional service enumeration becomes an attractive way to bring data discovery into practice.
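
    For contrast, conventional service enumeration with pyVO's registry interface looks roughly like this (a sketch using the constraint-based interface):

    from pyvo import registry

    # enumerate X-ray SIAP services and print their titles
    for res in registry.search(
        registry.Servicetype("sia"),
        registry.Waveband("x-ray")):
      print(res.res_title)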

    This is why I have created pyVO PR #289. I think some of the rough edges will need to be smoothed out before it can be merged, but meanwhile I'd be grateful if you could try it out already. To facilitate that, I have prepared a jupyter notebook that shows the basic ideas.

    Followup (2023-12-15)

    I have just prepared a slightly updated version of the notebook.

    To run it while the PR is not merged, you need to install the forked pyVO. In order to not clobber your main installation, you can install astropy using your package manager and then do the following (assuming your shell is bash or something suitably similar):

    virtualenv --system-site-packages try-discoverdata
    . try-discoverdata/bin/activate
    cd try-discoverdata
    git clone https://github.com/msdemlei/pyvo
    cd pyvo
    git checkout add-discoverdata
    python3 setup.py develop
    ipython3 notebook
    

    That should open a browser window in which you can open the notebook (you probably want to download it into the pyvo checkout in order to make the notebook selector see it). Enjoy!

  • DaCHS 2.4 is out: Blind discovery, pretty datalink, and more

    DaCHS screenshots and logo

    DaCHS 2.4: automatic ranges (with registry support!), pretty datalink (with vocabulary support!). And then the usual bunch of improvements (hopefully!).

    I have released DaCHS 2.4 today, and as usual for stable releases, I would like to have something like a commented changelog here so DaCHS deployers perhaps look forward to upgrading – which would be good, because there are far too many outdated DaCHSes out there.

    Among the more notable changes in version 2.4 are:

    Blind discovery overhaul. If you've been following my requests to include coverage metadata three years ago, you have probably felt that the way DaCHS started to hack your RDs to include the metadata it had obtained from the data was a bit odd. Well, it was. DaCHS no longer does that when running dachs limits. While you can still do manual overrides, all the statistics gathered by DaCHS are now kept in the database and injected into DaCHS' internal idea of your RDs at loading time.

    I have not only changed this because the old way really sucked; it was also necessary because I wanted to have per-column metadata routinely, and since in advanced DaCHS there often are no XML literals for columns (because of active tags), there wouldn't be a place to keep information like what a column is minimally, maximally, in median, or as a “2σ range” within the RD itself. A longer treatment of where this is going is given in the IVOA note Blind Discovery 2: Advanced Column Statistics that Grégory and I have recently uploaded.

    For you, it's easy: Just run dachs limits q once you're happy with your data, or perhaps once a month for living data, and leave the rest to DaCHS. A fringe benefit: in browser forms, the various numeric constraints now show their value ranges as placeholders (that's the screenshot on the left in the title picture).

    There is a slight downside: As part of this overhaul, DaCHS is now computing the coverage of SIAP and SSAP services based on the footprints of the products as MOCs. While that gives much more precise service footprints, it only works with bleeding-edge pgsphere as delivered in Debian bullseye – or from our Debian repository. If you want to build this from source, you need to get credativ's pgsphere fork for now.

    Generate column elements: If you have tables with many columns, even just lexically entering the <column> elements becomes a strain. That is particularly annoying if there already is a halfway machine-readable representation of that data.

    To alleviate that, very early in the development of DaCHS, I had the gavo mkrd subcommand that you could feed FITS images or VOTables to get template RDs. For a number of reasons, that never worked well enough to make me like or advertise it, and I eventually ended up writing dachs start instead, which is something I like and advertise for general usage.

    However, what that doesn't do is come up with the column declarations. To make good on this, there is now a dachs gencol command that will, from a FITS binary table, a VOTable, or a VizieR-style byte-by-byte description, generate columns with as much metadata as it can fathom. Paste that into the output of dachs start, and, depending on your input format, you should have a quick start on a fairly full-featured data collection (also note there's dachs adm suggestucds for another command that may help quickly generate rich metadata).

    This currently doesn't work for products (i.e., tables of spectra, images, and the like); at least for FITS arrays, I suppose turning their non-obvious header cards into columns might save some work. Let's see: your feedback is welcome.

    Refurbished Datalink XSLT: Since the dawn of datalink, DaCHS has delivered Datalink documents with XSLT stylesheets in order to have nicely formatted pages rather than wild XML when web browsers chance on datalink documents. I have overhauled the Javascript part of this (which, I have to admit, is what makes it pretty). For one, the spatial cutout now works again, and it's modeless (no clicking “edit” any more before you can drag cutout vertices). I'm also using the datalink/core vocabulary to furnish link groups with proper titles and descriptions, and to have them sorted into a proper result tree. I've talked about it at the interop, and I've prepared a showcase of various datalink documents in the Heidelberg data centre.

    Update to DaCHS 2.4 and you'll get the same thing for your datalinks.

    Non-product datalinks: When writing a datalink service, you have to first come up with a descriptor generator. DaCHS will provide a simple one for you (or perhaps a bit more complex ones for FITS images or spectra) – but all of these assume that whatever the datalink ID parameter references is in DaCHS' product table. It turned out that in many interesting cases – for instance, attaching time series to object catalogues – that is not the case, and then you had to write rather obscure code to keep DaCHS from poking around in the product table.

    No longer: There is now the //datalink#fromtable descriptor generator. Just fill in which column contains the identifier and the name of the table containing that column and you're (basically) done. Your descriptor will then have a metadata attribute containing the relevant row – along with everything else DaCHS expects from a datalink descriptor.

    gavo_specconv: That's a longer story covered previously on this blog.

    Index declaration in views: Saying on which columns a database index exists allows users to write smart queries, and DaCHS uses such information internally when rewriting geometrical expressions from ADQL to whatever is in use in the actual database. Hence, making sure these indexes are properly declared is important. But at the same time it's difficult for views, because postgres doesn't let you have indexes on views (for good reasons). Still, queries against views will (usually) use indexes of their underlying tables, and hence those should be declared in the corresponding metadata.

    This is tedious in general. DaCHS now helps you with the //procs#declare-indexes-from stream. Essentially, it will compare the columns in the view with the ones from the source tables and then guess which view columns correspond to indexed columns from the source tables; using that, it adds indexed flags to some view columns.

    If all this is too weird for you: Thanks to declare-indexes-from, the index declaration now automatically happens in the modern way to build SSAP services, the //ssap#view mixin. Hence, chances are you won't even see this particular STREAM but just notice its beneficial consequences.

    Sunsetting resources: I've been fiddling off and on with a smart way to pull resources I no longer want to maintain while still leaving a tombstone. I had to re-visit this problem recently because I dropped the Gaia DR1 table from my Heidelberg data centre. So, how do I explain to people why the thing that's been there no longer is?

    In general, this is a rather intractable problem; for instance, it's very hard to do something sensible with the TAP_SCHEMA entries or the VOSI tables endpoints for the tables that went away. Pure web pages, on the other hand, can be adorned with helpful info. To enable that, there is now the superseded meta item, which you define in the RD that once held the resources. For Gaia DR1, here's what I used:

    <meta name="superseded" format="rst">
      We do not publish Gaia DR1 data here any more.
      If you actually need DR1 data, refer to the
      full Gaia mirrors, for instance `the one at
      ARI`_.  Otherwise, please use more recent data
      releases, for instance `eDR3`_.
    
      .. _the one at ARI: http://gaia.ari.uni-heidelberg.de
      .. _eDR3: /browse/gaia/q3
    </meta>
    

    Root page template: I slightly streamlined the default root page template, in particular dropping the "i" and "Q" icons for going to the metadata and querying the service. If you have overridden the root template, you may want to see if you want to merge the changes.

    As usual, there are many more small repairs and additions, but most of these are either very minor or rather technical. One last thing, though: DaCHS now works with Python 3.8 (3.7 will continue to be supported for a few years at least, earlier 3.x never was), which is going to be the python3 in Debian bullseye. Bullseye itself will only have DaCHS 2.3 (with the Python 3.8 fixes backported), though. Once bullseye has become stable, we will look into putting DaCHS 2.4 into the backports.

  • Semantics, Cross-Discipline Discovery, and Down-To-Earth Code

    Boxes-and-arrows view of the UAT

    A tiny piece of the Unified Astronomy Thesaurus as viewed by Sembarebro – the IVOA logos sit on terms that have VO resources on them.

    Sometimes people ask me (in particular when I'm wearing my hat as the current chair of the IVOA Semantics working group) “well, what's this semantics thing good for?” There are many answers, but here's one that nicely meshes with my pet subject data discovery: You want hierarchical, agreed-upon word lists to bridge discipline gaps.

    This story starts with B2FIND, a cross-disciplinary metadata aggregator for science data run within the framework of the European Open Science Cloud (EOSC). GAVO (or, more precisely, Heidelberg University's Astronomy) is involved in the EOSC via the ESCAPE project, and so I have had the pleasure of interacting with B2FIND for a while now. In particular, they are harvesting the metadata records of the Virtual Observatory Registry from us.

    This of course requires a bit of mapping, because the VO's metadata formats (VOResource, VODataService, and several extensions; see 2014A&C.....7..101D to learn more) are far too fine-grained for the wider scientific public. Not even our good friends from high-energy physics would appreciate being served links to, say, TAP endpoints (yet!). So, on our end we're mapping to the Datacite metadata kernel, which from VOResource is just a piece of XSL away (plus some perhaps debatable conventions).

    But there's more to this mapping, such as vocabularies of subject keywords. You might argue that in the age of rapid full text searches, keywords are dead. I would beg to disagree. For example, with good, hierarchical keyword systems you can, among many other useful things, offer topical browsing of metadata repositories. While it might not quite qualify as “useful” yet, the SemBaReBro registry browser I've hacked together late last year would be an example for such facilities – and might become part of our WIRR Registry searching tool one day.

    On the topic of subject keywords VOResource says that resources in the VO should be using the Unified Astronomy Thesaurus, specifically in its IVOA incarnation (not quite true yet, but true enough by blog standards). While few do, I've done a mapping of existing keywords in the VO to UAT concepts, which is what's behind SemBaReBro. So: most VO resources now have UAT concepts.

    However, these include concepts like AM Canum Venaticorum Stars, which outside of rather specialised circles of astronomers few people will ever have heard about (which, don't get me wrong, I personally regret – they're funky star systems). Hence, B2FIND does not bother with those.

    When we discussed the subject mapping for B2FIND, we thought using the UAT's top-level concepts might be a good start. However, at that point no VO resources at all actually used these, and, indeed, within astronomy that generally wouldn't make a lot of sense, because they are too unspecific to help much within the discipline. I postponed and then forgot about the problem – when the keywords of the resources weren't even from UAT, solving the granularity mismatch just wasn't humanly possible.

    That was the state of affairs until last Tuesday, when I had a mumble session with B2FIND folks and the topic came up again. And now, thanks partly to the new desise format proposed in the current Vocabularies in the VO 2 draft, things fell nicely into place: Hey, I have UAT concepts, and mapping these to the top-level terms isn't hard either any more.

    So, B2FIND gets the toplevel keywords they've been expecting all the time starting today. Yes: This isn't a panacea suddenly solving all the problems of cross-discipline data discovery, not least because it's harder than one might think to imagine what such a thing would look like in practice. But given the complexities involved I was positively surprised how easy this particular part of the equation was.

    From here on, there's a bit of tech babble I intend to re-use in the RFC of Vocabularies in the VO 2; don't feel bad if you skip it.

    The first step was to make the mapping from UAT terms to the toplevel terms. The interesting part of the source I'm linking to here is:

    def get_roots_for(term, uat_terms):
      roots, seen = set(), set()

      def follow(t):
        if t in seen:
          # guard against cycles in the concept graph
          return
        seen.add(t)
        wider = uat_terms[t]["wider"]
        if not wider:
          if not t in ROOT_TERMS:
            raise Exception(
              f"{t} found as a top-level term outside ROOT_TERMS")
          roots.add(t)
        else:
          for w in wider:
            follow(w)

      follow(term)
      return roots
    

    There, uat_terms is essentially just a json-decode of what you get from the vocabulary URI if you ask for desise (see the draft spec linked to above for the technicalities). That's really it, and it even defends against cycles in the concept graph (which are legal by SKOS but shouldn't happen in the UAT) and detached terms (i.e., ones that are not rooted in the top-level terms). For what it does, I claim that's remarkably compact code.
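
    To show where uat_terms comes from: desise is JSON you can fetch from the vocabulary URI via content negotiation. A sketch, with the media type as given in the Vocabularies in the VO 2 draft:

    import json
    from urllib import request

    # fetch the UAT in desise format and pull out the terms mapping
    req = request.Request("http://www.ivoa.net/rdf/uat",
      headers={"accept": "application/x-desise+json"})
    with request.urlopen(req) as f:
      uat_terms = json.load(f)["terms"]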

    Once I had that, I needed to get the UAT-mapped subject keywords for the records I'm serving to datacite and fiddle the corresponding roots back in. That's technically a bit more involved because I am producing the datacite records on the fly from the XML representation for VOResource records that I keep in the database, and there's a bit of namespace magic involved (full code). Plus, the UAT-mapped keywords are only kept in the database, not in the metadata records.

    Still, the core operation here is relatively straightforward. Consider:

    def addUATToplevels(dataciteTree):
      # dataciteTree is an (lxml) ElementTree for the
      # result of the XSL transformation.  That's all
      # I have, and thus I first have to fiddle out
      # the identifier we are talking about
      ivoid =  dataciteTree.xpath(
          "//d:alternateIdentifier["
          "@alternateIdentifierType='ivoid']",
          namespaces={"d": DATACITE_NS}
        )[0].text.lower()
      # The .lower() is necessary because ivoids
      # unfortunately are case-insensitive, and RegTAP
      # normalises them to lowercase to retain sanity.
    
      # Now pull the UAT-mapped subject keywords from
      # our RegTAP extension (getTableConn is
      # DaCHS-internal API, but there's no magic in
      # there, it's just connection pooling with
      # guarantees against connections  idle in
      # transaction).
      with base.getTableConn() as conn:
        subjects = set(r[0] for r in
          conn.query("SELECT uat_concept"
            " FROM rr.subject_uat"
            " WHERE ivoid=%(ivoid)s", locals()))
    
      # This is the mapping itself: we do
      # roots-subjects to avoid adding
      # root terms that are already in
      # the record itself.  UAT_TOPLEVELS is the result
      # of the root finding discussed above.
      newRoots = set()
      for term in subjects:
        root = UAT_TOPLEVELS[term]
        newRoots |= (root-subjects)
    
      # And finally fiddle in any new root terms found
      # into the datacite tree
      if newRoots:
        subjects = dataciteTree.xpath(
          "//d:subjects",
          namespaces={"d": DATACITE_NS})[0]
        for root in newRoots:
          newSubject = etree.SubElement(subjects,
            f"{{{DATACITE_NS}}}subject")
          newSubject.text = root
    

    Apart from the technicalities I'd again say that's pretty satisfying code.

    And these two pieces of code are really all I had to do to map between the vocabularies of different granularities – which I claim will probably be the norm as metadata flows between disciplines.

    It's great to see the pieces of a fairly complex puzzle fall into place like that.
