LTDP SAFE

Simplification design trade-off

During the PDR-C collocation meeting it was agreed to evaluate whether it is possible to reduce the number of SAFE Packages and of the files comprising the SAFE Packages or whether, on the contrary, it is advisable to keep the existing approach of SAFE 1.3 (i.e. not to externalise the representation information).

Note that the conclusions of this trade-off have a direct impact on the SAFE Packages structure proposed in the “Index Analysis Trade-Off” (PDGS-SAFE-GMV-TN-12/0186) and “EO Collection and EO Product metadata separation trade-off” (PDGS-SAFE-GMV-TN-12/0185).

The attached document “Simplification design trade-off” (PDGS-SAFE-GMV-TN-12/0171) provides the analysis and conclusions reached on this topic.

All your comments will be appreciated.

Best Regards.

Fernando Ibáñez (GMV)


Re: Simplification design trade-off

Please see the attached document for detailed comments on the trade-off for your consideration and discussion.
--
Stephan Zinke for EUMETSAT



Re: Simplification design trade-off

Dear Stephan,

Thank you very much for your comments and sorry for the delay. Please find attached the answer to your comments.

Best Regards.

Fernando Ibáñez (GMV)




Re: Re: Simplification design trade-off

Dear Fernando,
thank you for your replies to my comments.
I will likely wait with further comments until other people have expressed their view(s) :-)
Regards
--
Stephan



Re: Simplification design trade-off

Dear All,

I cannot answer all the points of the discussion but still want to "express my view" :-D

1. The complexity of the simplification strikes me, but this is not a judgement, as the matter is complex by nature.

2. In an archive I would expect Collection metadata to be metadata describing a set of products, each of which has exactly one parentIdentifier, i.e. its Collection identifier. Not building complex hierarchies that need to be archived and are subject to frequent change would simplify a lot. The Collection and Product metadata must be informative enough that Thematic Collection hierarchies can be derived outside of the archive, and especially not within the archive format.

3. Using an API to resolve schema locations is, for me, too much overhead.

3.a I suggest archiving all schemas which are not product-type representation information (i.e. non-DFDL) in one Representation Information Package, but first copying them into a filesystem structure which corresponds to the (relatively addressed) schemaLocation in the schema instances. If new versions of schemas, or new schemas, appear, add them to the archived schema package and copy them to their schemaLocation (see the sketch after this list).

3.b Take care that all generic schemas share a common root, so that they do not need to be copied again for an already archived product or a new product type.

4. I suggest archiving each product-type Representation Information Package separately, and addressing it identically via the Manifest file from all products of the same type.

5. Multiple DFDL versions: should the product structure change, this would be another product type, perhaps with the same name, but with newly archived products which point to different representation information in their Manifest file. Different DFDL versions of the same product structure and semantics do not need to be kept.
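
A minimal sketch (Python) of the layout described in 3.a/3.b, assuming hypothetical schema files and relative schemaLocation paths; it only illustrates the copying rule, it is not part of the trade-off document:

# Copy the non-DFDL schemas into a common root whose sub-paths match the
# relative schemaLocation values used by the metadata instances.
# All file names and paths below are hypothetical.
import shutil
from pathlib import Path

# Hypothetical mapping: source schema file -> relative schemaLocation used in instances
SCHEMAS = {
    "incoming/xfdu.xsd":          "schemas/safe/1.3/xfdu.xsd",
    "incoming/gml/gml.xsd":       "schemas/ogc/gml/3.1.1/gml.xsd",
    "incoming/19139/gmd/gmd.xsd": "schemas/iso/19139/gmd/gmd.xsd",
}

def populate_schema_root(schema_root: Path) -> None:
    """Copy each schema to the path matching its relative schemaLocation.

    New schemas (or new versions) are simply added to the same tree, so
    already archived products keep resolving their references unchanged.
    """
    for source, rel_location in SCHEMAS.items():
        target = schema_root / rel_location
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)

populate_schema_root(Path("/archive/representation-information"))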

Regards,
Bernhard



Re: Simplification design trade-off

Maybe the terms "API" and "Registry" are not the most accurate to use in the context of this trade-off and the externalization. I also don't see the need for any API and think it's an unnecessary complication.

For what concerns the Registry, maybe "Mapper", "Translator" or "Dictionary" are closer to the desired meaning. But whatever you call it, the point is that if you have externalization you want your SAFE EO Product Packages to be as static as possible even in the event of relevant external changes (e.g. representation information language, location of representation information, auxiliary files, location of auxiliary files), because otherwise you lose the only real advantage of externalizing, which is to avoid a full archive re-transcription when one of these changes happens. Hence, at least in my opinion, the need for "something" that translates logical (and, more importantly, static) references into physical references is clear. The idea is that the SAFE EO Product Package, and in particular the Manifest, is generated once, with logical/static references only, and "never" changed again. Any change that happens externally will, in the worst case, only affect this Registry/Mapper, which will nevertheless ensure the persistence of the logical/static references.
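
A minimal sketch (Python) of such a Registry/Mapper, assuming hypothetical logical identifiers and physical locations; the point it illustrates is that external changes update the mapping table while the Manifest, holding only logical references, stays untouched:

from typing import Dict

class ReferenceMapper:
    """Translates static logical references into current physical references."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self._mapping = dict(mapping)

    def resolve(self, logical_ref: str) -> str:
        """Return the current physical location for a logical reference."""
        try:
            return self._mapping[logical_ref]
        except KeyError:
            raise LookupError(f"No physical location registered for {logical_ref!r}")

    def relocate(self, logical_ref: str, new_physical_ref: str) -> None:
        """Record an external change (e.g. moved representation or auxiliary data)."""
        self._mapping[logical_ref] = new_physical_ref

# Hypothetical usage: the URNs and paths are invented for illustration.
mapper = ReferenceMapper({
    "urn:safe:repinfo:asar:level0:dfdl:1.0": "/archive/repinfo/asar/level0/dfdl-1.0/",
    "urn:safe:aux:asar:orbit:2011-06":       "/archive/aux/asar/orbit/2011-06/",
})
print(mapper.resolve("urn:safe:repinfo:asar:level0:dfdl:1.0"))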

Now you can argue that this "something" is outside SAFE, and to a certain extent this is true. However, a main purpose of SAFE, besides long-term archiving, is the exchange of data between archive owners, and this trade-off, as explicitly requested by Bernhard, was also trying to tackle the issues of data migration between archives. For me, it is difficult to imagine how you can successfully and relatively painlessly migrate SAFE data from one archive to another, especially bringing together all relevant auxiliary data (and possibly the representation information as well), without the SAFE standard defining, as a minimum, how these logical/static references should be formed (in my mind they have to be the same whether the product is in an ESA archive, a DLR archive or a EUMETSAT archive), and possibly also how this "something" should work (I agree that the design could be archive-dependent here, but the logical/static references are part of this "something", so you are not completely free). The only alternative, which makes the migration process more painful, is to define migration procedures that include editing the Manifest of all incoming products to comply with the specific externalization mechanism implemented by the given archive.

Finally, for what regards the (surprising) conclusion that maybe externalization is not the way to go: we encouraged GMV to complete this trade-off as if externalization were a reality - you may question the whole point of writing the document if in the end the conclusion means that what was just written will not be used, but we feel that without it any decision would be less substantiated - and we also liaised with other experts already involved in the design of previous SAFE versions. In the end, the truth is that we are not convinced about the solution, not so much because of its peculiarities but more at a fundamental level.

Particularly for Level-0 data (which is the main target of SAFE), the benefit of avoiding re-transcriptions due to external changes is a weak argument when compared to the complexity and cost (development, maintenance and operations) introduced by externalization. This is because Level-0 data will be relatively static and not very prone to change (unlike L1 and L2, for which new processor versions may introduce all kinds of changes: new auxiliary data, new formats, etc.). To this you may add the fact that never re-transcribing your archive is not a realistic scenario anyway: according to best practices and the LTDP Guidelines, you will want at least to refresh the archive media, hardware and technology every 3 to 5 years, so you will read all your L0 data anyway, and you might even consider coordinating these refreshment activities with other changes (e.g. new DFDL versions) you may want to perform in your LTDP SAFE archive.

No decision has been made yet and your feedback is more than welcome (it is very important), but I think my comment sums up what we have been considering.

Best Regards,

Paulo



Re: Simplification design trade-off

Please find another set of comments from EUMETSAT, attached.
Regards
--
Stephan


Re: Re: Simplification design trade-off

For other members' benefit and possible further contribution, here is an e-mail exchange between Stephan and me on this subject, starting from the PDF that Stephan attached to this post. It is slightly edited from the original.

- - -

PAULO: we see the externalisation as costly due to the new and more complex tools that would have to be developed and maintained to work with such a standard (and/or with an archive complying with such a standard). We know that SAFE is not an archive standard but rather a data format standard, and this, in fact, is one of the reasons why we consider that externalisation will be costly. By externalising the representation information you are putting something which is still logically a part of the product outside of it, so it comes naturally that you will want to impose conditions on the archive which hosts the SAFE products. Think for example of a tool like the SAFE TOOLBOX which exists today, or the I/O Library: with no externalisation you can use these tools on a PC which does not even have a network connection; with externalisation you always need a connection to fetch the representation information, so your "local" tools become "distributed", with all the associated complexity. You can say that you don't need the representation information to read the product, but this, to me, means you are no longer OAIS-compliant, and without the representation information, frankly, I think SAFE adds no value at all; you might as well use another format.

STEPHAN: I can see your train of thought. And I do agree that, just looking at a single SAFE package, it is likely more complex to maintain certain information outside the package. However, as was discussed during the various meetings, some benefits were also seen in a possible externalisation (e.g. reducing re-transcription efforts, reducing redundancy, etc.).

I wouldn't try to mix what you do on your local PC when creating a SAFE package with being OAIS-compliant; IMO those are two different topics. Clearly, externalisation increases complexity; there is no doubt. But I really feel that there are certain advantages. There is a reason why, e.g., certain data is hosted on the Internet at (static) addresses, with the aim of reducing the maintenance of redundant information.

PAULO: it is not disputed that SAFE is for the AIP and not for the DIP (although there are no limitations). I fail to see the connection to the externalisation/no-externalisation discussion, though. My idea is that you exchange AIPs between archives, i.e. you don't generate a DIP which is sent to another archive, which then re-converts this DIP into an AIP again.

STEPHAN: I am not sure that it is really that easy to exchange AIPs between two different archives. I would think something along the lines you described is more likely: create a DIP from the AIP, which then becomes the SIP for the next archive. If all archives use the same format, fine, then this is straightforward.

I was mentioning this mainly in relation to my comment on the document (#20).

PAULO: do you agree at least that SAFE should define the logical identifiers for the externalised representation information (leaving the registry/mapper outside the SAFE standard)? If not, what is your proposed solution for the exchange of SAFE AIPs between archives? And for simple physical moves of representation information inside the archive?

STEPHAN: I don't think I have a problem with SAFE defining the logical identifiers; actually, I believe it should (how else would one be able to define them, if it is not SAFE defining how to build them?). A physical move of representation information inside the archive would, IMO, not affect the logical identifiers (if cleverly constructed). I can think, e.g., of using URNs which are then mapped to physical locations via some mapper, e.g. a catalogue. When representation information is moved, this mapping would need to be updated accordingly.

The topic of exchanging SAFE AIPs between archives has maybe not been sufficiently addressed in the past. One way could be, as described above, AIP->DIP->SIP->AIP, as the archives might be implemented differently.

PAULO: we believe that your fears regarding the non-compliance of implementations for future missions stem perhaps from the Sentinel-3 experience, where a SAFE-like standard was initially proposed, containing a kind of externalisation of the representation information. However, we have briefly discussed this with some of the people involved, and our conclusion was that what was done in that case is not externalisation of representation information, but rather moving all the common and, for them, not useful parts of SAFE (e.g. the representation information) outside of the packages. I have also learned that in that Ground Segment they do not really care at all about the content of the SAFE Manifest and in the end are just using SAFE as a dummy wrapper around their desired format. This means that it is not at all similar to SAFE; in fact, it has been requested that the SAFE acronym is not used in that context. SAFE is a very specific format, for Long Term Data Preservation, where representation information is the key to everything. In day-to-day operational contexts the representation information has little to no value (in fact, I personally would recommend against using SAFE there: where is the value?); people know those products and know how to use and read them. Our concern here is for 10 or 20 years in the future, when that will no longer be the case. For LTDP, I feel much SAFEr if the representation information is right there along with the product, although this is only one of the reasons for our current position.

STEPHAN: I do agree that what has been done for S3 (and I believe the same is true for S1 and S2) does not have much to do with SAFE anymore. We are actually thinking of trying to mandate SAFE for all new missions (and products), but this is a lengthy process to agree upon.

PAULO: the two-fold approach seems as attractive as it is dangerous. Leaving this open, as an option, will lead to different implementations and to the loss of the ability to painlessly exchange SAFE AIPs. I cannot see how we can have the same SAFE package regardless of whether the representation information is externalised or not. That approach will certainly facilitate adoption, but I am afraid that at that point SAFE will no longer be a real standard (or it will become like some HMA standards, where so much is left to the implementers that in the end you have 10 different implementations of the same "standard" and, although the goal of HMA is interoperability, all those implementations have problems interoperating).

STEPHAN: Well, yes, I can see your point here. But is the question really mainly about exchanging AIPs? I doubt it. I felt rather that the reason was LTDP related, i.e. keeping everything accessible and meaningful in 10-20 or more years' time.



Re: Simplification design trade-off

From the trade-off presented it is clear that externalizing the representation information generates a complexity that, compared to the benefits, is not affordable. Therefore I agree with the conclusion not to use externalization.

For what concerns the registry, as SAFE is a product format, it is up to the archive to resolve the links between a SAFE product and its provenance information. However, the SAFE format should provide the logical referencing structure (e.g. HREF and URI).

For what concerns Stephan's proposal to have the representation information optional, this should already be possible using the SAFE operational class.


Re: Re: Simplification design trade-off

Hi Antonio,
I didn't propose "optional" but proposed to have a choice: have the representation information either internal or external (i.e. exactly one of the two).
I cannot follow the trade-off in its conclusion that the complexity is so big that it would not be affordable.
If there is no externalisation, then, IMO, there is no need for a registry anymore. However, I do agree that SAFE should define logical linkages, e.g. via URIs (or, IMO, more precisely: URNs).
--
Stephan



Re: Simplification design trade-off

Dear all,

We have been discussing this issue internally and the whole team thinks more or less the same. At the beginning we felt that the document was a bit far from the objective and did not give us enough information to understand the real problem. We found it somewhat difficult to follow, and it seemed somewhat oriented towards the final decision. On the other hand, the discussion was not new: the pros and cons of a possible externalization were already well known to all of us and were discussed during the various meetings. Of course externalisation increases complexity, but other advantages were found that made the team agree on a common working approach.

Reading the e-mail exchange between Paulo and Stephan on this subject (thanks for adding it to the wiki tool), we understand that there are additional problems beyond reliability or complexity, such as the tools to be used. We agree with Paulo that if users do not have access to the representation information when reading a product, using SAFE or another format amounts to the same thing.

That said, if the team involved in SAFE development finds this issue too complex or risky compared to the possible benefits, it could make more sense to keep a single SAFE Package with all the information inside, as planned at the beginning.

Best regards,
Maria



Re: Simplification design trade-off

Dear All,

the DLR conclusion on PDGS-SAFE-GMV-TN-12/0171 v1.0, "LTDP SAFE Simplification design trade-off", is as follows:

We consider the externalisation approach valuable and object to dropping it.

Apart from losing the redundancy benefit, repackaging all AIPs of the same product type because of non-externalised representation information could be costly. Postponing this to a future media refreshment activity means having to manage the repackaging schedule, i.e. it introduces an operational procedure.
Regarding the static Level-0 products example: isn't it an even bigger nuisance if you would otherwise not touch the data except for media refreshment?

As these are fairly new standards, we expect changes in DFDL and in the metadata schemas as well.

Regarding identifiers, we also propose to use URNs with a consistent naming scheme allowing a simple mapping to physical addresses, possibly together with the archive system catalogue, which would have one or more configured base addresses. The semantics of the naming scheme should be preserved.
This would solve the migration issue (AIP exchange): references to external packages would either never change or all change equally (e.g. a datacenter prefix), and the semantics of the physical address mapping is well defined.
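
A minimal sketch (Python) of such a URN-to-physical-address mapping, assuming a hypothetical naming scheme in which the colon-separated URN segments map directly onto path segments under a per-archive configured base address:

BASE_ADDRESS = "https://archive.example-datacenter.int/safe"  # per-archive configuration

def urn_to_physical(urn: str, base_address: str = BASE_ADDRESS) -> str:
    """Map a logical URN to a physical address by a fixed, semantics-preserving rule."""
    prefix = "urn:safe:"
    if not urn.startswith(prefix):
        raise ValueError(f"Not a SAFE URN: {urn!r}")
    # The URN segments become path segments under the configured base address.
    path = urn[len(prefix):].replace(":", "/")
    return base_address + "/" + path

# Hypothetical usage: migrating the AIPs to another datacenter only changes BASE_ADDRESS,
# while the references stored in the Manifests stay the same.
print(urn_to_physical("urn:safe:repinfo:sentinel-1:l0-raw:dfdl:1.0"))
# -> https://archive.example-datacenter.int/safe/repinfo/sentinel-1/l0-raw/dfdl/1.0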

If there is a SAFE package for each and every non-DFDL XML Schema, plus one for each product-type DFDL, so be it. The non-DFDL schemas are only a few compared to the number of EO Product SAFE packages, and in principle their number is not growing; but carrying a good share of them in each EO Product SAFE package is a heavy load, especially the OGC and ISO schemas. The external schema packages can also be created automatically, as their representation information is the definition of XML Schema itself and therefore part of the Knowledge Base.
The number of DFDL schemas, one per product type, is likewise close to zero and only slowly growing, with peaks for new missions.

If an EO Product SAFE package is extracted in an environment where the XML Schemas are already extracted, the metadata can be validated immediately, without touching the respective Representation Information packages, by using the schemaLocation attribute from the metadata XML file. So in a DIP the XML Schema Representation Information packages could be optional, depending on the recipient's previous activities.
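
A minimal sketch (Python, using the lxml library) of this validation path, assuming a hypothetical metadata file whose xsi:schemaLocation entries are relative to an already extracted schema root:

from pathlib import Path
from lxml import etree

XSI_SCHEMA_LOCATION = "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation"

def validate_metadata(metadata_file: Path, schema_root: Path) -> bool:
    """Validate a metadata instance against the schema named in its own schemaLocation."""
    doc = etree.parse(str(metadata_file))
    root = doc.getroot()
    # xsi:schemaLocation holds whitespace-separated namespace/location pairs.
    tokens = (root.get(XSI_SCHEMA_LOCATION) or "").split()
    locations = dict(zip(tokens[0::2], tokens[1::2]))
    rel_location = locations[etree.QName(root).namespace]  # schema of the root namespace
    schema = etree.XMLSchema(etree.parse(str(schema_root / rel_location)))
    if not schema.validate(doc):
        print(schema.error_log)
        return False
    return True

# Hypothetical usage; file and directory names are invented for illustration.
# validate_metadata(Path("product/metadata.xml"), Path("/archive/representation-information"))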

Also, for reading a single EO Product SAFE package on a PC without network access, you would only need to produce a DIP containing all referenced information packages, now all referenced locally; and if you already have the non-DFDL schema or representation information packages, only the DFDL package(s) and auxiliary packages are needed.
In order to only examine the EO Product together with its metadata (without metadata validation), the non-DFDL schemas and possibly the auxiliary files are not needed either.

So we don't think that resolving all external references without duplicates makes reading or export/migration software much more expensive, but it is clear that more bookkeeping is necessary.

Regarding Auxiliary files, we propose to treat them as "normal" products with a DFDL schema per Auxiliary file type (in analogy with product types).

Best regards,
Bernhard


Re: Re: Simplification design trade-off

Not surprisingly, as it reflects EUMETSAT's position, I fully agree with Bernhard's point of view. A good summary.
--
Stephan


