Forum: LTDP SAFE

SAFE Design

Following the discussion on the Simplification trade-off, the SAFE Design Technical Note has been prepared.

The attached zip file “PDGS-SAFE-GMV-TN-13-0230_SAFE_Design” provides the analysis and conclusions reached on this topic. Two files are included in the zip: the SAFE Design document (PDGS-SAFE-GMV-TN-13/0230) and a Manifest example file.

All your comments will be appreciated.

Best Regards.

Fernando Ibáñez (GMV)


Re: SAFE Design

Just a note to say, because it has not been said here before, that in the last days of December ESA decided to go for externalisation in SAFE 2.0 and in the meantime has discussed with GMV a compromise between the solutions presented in the Simplification trade-off and a simpler solution that still implements externalisation, taking into account the feedback received from the Review Panel and the internal discussion at ESA.

The TN published here by GMV presents the proposed design for SAFE 2.0 with externalisation and this will be the final generic approach. We are looking for feedback from the Review Panel on the TN to finalise the details of the design.

Paulo



Re: SAFE Design

EUMETSAT has analysed the document provided. Please find attached our comments.
Regards
--
Stephan Zinke


Re: Re: SAFE Design

Thanks for the feedback, Stephan. These are useful points for discussion, and I will let GMV reply first.

I just wanted to say, regarding your point "3. Section 4 b.", that the reason for this - GMV may correct me if I'm wrong - is that you want the possibility to add additional Auxiliary files in association to an EO Product SAFE package after that SAFE package has been created, without changing the actual EO Product package. So, you have one logical ID for the EO Product Package's "Provenance Information" and an additional mapper to map these IDs to several Auxiliary Data SAFE packages' logical IDs. The fact that two "mappers" are necessary has not been explained and should be in the TN, if we think that this is a good solution.

As to why we would like to have this, one of ESA's main reasons is that it would allow us to start massive SAFE conversion of existing EO data without needing/depending on the associated Auxiliary Data to have been already converted. When we start an EO Product conversion activity, we don't want to be blocked by the Auxiliary data (we also have to admit that while work on the EO Product specialisations is more advanced, because we can reuse the work done for SAFE 1.3, most auxiliary data has to be specialised from scratch and will probably lag behind). Then, once auxiliary data SAFE specialisations are ready, we want to simply be able to perform auxiliary data conversion to SAFE and "implement" the association to EO Products via this other mapper, i.e. without having to change the EO Product packages, even if only at Manifest level.

Paulo


Re: Re: Re: SAFE Design

Dear Paulo,
thanks for the explanation on why this was introduced - IMO, if that's the sole reason it should be mentioned in the document.
However (!!!): I believe it is the wrong approach. I understand that by what you describe you have invented an uncontrollable hook to link data. That means as well, that although an EO product does not change, it may link to different aux files, depending on when you retrieve them from the archive. In my understanding this is a violation of data integrity.
While I understand the reason for this; I cannot say it looks like a clean solution to me.
I do not completely understand why the conversion of aux-data is the problem here. Is it because you assume that, when creating the EO Package, you wouldn't yet know the logical names of the aux packages and files? I believe this can be solved by applying a rigorous naming scheme (which one needs in any case). Obviously you want to refer to single aux files as well; that's why you have this counter as part of the logical identifier. I actually fail to understand what the problem is here. Maybe you can expand.
It would be interesting to see the opinions of other panel members.


Re: Re: Re: Re: SAFE Design

I was not trying to explain why this was introduced - for that I wait for GMV -, I just wanted to mention one case where it may be useful.

You are absolutely right about the uncontrollable hook and the data integrity, but, if we have externalisation, integrity becomes quite tricky because it has to be assessed on a distributed set of multiple files (how do you calculate an MD5sum on that for example?). What is the object of the integrity check? If it is just the EO Product SAFE Package, then its integrity will not change in this scenario, otherwise it will (yes, at different points in time it will come with different associated auxiliary files).

I also think that the uncontrollable hook was already there in SAFE 1.3, if I look at an example Manifest of ENVISAT ASAR I see references to auxiliary files, roughly matching the file name, but with SAFE 1.3 there were no provisions on how this physical link would be implemented and if there is no registry for these IDs, they are little more than a text string.

Maybe GMV can explain better, but I see two different things being proposed here:

1) the link to the externalised representation information
2) the link to the provenance information

For 1), it is true that it is being proposed to refer to single files, but these are XML or DFDL schemas inside Representation Information packages. Furthermore, the reference to single files is being done directly from inside the EO Product SAFE Package, even if using a logical identifier.

For 2), it is not being proposed to refer to single files from within the EO Product SAFE Package, only the URN-to-URN mapper is referring to single files. On the EO Product SAFE Package Manifest, there is a logical identifier for the product's "provenance information" and a 1-to-many relationship is then implemented via the mapper. Not knowing the names of the aux packages at the time of EO Product creation can, indeed, be solved by the rigorous naming scheme, but only for auxiliary files you know will be linked in the future. But is it an assumption that we can make, that the set of linked files will not change?
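
As a rough illustration of the 1-to-many URN-to-URN mapper described in 2), a minimal Python sketch follows. The URN strings, the dictionary-based store and the function names are all hypothetical; the TN does not prescribe any implementation.

```python
# Hypothetical sketch of a 1-to-many URN-to-URN mapper; every URN here is invented.
from collections import defaultdict

# Provenance URN (written once in the EO Product Manifest) -> auxiliary file URNs.
urn_to_urn = defaultdict(list)

def associate(provenance_urn, auxiliary_urn):
    """Link one more auxiliary file without touching the EO Product package."""
    if auxiliary_urn not in urn_to_urn[provenance_urn]:
        urn_to_urn[provenance_urn].append(auxiliary_urn)

def resolve_provenance(provenance_urn):
    """Return the auxiliary URNs currently linked to a provenance URN."""
    return list(urn_to_urn.get(provenance_urn, []))

PROV = "urn:x-safe:EXAMPLE:PRODUCT_0001:provenance"

before = resolve_provenance(PROV)                       # nothing converted yet
associate(PROV, "urn:x-safe:EXAMPLE:AUX_0001:file-1")   # aux data converted later
associate(PROV, "urn:x-safe:EXAMPLE:AUX_0002:file-1")
after = resolve_provenance(PROV)                        # same Manifest URN, new links
```

The sketch also makes the integrity concern visible: the Manifest never changes, yet `before` and `after` differ depending on when the mapper is queried.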

Paulo


Re: Re: Re: Re: Re: SAFE Design

Dear Stephan,

Find attached our answers to your comments. In any case, here is the GMV’s point of view regarding the open discussion in this thread.

As Paulo mentioned, there are two kinds of link in the new design: 1) representation information links (which refer to a single schema file) and 2) provenance information links (which refer to a single package, but not to single files).

For 1) we have decided to provide a specific URN reference to a single schema file, because in our understanding, it cannot be assumed that all schema versions introduced in a package (dfdl or metadata schemas) will be backward compatible (especially in operational cases). If this assumption is not correct, then the “file identifier” that we have defined in the URN for the schema files could be discarded. Therefore, only a reference to the whole package would be needed.

For 2) we decided to adopt the "URN-URN" mapper solution taking into account what Paulo has explained before: we have to ensure that the EO Product packages won’t be modified once they are archived, even if additional Auxiliary files (unknown in advance) are added afterwards. In our understanding, it is not only a matter of the auxiliary package’s name but also an issue of unexpected auxiliary packages that could be archived in the future. This is why we considered that this cannot be solved by simply introducing a rigorous naming scheme (which, in our opinion, goes against the requirements, where it is said that the package naming convention should be a recommendation).

The use of this "URN-URN" mapper could be discarded if it is assumed that all the applicable auxiliary data are known beforehand when a manifest file is created (this is in principle acceptable for ended missions).

In this case, it would be possible to include (in the manifest file) as many references as auxiliary files have been used to generate the EO Product (even if not all the auxiliary files are available in the archive). However, we see some drawbacks in this approach:

  • It has to be assumed that the archive should provide a manager for missing links among packages to handle those references to auxiliary packages that are not present in the archive.
  • In certain cases, the final manifest file would be relatively large, taking into account that some product types make use of a lot of auxiliary files. This would make the manifest difficult to read.
  • For operational missions (correct me if I’m wrong) the auxiliary files can vary in time so it is not possible to know which files will be needed in the future and this would imply a modification of the manifest file in all archived EO Product Packages to include the new reference (or to remove the “obsolete” ones).
  • In terms of long-term usability, we consider that it is not useful to have a link to a package that is not preserved (in SAFE format) within the LTDP archive. The information on those non-SAFE auxiliary data is assumed to be included in the applicable mission documentation.

Re: Re: Re: Re: SAFE Design

Dear All,
please find the DLR conclusions on the SAFE Design TN below:

Section 3: I think it is uncritical (except for interpreter software) to exchange DFDL schemas as they only describe the same product type in different ways. It must only be guaranteed that a new DFDL language version has at least the same expressive power as used for the SAFE product type descriptions in the previous language version. But this is only the case because DFDL schemas don’t have schema instances.
Therefore it is critical if metadata schemas are changed! You cannot expect backwards compatibility, so the logical identifiers must contain the schema version as Stephan proposed and as it is convention for schema location directory structures. A renaming scheme like adding a timestamp shall not be applied as it breaks the metadata validity! Adding a new schema is sufficient.

Section 4:
4.1: It is essential to NOT flatten out the metadata schema directory structure in metadata schema SAFE packages if you ever want to validate metadata. So for metadata schema SAFE packages the assumption of simply specifying the target filename for local references is wrong. Except if the file is a container (zip, tar, …) and you want to refer to the complete schema directory tree.

4.2: I think that a “file_identifier” should be the filename. If filenames are allowed in manifests why not in logical ids. Very easy to map.
Also, it could be helpful for being able to refer to schema directories like
“ogc/gml/3.1.1/” instead of just files. Still easy to map to the target file system syntax.
Typo: Change the definition to
“ = ::”.

4.3: Regarding the 1:n mapping of logical id to logical id, I fully agree with Stephan that a single logical identifier should be sufficient to reference any file in an external SAFE package. Acknowledging Paulo’s concern I would allow a 1:1 mapping for URN to URN in case a logical identifier goes astray during a deferred auxiliary SAFE package generation. But not 1:n! In addition each aux file shall be referenced in the manifest. So there would not be “additional auxiliary files”, only replacements for void pointers and no opaque mappings. However, I can imagine that it is difficult to find the right replacements if the references are broken. Meaningful file identifiers will help in this case too.
BTW I assume that ancillary products are subsumed under auxiliary here.

The “obsolete schema file” Mapper procedure in table 4-3 is wrong except for DFDL-schemas. Reassociating the URN will break metadata validity if the schemas are not backwards compatible.

Kind regards,
Bernhard


Re: Re: Re: Re: Re: SAFE Design

Somehow the typo correction string has been swallowed by the forum software. I wanted to suggest, in the NSS definition, moving the colon from the left of the opening square bracket to the right of it.

Re: Re: Re: Re: Re: SAFE Design

Dear Bernhard,

First of all, thanks to you and Stephan in particular for all your valuable feedback. We have been discussing all the issues very thoroughly with GMV and they will reply here about everything probably still today.

Regarding your points, I wanted however to say (and ask) two things:

3: GMV will propose to use directory structures identifying the RI package version inside each RI package. So, the schema renaming issue will not subsist. However, I was curious about why you say that it would not work. The TN says clearly that when the schema is renamed "the registration (i.e. mapper entry) will associate the existing URN to the new filename.". If this would be done, then it would seem to work (e.g. Manifest points to XYZ and mapper has an entry XYZ:schema.xsd . If schema.xsd becomes schema_21022013.xsd and you update the mapper entry to XYZ:schema_21022013.xsd, the Manifest does not need to be updated and the metadata RI is now pointing to schema_21022013.xsd instead of schema.xsd, so it is still the same schema used at the time of creation and you could validate using it). Unless you are saying that you will not be able to validate the metadata because of other schemas in the validation chain (the ones that schema_21022013.xsd depends on), is that the case?
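
The renaming case can be sketched in a few lines of Python. The URN "XYZ" and the two file names are the ones used in the example above; representing the mapper as a plain dictionary is only an assumption made for illustration.

```python
# Sketch of the renaming case; the dict-based mapper is an assumption.
mapper = {"XYZ": "schema.xsd"}

def rename_schema(mapper, urn, new_filename):
    """Re-register an existing URN against the renamed schema file."""
    if urn not in mapper:
        raise KeyError(f"URN not registered: {urn}")
    mapper[urn] = new_filename

manifest_reference = "XYZ"  # what the Manifest contains; it is never edited

rename_schema(mapper, manifest_reference, "schema_21022013.xsd")
resolved = mapper[manifest_reference]  # now resolves to the renamed file
```

Note that this covers only the one entry that was updated. Any schema that the renamed file depends on would need its own entry handled the same way, which is the validation-chain question raised above.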

4.3: I had the opportunity to discuss face-to-face with Stephan at ESRIN two weeks ago and we said that in principle ESA would be in favor of dropping this second 1:N mapper for auxiliary files because of the complexity it introduces, even if without it there are disadvantages in my view. In fact, I don't think the actual issue it tries to solve was understood.
ESA's concern is much more about the fact that at the time of EO Product creation we may have insufficient knowledge about the auxiliary files to be linked, but still want to go ahead with SAFE conversion, than with not having a "known" auxiliary file not ready at the time of EO Product creation. This is a very important distinction, because in the first case we have no idea of which auxiliary files were used, so it's even difficult to define a naming convention and put "void pointers" (how many?) in the Manifest (of course in this case it could be argued that the specialisation is incomplete and that when more information becomes available it will be updated, leading to a "new" EO Product and therefore it makes sense to re-transcribe), while in the second case we know about the auxiliary files and, if they are not physically available, can put a "void pointer" on the Manifest.
I did not understand your proposal for allowing a 1:1 relationship (not allowing a 1:N relationship) that still involves a URN:URN mapping (instead of a URN:physical mapping). I don't understand in particular if with URN:URN you mean "URN of an EO Product -> URN of an auxiliary file" (in this case it seems more or less the same that GMV proposed, but with different cardinality), or "(placeholder) URN of auxiliary file at time of EO Product creation-> URN of auxiliary file when it is finally created".

Thanks,

Paulo


Re: Re: Re: Re: Re: Re: SAFE Design

Dear Paulo,

sorry for the late reply, I was ill last week.

Regarding your two points:
>3: GMV will propose ...
With the temporal distance to my writing I admit that it might work with the mapper. But I consider it rather error-prone, considering the many schema files that could change. It would require a lot of mapping just to unpack the standard schemas for an aged (i.e. with some schema changes) SAFE product.

>4.3: I had the opportunity ...
There I only had in mind that you could map a void pointer to a legal logical identifier, i.e. a URN which points nowhere to a URN which could be mapped to a physical address, in order to avoid a change in the SAFE Product from which the auxiliary file shall be referenced.
And yes, this is the GMV proposal with only 1:1 cardinality.
Regarding 1:1 vs 1:N, I wanted to avoid opening a Pandora's box of hard-to-trace relationships.
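
A minimal Python sketch of such a 1:1 URN-to-URN mapping is given here purely as an illustration. The class name, its methods and the URNs are hypothetical and not part of any proposal in this thread.

```python
# Hypothetical 1:1 mapper: one placeholder ("void pointer") URN -> one real URN.
class OneToOneUrnMapper:
    def __init__(self):
        self._targets = {}

    def bind(self, placeholder_urn, real_urn):
        """Bind a placeholder to a real URN; rebinding to another target is refused."""
        current = self._targets.get(placeholder_urn)
        if current is not None and current != real_urn:
            raise ValueError(f"{placeholder_urn} is already bound to {current}")
        self._targets[placeholder_urn] = real_urn

    def resolve(self, urn):
        """Return the real URN, or None while the pointer is still void."""
        return self._targets.get(urn)

mapper = OneToOneUrnMapper()
VOID = "urn:x-safe:EXAMPLE:AUX_PLACEHOLDER:file-1"

unresolved = mapper.resolve(VOID)                     # void pointer at product creation
mapper.bind(VOID, "urn:x-safe:EXAMPLE:AUX_0001:file-1")
resolved = mapper.resolve(VOID)                       # replaced once the aux file exists
```

Refusing a second, different binding is what keeps the relationship traceable: a placeholder can be filled in once, but no additional files can be attached to it.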

I also don't quite understand the scenario of an unknown number of unknown auxiliary files: If you archive an EO product, isn't the provenance information completely known? So I assume now that you are referring to non-nominal cases where, say, an old archive has not archived the complete provenance info but wants to transcribe their archived data to SAFE?

Regards,
Bernhard


Re: Re: Re: Re: Re: Re: Re: SAFE Design

>
> I also don't quite understand the scenario of an unknown number of unknown auxiliary files: If you archive an EO product, isn't the provenance information completely known? So I assume now that you are referring to non-nominal cases where, say, an old archive has not archived the complete provenance info but wants to transcribe their archived data to SAFE?
>

I should have been more concrete in fact. Yes, this is what I am talking about. We are already facing exactly this kind of problem for our Third-Party Missions, but even for ERS LBR data, for which the processors have been developed and are still largely under control of the ground facilities, more than ESA itself (e.g. DLR for GOME, UK for ATSR, CNES for Radar Altimeter, etc.).

For ERS LBR, although we have a rather complete list of all auxiliary files to consider, there simply do not exist documents describing their format, often because they are in some ad-hoc binary format decided by the developer on-the-fly (from my experience, this happens when they want to use some standard model for weather, DEM, gravity, etc. and just wrap it in a format which makes their life easier). Also the documentation about the processing chain is usually fragmented and in disparate formats, which means that it takes quite some time to arrive at a complete picture of what to preserve and link - it is also typical that you don't have this provenance information as part of the metadata, or somehow registered. In some cases, even, it is disputable whether a piece of information should be seen as data or as part of the program/processor, since it may come bundled with it and change with the version of the processor (this is more for ancillary than auxiliary data).

I have a very similar case for SeaStar (Orbview-2) SeaWiFS. See for example the discussion on this thread about an "elements.dat" file, in which an actual processor developer on the NASA side participates (the other guy was working for ESA). At ESA we have a SeaWiFS processor that relies on this file, but as you can see there, you don't have a document describing its structure (this is not important for the processor because it relies on a tool to read the file). Eventually, from the tips on that thread, by contacting other people for help, by analysing the tools and some reverse engineering, you may understand how it's structured, but this takes time and money, and doing it for each and every auxiliary/ancillary file you want to preserve is a large task.

This latter case could be solved by a "void pointer" and/or your 1:1 mapper solution (because we know about the file and could define URNs), but what about other even more complex cases? SeaWiFS is actually a relatively recent mission and we still have a lot of useful knowledge and industry know-how. But we already have data from 1970s missions and this is no longer the case.

So, in the end, the issue is just that a 1:N mapper would be useful for addressing this kind of case of insufficient knowledge, not having to re-transcribe EO Products once you know more (or doing partial additions as you build your knowledge). This is, of course, with the intention of starting to convert things to SAFE, because if you wait until you know "everything" (whatever that means), I think you will never do it. Anyway, ESA's current position is that we can do without this 1:N mapping and accept the risk that we may need to re-transcribe EO Products if we want to preserve different provenance information. This is acceptable because the scale of the problem is limited: the advantage of very old missions is that the data volumes are very small for today's standards.

Paulo


Re: Re: SAFE Design

First of all, thanks to all for your feedback about the SAFE Design. As Paulo has commented before, we have been discussing all the issues very thoroughly with ESA and here is our point of view (henceforth “new approach”) regarding the open discussion in this thread:


1. Package directory structure.

The internal files of the SAFE EO Product/EO Auxiliary/EO Collection Packages will be organised according to the following structure:

< package_name > 
             |_____  < manifest_file > 
             |_____ “measurement”  
                       |_____  < measurement_file(s) > 
             |_____ “metadata” 
                       |_____  < metadata_file(s) > 
             |_____ “index”  
                         |_____  < index_file(s) > 



The “measurement” directory is not used in EO Collection Packages, but it is mandatory for SAFE EO Product and EO Auxiliary Packages.

The “index” optional directory will be only part of the SAFE EO Product Packages.

Notation: The naming criteria for text between < > will be described in the Recommendation for Specialisation document. On the other hand, the text between “ “ is proposed to be fixed.
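
As an illustration only, the layout above could be created with a short Python helper. The package-type keys, the package name and the manifest file name "manifest.xml" are placeholders; the actual naming rules are left to the documents mentioned above.

```python
# Sketch: build an empty package skeleton following the structure above.
from pathlib import Path
import tempfile

# Fixed directory names from the structure; the package-type keys are hypothetical.
LAYOUT = {
    "eo_product":    ["measurement", "metadata", "index"],  # "index" is optional
    "eo_auxiliary":  ["measurement", "metadata"],
    "eo_collection": ["metadata"],                          # no "measurement" here
}

def create_skeleton(root, package_name, package_type, manifest_name="manifest.xml"):
    """Create the package directory, an empty manifest and the fixed directories."""
    package = Path(root) / package_name
    package.mkdir(parents=True)
    (package / manifest_name).touch()  # placeholder manifest file name
    for directory in LAYOUT[package_type]:
        (package / directory).mkdir()
    return package

with tempfile.TemporaryDirectory() as tmp:
    pkg = create_skeleton(tmp, "EXAMPLE_PACKAGE", "eo_collection")
    created = sorted(p.name for p in pkg.iterdir())
```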


2. Versioning.

A sequential "Representation Information version number" directory will be used for the Representation Information Packages, and, even if only one of the schemas is updated, the new version directory should include a copy of all "previous" schemas (starting with 1.0).

The internal files of the SAFE Representation Information Packages will be organised according to the following structure:

< package_name > 
             |_____  < manifest_file > 
             |_____ “safe-xfdu” 
                       |_____ [ version ] 
                                 |_____ < file_name > 
             |_____ “conformance” 
                       |_____ [ version ] 
                                 |_____ < file_name > 
             |_____ “measurement” 
                       |_____ [ version ] 
                                 |_____ < file_name > 
             |_____ “metadata” 
                       |_____ [ version ] 
                                 |_____ < file_name > 
             |_____ “index” 
                       |_____ [ version ] 
                                 |_____ < file_name > 


Note that not all the directories will exist in all SAFE Package types. Thus:

  • For Base Schemas Packages: All the directories (i.e. “safe-xfdu”, “conformance”, “measurement”, “metadata” and “index”) are mandatory.
  • For DFDL Schemas Packages: Only the “measurement” and “index” directories are mandatory, the other directories are not used in these packages.
  • For Metadata Schemas Packages: the “safe-xfdu”, “conformance” and “metadata” directories are mandatory, the other directories are not used in these packages.
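
The three rules above can be expressed as a small lookup, sketched here in Python. The dictionary keys ("base", "dfdl", "metadata") and the function name are hypothetical labels chosen for this example.

```python
# Sketch: mandatory directories per Representation Information package type.
MANDATORY = {
    "base":     {"safe-xfdu", "conformance", "measurement", "metadata", "index"},
    "dfdl":     {"measurement", "index"},
    "metadata": {"safe-xfdu", "conformance", "metadata"},
}

def check_directories(package_type, present):
    """Return (missing, unexpected) directory names for the given package type."""
    expected = MANDATORY[package_type]
    present = set(present)
    return sorted(expected - present), sorted(present - expected)

# A DFDL Schemas Package that lacks "index" and wrongly carries "metadata".
missing, unexpected = check_directories("dfdl", ["measurement", "metadata"])
```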


The files included in each directory are the following:

  • The “safe-xfdu” directory will contain the Specialised XFDU Schemas.
  • The “conformance” directory will contain the Specialised Conformance Schemas.
  • The “measurement” directory will contain the Specialised DFDL Schemas for the EO Product or Auxiliary files.
  • The “metadata” directory will contain the Specialised OGC EOP O&M Schemas and Auxiliary Schemas.
  • The “index” directory will contain the Specialised DFDL Schemas for the index files.
  • The Base Schema Package is a particular package where the base Schemas and SAFE base Schemas will be added to each directory accordingly.


Notation: The naming criteria for text between < > will be described in the Recommendation for Specialisation document. On the other hand, the text between “ “ is proposed to be fixed and the text between [ ] will change depending on the version.

As the directory structure with a version number will be adopted, the timestamp described in the TN approach will not be needed.

Different versions of a DFDL schema could exist in the same Rep. Info Package, but they are assumed to be always backward compatible.

It is assumed that the metadata representation information is not backward compatible, and identical EO products containing different metadata structures will be considered different EO Products.


3. Logical Identifier.

It has been concluded that the URNs should refer to files instead of packages. As a consequence, three URN types can be specified in the manifest file according to the new approach:

  • Representation Information: One URN for each DFDL or Metadata Schema file.
  • EO Auxiliary file: One URN for each EO Auxiliary file.
  • EO Collection: One URN for each Metadata collection file.


The URN for the external files taking into account the previous conclusions will be as follows:

< URN > = urn:
< NID > = x-safe:
< NSS > = < mission_identifier >:< package_name >:< category >:< version >:< file_name >

Where:

  • < category > can be “safe-xfdu”, “conformance”, “measurement”, “metadata” or “index”.
  • < version > is the version directory seen above.


An example is as follows:

urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD:metadata:1.0:safe-atm-spec.xsd
urn:x-safe:ENVISAT:EN01_ASA_WS_0P_DFDL:measurement:1.0:dfdl-spec.xsd


The URNs should be known beforehand, since the filenames of the schemas, auxiliary files and collection metadata files are known. However, a tool (specific to each mission/product) should be in charge of generating them, and this is out of the scope of SAFE.

The naming criteria will be described in the Recommendation for Specialisation document.


4. URN Mappers.

In the TN approach two mappers were proposed: URN-Physical and URN-URN. However, according to what has been discussed in the forum, and given the new proposal for the URN specification (URNs refer to files instead of packages), only a URN-Physical mapper will be used.

All the URN references to the EO Auxiliary files should be specifically written in the Manifest of each EO Product Package.

Therefore, there are three possible cases:

  • The EO Auxiliary file is known and the EO Auxiliary file exists physically in the archive. In this case the URN could be created without problems.
  • The EO Auxiliary file is known and the EO Auxiliary file does not exist physically in the archive. In this case the URN could be created. However, the physical location in the mapper cannot be set.
  • The EO Auxiliary file is not known. In this case the URN cannot be created. Only when the “unknown” auxiliary file becomes available, the EO Product Package would be updated (re-transcribed). The drawback of the EO Product Package re-transcription is assumed.
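
The three cases can be summarised in a short Python sketch. The enum, the function and the URNs are hypothetical; the Manifest is reduced to a set of URNs and the URN-Physical mapper to a dictionary.

```python
# Sketch of the three cases for an EO Auxiliary file reference.
from enum import Enum

class AuxStatus(Enum):
    RESOLVABLE = "URN in Manifest, physical location set in the mapper"
    UNRESOLVED = "URN in Manifest, physical location not yet set"
    UNKNOWN = "no URN in Manifest; re-transcription needed later"

def classify(aux_urn, manifest_urns, urn_physical_mapper):
    """Classify one auxiliary file reference for one EO Product Package."""
    if aux_urn not in manifest_urns:
        return AuxStatus.UNKNOWN
    if urn_physical_mapper.get(aux_urn) is None:
        return AuxStatus.UNRESOLVED
    return AuxStatus.RESOLVABLE

# Invented example data.
AUX_A = "urn:x-safe:EXAMPLE:AUX_PKG_A:measurement:1.0:aux-a.dat"
AUX_B = "urn:x-safe:EXAMPLE:AUX_PKG_B:measurement:1.0:aux-b.dat"
AUX_C = "urn:x-safe:EXAMPLE:AUX_PKG_C:measurement:1.0:aux-c.dat"

manifest_urns = {AUX_A, AUX_B}                       # AUX_C was unknown at creation
urn_physical_mapper = {AUX_A: "/archive/AUX_PKG_A"}  # AUX_B not yet archived

statuses = [classify(u, manifest_urns, urn_physical_mapper) for u in (AUX_A, AUX_B, AUX_C)]
```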


With respect to the “void pointers” commented in the previous post, in our opinion, there are no differences if we include a “void pointer” or we don’t include anything in the manifest because a re-transcription is always needed in this case.
We assume that the generation of an identical EO Product (to include only the new reference to a previously “unknown” auxiliary file) should be discarded in order to avoid redundancy in the archive.

It is not very clear to us whether it is more costly to generate a new EO Product Package including the new references to the auxiliary data (and remove the existing EO Product package from the archive) or whether it is easier to simply update the existing EO Product to include the appropriate references. Do you have any hint?


5. Other issues.

It is agreed to use “previous” rather than “obsolete” in the schema versioning context, because “obsolete” could lead to confusion.

An updated version of the manifest examples will be posted in the forum once all the doubts have been clarified.


Re: Re: Re: SAFE Design

Dear GMV,
I have a few questions regarding your post, as below:

re. 1)
a) "The naming criteria for text between < > will be described in the Recommendation for Specialisation document".
I would assume that the naming of the manifest file will still be described in the base SAFE standard.

b) Will the structure presented here be part of the SAFE standard or part of the recommendation for specialisation?


re. 2)
a) "A sequential "Representation Information version number" directory ..."
I am wondering what exactly you mean by 'sequential'. Why would it need to be sequential (when seeing 'sequential' I assume whole numbers like 0,1,2, etc., but can I use as well 1.1, 1.2, 1.3, 1.6?, can I have gaps in the numbering?)

b) "even if only one of the schemas is updated, the new version directory should include a copy of all "previous" schemas "
What is meant by "include copies of all previous schemas"? I would understand that the directory "v1.0" would contain a file abc.xsd, and "v2.0" as well, but how could the v1.0 file be in the same directory? And why would it need to be?

c) Then I would as well assume that you can have more than one file in the directory, so the proper lingo would be

d) "backward compatible of DFDL"
I guess, one cannot assume this, hence no limitation shall be placed.

e) "identical EO products containing different metadata structure, will be considered as different EO Products".
If you say "identical" you probably refer to the measurement files etc. I would assume this is not only applicable to the different metadata structures but as well to say different DFDL versions.
There is a case in Sentinel 3 where we are to keep 2 different baseline versions of a same product. What the difference exactly is, is not known today. One reason could be different interpretation of the data in the measurement files, i.e. different "DFDL" schemas.


re. 3)
a) Why shall the naming convention for URNs be only described in the recommendation for specialisation and not part of the (base) SAFE standard?

b) Please highlight that the URNs used for references need as well to be used inside the package (manifest) where the referenced file is located. [cf. comment on 4) below]

re. 4)
It is concluded that only URN-physical mapper will remain. It should be pointed out that "physical" in this case means a physical package, not a physical file. It should be explained then as well, how one would actually retrieve the referenced file (e.g.: step 1) locate the package the file is contained in via the URN-physical mapper. step 2) retrieve the package. step 3) open the manifest to find the physical reference of the file mentioned by the URN)
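
The three retrieval steps could look roughly like the Python sketch below. Everything in it is a stand-in: the archive is a dictionary, the "physical" value is an invented location string, and the Manifest is reduced to a URN-to-relative-path table.

```python
# Sketch of the three-step retrieval, using in-memory stand-ins for mapper and archive.
AUX_URN = "urn:x-safe:EXAMPLE:AUX_PACKAGE:measurement:1.0:aux-file.dat"

# URN-physical mapper: "physical" means the package, not the file.
urn_to_package = {AUX_URN: "archive://example/AUX_PACKAGE"}

# Hypothetical archive content: each package exposes its Manifest as URN -> path.
archive = {
    "archive://example/AUX_PACKAGE": {
        "manifest": {AUX_URN: "measurement/aux-file.dat"},
    }
}

def retrieve_file_reference(urn):
    """Return (package location, file path inside the package) for a URN."""
    package_location = urn_to_package[urn]        # step 1: locate the package
    package = archive[package_location]           # step 2: retrieve the package
    relative_path = package["manifest"][urn]      # step 3: look the URN up in its Manifest
    return package_location, relative_path

location, relative_path = retrieve_file_reference(AUX_URN)
```

Step 3 only works if the referenced package's own Manifest lists the same URN, which is the point made in 3 b) above.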

--
Stephan Zinke for EUMETSAT


Re: Re: Re: Re: SAFE Design

Dear Stephan,

Thank you very much for your quick response. Please, find below our comments.

re. 1)
a) Yes, the manifest filename specification is part of the Core specification.

b) Yes, The structure will be part of the SAFE standard.

re. 2)
a) OK, the word “arbitrary” is better here; version numbers such as 1.1, 1.2, etc. are fine.

b) The v1.0 file will not be in the “v2.0” directory; that is, only the “v2.0” file will be located in the “v2.0” directory. Our idea is to carry over the unchanged schemas to keep the directory consistent, but not the previous version of the updated schemas. Take, for example, the representation information for the metadata files (i.e. Specialized OGC EOP O&M). In this case the same directory can hold several files (i.e. safe-eop.xsd, safe-opt.xsd, safe-atm.xsd, etc.) with dependencies between them. The directory "v1.0" would contain e.g. safe-eop.xsd, safe-opt.xsd and safe-atm.xsd, and the directory “v2.0” would contain e.g. safe-eop.xsd (updated), safe-opt.xsd and safe-atm.xsd (supposing that only the safe-eop.xsd file from “v1.0” has been updated). The other two files, safe-opt.xsd and safe-atm.xsd, are added to the “v2.0” directory to keep the consistency.
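
A small Python sketch of this versioning step is given below as an illustration. The helper name and the placeholder file contents are invented; the schema file names are the ones from the example above, and the version directories are named as in the URN example ("1.0", "2.0").

```python
# Sketch: create a new version directory that carries over unchanged schemas.
import shutil
import tempfile
from pathlib import Path

def new_version(category_dir, old, new, updated):
    """Copy version `old` to `new`; `updated` maps file name -> new file content."""
    src, dst = Path(category_dir) / old, Path(category_dir) / new
    shutil.copytree(src, dst)                 # every schema is carried over
    for name, content in updated.items():
        (dst / name).write_text(content)      # only updated schemas are overwritten
    return dst

with tempfile.TemporaryDirectory() as tmp:
    v1 = Path(tmp) / "metadata" / "1.0"
    v1.mkdir(parents=True)
    for name in ("safe-eop.xsd", "safe-opt.xsd", "safe-atm.xsd"):
        (v1 / name).write_text(f"<!-- {name} v1.0 -->")  # placeholder content

    v2 = new_version(Path(tmp) / "metadata", "1.0", "2.0",
                     {"safe-eop.xsd": "<!-- safe-eop.xsd v2.0 -->"})

    v2_files = sorted(p.name for p in v2.iterdir())
    unchanged = (v2 / "safe-opt.xsd").read_text() == (v1 / "safe-opt.xsd").read_text()
    v1_intact = (v1 / "safe-eop.xsd").read_text() == "<!-- safe-eop.xsd v1.0 -->"
```

The old version directory is left untouched, so a Manifest that references version 1.0 still finds the complete, mutually consistent set of schemas it was created against.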

This approach was taken from internal discussion with ESA and because we think Bernhard in the past was suggesting this option too. Bernhard, please, can you confirm this? Thanks!

c) See our previous comment.

d) We are assuming that all DFDL Schemas included in a single package will be compatible because they represent the same binary data. Incompatible DFDL schemas will be placed in a separate SAFE DFDL Schemas Package.

e) Having one measurement and two different metadata, i.e. different metadata structures for the same measurement file, would lead to a different EO Product Package.

re. 3)
a) The naming convention for URNs will be part of the SAFE standard.

b) Please note the difference between “internal references” inside the same Package, for which URNs are not used, and “external references”, for which URNs are used to point to the file through the mapper.

re. 4)
It depends on the mapper implementation, because the URN still points to the file; other implementation approaches for the “physical” values in the mapper are possible and, as is known, the mapper implementation is out of the SAFE scope. However, your approach is OK for us too. It will be used for the examples in the Core Specification document.

Best Regards,

GMV Team.


Re: Re: Re: Re: Re: SAFE Design

Dear GMV,
thanks for the quick responses. I'm mostly fine with what you say, except I have the following additional comments:
2b) Thanks for clarifying this. For sure, all the files which are necessary for one version need to be kept together, even if some of them actually do not change.
2d) What if the product format of a product changes during its lifetime, e.g. because you update the on-board SW due to some problems? Would you then have a separate DFDL schema package, although it is in principle still the same product type? This is not completely clear to me. If what you say is true, the question would be why there is a need for versioned DFDL schemas at all...
3/4)
regarding the internal and external mapping.
From what I understand the (external) URN-physical mapper can only map a URN to a physical package, but not to a file, as the package is the smallest entity you persist in the LTA.
But then you say, rightly, that the URN actually identifies a single file. Now you need to map the (externally referenced) package and the file found in the URN. My assumption is that this can only sensibly be done using the manifest of the referenced file. The proposal would be to use the same URNs there...

Regards
--
Stephan


Re: Re: Re: Re: Re: Re: SAFE Design

Dear Stephan,

Thank you again for your quick response.

2d) We see your point of view. The DFDL versioning was concluded from the internal discussion with ESA because ESA was interested in having DFDL versions as well. Paulo, please, correct us if we are wrong.

3/4) Stephan, we are a bit confused with this point, all the GMV Team have had an internal discussion about this point and we would like to clarify it asap. Our understanding is as follows:

  • It is concluded that only URN-physical mapper will remain.
  • The URN will be compliant with the following format:
urn:x-safe:< mission_identifier >:< package_name >:< category >:< version >:< file_name >
  • The Manifest of the e.g. EO Product Package will point to the external files (contained in a RI Package) using the above URN format via the mapper. This is for us the “external references”.
  • The URN-physical mapper will have the relationship between the URNs (as has been described above) and the physical location of the referenced package (i.e. physical RI Package). Note that in the mapper all the URNs of the files contained in the same RI Package will have the same physical package mapping.
  • In this case, as you commented before, the steps are:
      • Step 0. The URN is read from the Manifest of e.g. the EO Product Package.
      • Step 1. Locate the RI Package where the external file is contained via the URN-physical mapper.
      • Step 2. Retrieve the RI Package, because you already know the physical location of this package via the mapper.
      • Step 3. Open the Manifest file of the RI Package to find the physical reference of the file mentioned by the URN. Note that with the proposed URN format the physical reference of the file could be known from < category >:< version >:< file_name >, although the format is different because “/” is used instead of “:”.

For example:
urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD:metadata:1.0:safe-atm-spec.xsd (URN).
“fileLocation” element with the href=”metadata/1.0/safe-atm-spec.xsd” (This is for us the “internal references”).

Stephan, do we share the same approach?

We don’t understand the sentence “The proposal would be to use the same URNs there...”. Are you proposing to use the URN as “internal references” as well?

Thank you very much for all.

Best Regards,

GMV Team.


Re: Re: Re: Re: Re: Re: Re: SAFE Design

Dear GMV,
I believe the only question open is on how do you identify (retrieve) the actual file in step 3, i.e. how to unambiguously identify the file itself.
I would propose not to implement anything static, in the sense that you wouldn't open the manifest (e.g. by translating the information of the URN directly into a physical path). That leaves us with opening the manifest first. Now, I wonder, how do you identify the element which defines the referenced file? This can be done, IMO, only if you use the same URN (or parts of it, tbc) inside the manifest as the @ID attribute of an element which would then point via the fileLocation element through the href attribute to the physical path.
May I suggest that you provide an XML snippet of the manifest of the externalised package to show the details in your approach?
I.e., I am not clear how you apply/build the internal reference.
Regards
--
Stephan


Re: Re: Re: Re: Re: Re: Re: Re: SAFE Design

Dear Stephan,

As mentioned in our previous comment, our idea is to have a self-contained URN format; that is, with the proposed URN format the physical reference of the file can be derived from:

< category >:< version >:< file_name >

The above piece of the URN format should be the same as the internal directory structure of the package. Therefore, a priori, it is not necessary to open the Manifest to know the location of the file in the package. However, a “resolver tool/API” would be needed in the archive to perform the translation from the URN format (which uses “:”) to the physical path format (which uses “/”), but this “resolver tool/API” is considered out of the SAFE scope.

A URN example could be the following:

urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD:metadata:1.0:safe-atm-spec.xsd

With the mapper you know the location of the physical package in the archive, that is, using the URN-Physical (urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD – Physical) mapping. Once the physical location of the package is known, the next step is to concatenate the physical package location with “metadata/1.0/safe-atm-spec.xsd”.
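For illustration only, the translation described above could be sketched in Python as follows. The mapper content, the archive path "/archive/ri/..." and the function names are assumptions made for this example; the actual “resolver tool/API” is out of the SAFE scope.

```python
# Sketch of the "self-contained URN" resolution, assuming the 7-field format
# urn:x-safe:<mission_identifier>:<package_name>:<category>:<version>:<file_name>

def split_file_urn(urn):
    """Split a 7-field file URN into (package URN, relative path)."""
    fields = urn.split(":")
    if len(fields) != 7 or fields[:2] != ["urn", "x-safe"]:
        raise ValueError(f"not a SAFE file URN: {urn!r}")
    package_urn = ":".join(fields[:4])     # urn:x-safe:<mission>:<package>
    relative_path = "/".join(fields[4:])   # <category>/<version>/<file_name>
    return package_urn, relative_path


def resolve_directly(urn, mapper):
    """Concatenate the mapped package location with the path taken from the URN."""
    package_urn, relative_path = split_file_urn(urn)
    return mapper[package_urn].rstrip("/") + "/" + relative_path


# Hypothetical mapper entry: package URN -> physical package location.
MAPPER = {"urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD": "/archive/ri/EN01_ASA_WS_0P_MTD"}
URN = "urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD:metadata:1.0:safe-atm-spec.xsd"

print(resolve_directly(URN, MAPPER))
```

Note that this variant never opens the Manifest: the path inside the package is taken from the URN alone.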

With respect to the identification of the referenced file in the Manifest (i.e. supposing the Manifest is opened), the @ID attribute of the “dataObject” containing the “fileLocation” could be derived from the file name (which is part of the URN format) using a “resolver tool/API” to translate from the URN format (which uses “:”) to the ID format (which uses “_”). Please, see the following XML snippet:

 <dataObjectSection>
      <dataObject ID="metadata_1.0_safe-atm-spec" repID="safeATMBaseSchema">
         <byteStream mimeType="application/octet-stream">
            <fileLocation locatorType="url" textInfo="Metadata Schema Data"
                          href="metadata/1.0/safe-atm-spec.xsd"/>
            <checksum checksumName="MD5">93d1daca226ba957472c567a7fc33427</checksum>
         </byteStream>
      </dataObject>
 </dataObjectSection>


Why not use the ":" in the @ID attribute? Because the @ID attribute is of type xsd:ID, which must be an xsd:NCName, and in this type the colon (“:”) is forbidden.

On the other hand, why not use the "/" in the URN format? Because RFC 2141 (URN Syntax) says the following about the NSS format:
"RFC 1630 reserves the characters "/", "?", and "#" for particular purposes. The URN-WG has not yet debated the applicability and precise semantics of those purposes as applied to URNs. Therefore, these characters are RESERVED for future developments. Namespace developers SHOULD NOT use these characters in unencoded form, but rather use the appropriate %-encoding for each character."

And we prefer to avoid using the %-encoding.
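As a rough illustration of the two constraints, the “:” to “_” translation and the NCName restriction could be checked as sketched below. The regular expression is an ASCII-only approximation of the NCName rule (the real definition admits many more characters), and the function names are made up for this example.

```python
import re

# ASCII-only approximation of NCName (assumption: non-ASCII name characters,
# which the real definition allows, are ignored in this sketch).
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_ncname_ascii(value):
    """Return True if value looks like an (ASCII) NCName, i.e. has no colon."""
    return NCNAME_ASCII.fullmatch(value) is not None


def urn_tail_to_id(tail):
    """Translate '<category>:<version>:<file_name>' into an @ID candidate."""
    category, version, file_name = tail.split(":")
    stem = file_name.rsplit(".", 1)[0]   # drop the extension, as in the snippet above
    return "_".join([category, version, stem])


candidate = urn_tail_to_id("metadata:1.0:safe-atm-spec.xsd")
print(candidate, is_ncname_ascii(candidate))
print(is_ncname_ascii("metadata:1.0:safe-atm-spec"))   # the colon is rejected
```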

Best Regards,

GMV Team.


Re: Re: Re: Re: Re: Re: Re: Re: Re: SAFE Design

Dear GMV,
I have understood your approach; however, to be honest, I must say I don't really "like" it. A lot of logic seems to be implied in this approach, which is partly limiting and partly not intuitive.
Frankly, in my understanding, one wouldn't need to be able to deduce the location of a certain file from the URN itself.
May I suggest an alternative approach, to be discussed:
Do not include the physical name of the file in the URN, but something similar, e.g.
urn:x-safe:< mission_identifier >:< package_name >:< category >:< version >:< file_id >, where file_id could be something "logical" to uniquely identify a file, e.g. safe-atm-spec-1.0
It would be debatable then if you'd still need the version as part of the URN, but I don't see problems having it there, actually, I like it being a separate part of the URN.
However, you could even use something arbitrary as the file_id, e.g. "123456".
The @ID of the referenced dataObject would then use this file_id, e.g.
<dataObjectSection>
<dataObject ID="safe-atm-spec-1.0" repID="safeATMBaseSchema">
<byteStream mimeType="application/octet-stream">
<fileLocation locatorType="url" textInfo="Metadata Schema Data"
href="metadata/1.0/safe-atm-spec.xsd"/>
<checksum checksumName="MD5">93d1daca226ba957472c567a7fc33427</checksum>
</byteStream>
</dataObject>
</dataObjectSection>
With this approach you're decoupling the URN from the physical location, which I believe would be a good approach.

I'd like to hear the opinion of the others (e.g. ESA:Paulo or DLR:Bernhard) on this topic.
--
Stephan


Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: SAFE Design

We're also very interested in hearing other opinions.

However, we fail to understand why you are now proposing the same approach that we already proposed in the TN (i.e. use of arbitrary file_ids).

IMO your new alternative goes against your initial objections to our URN specification:

3. Section 4 a) ".....I propose to identify both packages and names by logical identifiers which are somehow meaningful. The given example in section 5.1 as well places a big burden on the mapper as it requires to map not only physical positions of packages but as well of single files. I would expect that this is over the top and that it can be solved differently, as each single file can be identified inside a package"

and it is also against Bernhard's proposal:

4.2: I think that a “file_identifier” should be the filename. If filenames are allowed in manifests why not in logical ids. Very easy to map.

For us, the arbitrary definition of a file_id is obviously OK, but it was changed in response to the comments in this thread.

Regards

GMV Team


Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: SAFE Design

A small comment on your last remarks.
"why you are now proposing the same approach that we already proposed in the TN (i.e. use of arbitrary file_ids). "
>> Because now the concept changed, i.e. that of the mapper. You do not map files anymore through the mapper (which you proposed originally).
I mentioned "arbitrary" just as a possibility in principle; the other proposal I made was something like -
And that just enhances Bernhard's proposal. Obviously we need the version-info in the ID.
Regards
--
Stephan


Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: SAFE Design

Regarding the file-name mapping, I didn't mean to drop the mapping step by using the file-name as file-id but to obtain a simple mapping.

Stephan's proposal is fine for me.

Regards,
Bernhard


Re: Re: Re: Re: Re: SAFE Design

Dear GMV Team,

> Bernhard, please, can you confirm this? Thanks! smile

Confirmed cool

Regards,
Bernhard


Re: Re: Re: SAFE Design

Hi.

I will reply only once to this post and all the other subsequent replies.

1) On the most important remaining issue, I agree with Stephan that we should not deduce the relative physical location inside the RI package from the original URN, because this is like "cheating" or skipping a step and it may carry risks (I am thinking, for example, of a file system that uses a different character for separating paths - e.g. '\' instead of '/', although you can overcome it). It is maybe mainly a matter of elegance and cleanliness, but it would be better to rely on the Manifest and so the process should include reading the Manifest of the RI package to map the internal reference to the relative physical location of the files. This also gives us extra flexibility if we want to move files and/or change their names inside the RI packages.

2) In my opinion, with this design, the backwards compatibility of DFDL is irrelevant and no assumptions need to be made. If your DFDL, maybe in a newer syntax, represents the same structure, then it should be in the same RI package in a new version directory. If it is a new structure, then it should be in a different RI package as a new kind of product (even if it's actually the same product type, in the cases that Stephan mentioned).
The same is valid for metadata (although here it is less likely that you will have backwards compatibility). You can have the same binary file and different metadata and it would be possible both to view it conceptually as the same kind of product or a different product altogether. What constitutes a kind of product? That it points to the same DFDL AND METADATA RI package pair? To the exact same versions inside those packages? We don't need to define this, I believe. When you create a SAFE EO Product Package it is associated to a fixed pair of RI packages and specific versions inside them. It will be a rare event - if anything - that this will change, based on ESA's actual experience.
If you would want to create EO Products of the same kind using different DFDL (either just because of syntax, or because there's a new product specification influencing the structure) and metadata, having the versions inside the package would be just a nice way to organize them, but you could also create new RI packages. In fact, as Stephan mentioned, if no limitations are imposed, at the limit one could do away with versioning altogether (I think not only of DFDL, but even of metadata). Whenever you would want to have a new version, you would create a new RI package.

- We also don't need to assume too much about what URNs point to. They should point to something but this can be a package (e.g. RI or Auxiliary), a file or a directory. What we need to impose is that the mapper leads you to the physical location of a package based on a part of the URN (the mapper entry does not even need to include the whole URN). In fact, I liked the original proposal of having optional parts of the URN, even if with the current one I think you can still achieve the same goals. Taking a brief look at the forbidden characters you mentioned, I believe you could also save one of the simple translation steps by redefining the URNs to something like < mission_identifier >:< package_name >:< category >_< version >_< file_name > . That is, with "_" instead of ":" for the last two separators. You could use this directly in the @ID of the dataObject. I think you have almost complete freedom in terms of the NSS part of the URN.

- I think there was a misunderstanding on GMV's side about the "void pointers" to auxiliary data. If the files are "known" (the only cases when you will have "void pointers"), these "void pointers" should actually allow avoiding re-transcription of the EO Products when the auxiliary files become physically available. The "void pointer" is on the mapper's record physical part (it will be null, whatever), I still expect the URN to be properly and fully defined even in the absence of the physical package. Once the package becomes available, you just update the mapper record with the physical location, you don't touch the EO Product. For unknown auxiliary files, there is no "void pointer", there is no pointer at all.

- I was also thinking that all URNs pertaining to schemas inside the same RI package map to the same physical package, and so, depending on the mapper implementation, this will be the same, redundant, entry. So you could pre-process/strip the URN before querying the mapper. I'm not sure if all of this is outside the scope of SAFE and in particular I would like to have a clearer view of what will be contained in the Core Specs and what instead in the Recommendation for Specializations, particularly in terms of the URN definition (maybe you define it partially in the Core and each Specialization defines the remainder of the URN, but to which exact extent?), the structure of the RI packages (only in the Core?) and the function of the mapper (we said that it is a black-box, but a black-box in a standard means you have to define exactly the input and output. Is this only relevant for the Core?). I think there are some dependencies on the points above to be clarified in order to be able to answer this.

Paulo


Re: Re: Re: Re: SAFE Design

Thank you very much for all your comments clarifying the SAFE Design, they have been very much appreciated. Please, find below our summary (henceforth “new approach”) regarding the open discussion in this thread:


1. Package directory structure.

The internal files of the SAFE EO Product/EO Auxiliary/EO Collection Packages will be organised according to the following structure:

< package_name > 
      < manifest_file > 
      “measurement”  
            < measurement_file(s) > 
      “metadata” 
            < metadata_file(s) > 
      “index”  
            < index_file(s) > 


The “measurement” directory is not used in EO Collection Packages, but it is mandatory for SAFE EO Product and EO Auxiliary Packages.
The optional “index” directory will only be part of the SAFE EO Product Packages.

Notation: The naming criteria for text between < > (e.g. manifest_file name) will be described in the Core Specification document. On the other hand, the text between “ “ is proposed to be fixed.


2. Versioning.

An arbitrary "Representation Information version number" directory (e.g. with the version numbers 1.0, 1.1, 1.2, etc.) will be used for the Representation Information Packages and, even if only one of the schemas is updated, the new version directory should include a copy of all the unchanged "previous" schemas. That is, all the files which are necessary for one version need to be kept together, even if some of them actually do not change.
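As a minimal sketch of this rule (not part of the specification), a new version directory could be produced by copying the complete previous version and then overwriting only the updated schemas. The function name, the file contents and the use of a temporary directory are assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def create_version(category_dir, previous, new, updated_files):
    """Create <category_dir>/<new> from <previous> plus the updated schemas."""
    category_dir = Path(category_dir)
    new_dir = category_dir / new
    # Copy every schema of the previous version so the new one is self-consistent.
    shutil.copytree(category_dir / previous, new_dir)
    # Overwrite only the schemas that actually changed.
    for name, content in updated_files.items():
        (new_dir / name).write_text(content)
    return new_dir


with tempfile.TemporaryDirectory() as tmp:
    metadata = Path(tmp) / "metadata"
    (metadata / "1.0").mkdir(parents=True)
    for name in ("safe-eop.xsd", "safe-opt.xsd", "safe-atm.xsd"):
        (metadata / "1.0" / name).write_text("1.0 " + name)   # placeholder content

    create_version(metadata, "1.0", "1.1", {"safe-eop.xsd": "1.1 safe-eop.xsd"})

    listing = sorted(p.name for p in (metadata / "1.1").iterdir())
    eop_new = (metadata / "1.1" / "safe-eop.xsd").read_text()
    opt_new = (metadata / "1.1" / "safe-opt.xsd").read_text()

print(listing)
```

Only safe-eop.xsd differs between the two directories; the other two schemas are plain copies kept for consistency.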

The internal files of the SAFE Representation Information Packages will be organised according to the following structure:

< package_name > 
      < manifest_file > 
      “safe-xfdu” 
            [ version ] 
                  < file_name > 
      “conformance” 
            [ version ] 
                  < file_name > 
      “measurement” 
            [ version ] 
                  < file_name > 
      “metadata” 
            [ version ] 
                  < file_name > 
      “index” 
            [ version ] 
                  < file_name > 


Note that not all the directories will exist in all SAFE Package types. Thus:

  • For Base Schemas Packages: All the directories (i.e. “safe-xfdu”, “conformance”, “measurement”, “metadata” and “index”) are mandatory.
  • For Binary Schemas Packages: Only the “measurement” and “index” directories are mandatory, the other directories are not used in these packages.
  • For Metadata Schemas Packages: the “safe-xfdu”, “conformance” and “metadata” directories are mandatory, the other directories are not used in these packages.
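The three rules above could be checked mechanically, for instance as in the following sketch. The package type keys ("base", "binary", "metadata") and the function name are hypothetical; only the directory names come from the list above.

```python
# Mandatory directories per Representation Information package type, as listed above.
MANDATORY_DIRS = {
    "base": {"safe-xfdu", "conformance", "measurement", "metadata", "index"},
    "binary": {"measurement", "index"},
    "metadata": {"safe-xfdu", "conformance", "metadata"},
}


def check_ri_package(package_type, present_dirs):
    """Return (missing, unexpected) directory names for an RI package."""
    mandatory = MANDATORY_DIRS[package_type]
    present = set(present_dirs)
    # Directories that are not mandatory for a type are "not used" in it.
    return sorted(mandatory - present), sorted(present - mandatory)


print(check_ri_package("binary", ["measurement", "index"]))
print(check_ri_package("metadata", ["safe-xfdu", "metadata", "index"]))
```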


The files included in each directory are the following:

  • The “safe-xfdu” directory will contain the Specialised XFDU Schemas.
  • The “conformance” directory will contain the Specialised Conformance Schemas.
  • The “measurement” directory will contain the Specialised DFDL Schemas and other XML Schemas (for those files that cannot be represented with DFDL) for the EO Product or Auxiliary files.
  • The “metadata” directory will contain the Specialised OGC EOP O&M Schemas and Auxiliary Schemas.
  • The “index” directory will contain the Specialised DFDL Schemas for the index files.
  • The Base Schema Package is a particular package where the base Schemas and SAFE base Schemas will be added to each directory accordingly.


Notation: The naming criteria for text between < > will be described in the Core Specification document and the structure proposed will be part of the SAFE Standard. On the other hand, the text between “ “ is proposed to be fixed and the text between [ ] will change depending on the version.

As the directory structure with version numbers will be adopted, the timestamp described in the TN approach will not be needed.

With respect to the DFDL, as Paulo says, “If a DFDL, maybe in a newer syntax, represents the same structure, then it should be in the same RI package in a new version directory. If it is a new structure, then it should be in a different RI package as a new kind of product." We fully agree.


3. Logical Identifier.

Taking into account all your comments, the URN for the external files will be as follows:

urn:x-safe:< mission_identifier >:< package_name >:< category >_< version >_< file_id >

Where:

  • < category > can be “safe-xfdu”, “conformance”, “measurement”, “metadata” or “index”.
  • < version > is the version directory seen above.
  • < file_id > is a "logical" identifier that uniquely identifies the file; it is proposed to use the name of the file without extension.
  • The indicates the optional part in the URN format.


Therefore, the physical location of a file inside a SAFE Representation Information Package won’t be resolved using the URN/mapper alone. Instead, the location provided by the mapper will be used to open the manifest file of the SAFE Representation Information Package, to identify the corresponding @ID deduced from the URN and to obtain the location of the schema file inside the package. That is, the URN is decoupled from the physical location of the schema file inside the package.

An example of a URN is as follows:

urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD:metadata_1.0_safe-atm-spec


In this way, the @ID attribute of the dataObject matches with this part of the URN. Please, see the following XML snippet:

 <dataObjectSection>
       <dataObject ID="metadata_1.0_safe-atm-spec" repID="safeATMBaseSchema">
          <byteStream mimeType="application/octet-stream">
             <fileLocation locatorType="url" textInfo="Metadata Schema Data"
                           href="metadata/1.0/safe-atm-spec.xsd"/>
             <checksum checksumName="MD5">93d1daca226ba957472c567a7fc33427</checksum>
          </byteStream>
       </dataObject>
 </dataObjectSection>


This URN format would save one additional step in the resolver translation process because the < category >_< version >_< file_id > from the URN matches with the @ID of the dataObject in the Manifest file.

Summarizing, the steps to access the schema file inside the package will be as follows:

  • Step 0. The URN is read from the Manifest e.g. EO Product Package.
  • Step 1. Locate the Representation Information Package where the external file is contained via the URN-physical mapper.
  • Step 2. Retrieve the Representation Information Package because you already known the physical location of this package via the mapper.
  • Step 3. Access the Manifest file of the Representation Information Package to find the physical reference of the schema file using the relationship between the URN and @ID of the dataObject as has been commented above.
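Steps 0 to 3 could be sketched as follows. This is only an illustration under several assumptions: the mapper is modelled as a dictionary, package retrieval is replaced by a caller-supplied function returning the Manifest text (the archive may well hide the real physical location behind its own API), and the Manifest fragment omits the XML namespaces a real XFDU Manifest would carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapper: package part of the URN -> physical package location.
MAPPER = {"urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD": "/archive/ri/EN01_ASA_WS_0P_MTD"}

# Hypothetical stand-in for "retrieve the package and read its Manifest".
MANIFESTS = {
    "/archive/ri/EN01_ASA_WS_0P_MTD": """<dataObjectSection>
  <dataObject ID="metadata_1.0_safe-atm-spec" repID="safeATMBaseSchema">
    <byteStream mimeType="application/octet-stream">
      <fileLocation locatorType="url" textInfo="Metadata Schema Data"
                    href="metadata/1.0/safe-atm-spec.xsd"/>
      <checksum checksumName="MD5">93d1daca226ba957472c567a7fc33427</checksum>
    </byteStream>
  </dataObject>
</dataObjectSection>""",
}


def resolve(urn, mapper, read_manifest):
    """Return (package location, href inside the package) for a file URN."""
    # Step 0: the URN has been read from the referencing Manifest.
    fields = urn.split(":")
    if len(fields) != 5 or fields[:2] != ["urn", "x-safe"]:
        raise ValueError(f"not a SAFE file URN: {urn!r}")
    package_urn, object_id = ":".join(fields[:4]), fields[4]
    # Step 1: locate the package via the URN-physical mapper.
    location = mapper[package_urn]
    # Step 2: retrieve the package (here: only its Manifest text).
    root = ET.fromstring(read_manifest(location))
    # Step 3: find the dataObject whose @ID equals the last part of the URN.
    for data_object in root.iter("dataObject"):
        if data_object.get("ID") == object_id:
            file_location = data_object.find("byteStream/fileLocation")
            if file_location is not None:
                return location, file_location.get("href")
    raise KeyError(object_id)


URN = "urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD:metadata_1.0_safe-atm-spec"
print(resolve(URN, MAPPER, MANIFESTS.__getitem__))
```

The href is read from the Manifest rather than built from the URN, so files can be moved or renamed inside the package without changing the URN.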



4. URN Mappers.

According to what has been discussed and the new proposal for the URN specification (URNs reference files instead of packages), only a URN-Physical mapper will be used.
All the URN references to the EO Auxiliary files should be specifically written in the Manifest of each EO Product Package.
Therefore, there are three possible cases:

  • The EO Auxiliary file is known and the EO Auxiliary file exists physically in the archive.
    • Action: In this case the URN could be created without problems.
  • The EO Auxiliary file is known and the EO Auxiliary file does not exist physically in the archive.
    • Action: In this case the URN could be created. However, the physical location in the mapper cannot be set. This is the “void pointer” case in the mapper as you said.
  • The EO Auxiliary file is not known. In this case the URN cannot be created.
    • Action: Only when the “unknown” auxiliary file becomes available, the EO Product Package would be updated (re-transcribed).
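The three cases could be modelled as in the sketch below, where a mapper record with no physical location plays the role of the “void pointer”. The class, its method names, the auxiliary package URN and the archive path are all hypothetical, since the mapper implementation is out of the SAFE scope.

```python
class Mapper:
    """Toy URN-Physical mapper: package URN -> physical location (or None)."""

    def __init__(self):
        self._locations = {}

    def register(self, package_urn, location=None):
        # location=None records a known but not yet available package ("void pointer").
        self._locations[package_urn] = location

    def resolve(self, package_urn):
        # An unknown auxiliary file has no record at all, so this raises KeyError.
        return self._locations[package_urn]


mapper = Mapper()
aux_urn = "urn:x-safe:ENVISAT:EXAMPLE_AUX_PACKAGE"   # hypothetical package URN

# Case 2: the auxiliary file is known but not yet in the archive.
mapper.register(aux_urn)
before = mapper.resolve(aux_urn)

# The package becomes available: only the mapper record is updated,
# the EO Product Package that carries the URN is not touched.
mapper.register(aux_urn, "/archive/aux/EXAMPLE_AUX_PACKAGE")
after = mapper.resolve(aux_urn)

print(before, after)
```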



5. Other issues.

With respect to the redundancy issue raised by Paulo (all the URNs of the files contained in the same Representation Information Package will have the same physical package mapping): yes, we had seen this issue of redundant entries in the mapper, but we did not raise it in the forum because for us this “pre-process/strip” step is more a question of how the mapper is optimized and is therefore outside the scope of SAFE.

And with respect to what will be contained in the Core Specs, our idea is:

  • Define the URN format in the Core Specification document. This will be included in a “linkage mechanism” section in the Core Specification clarifying and including the above points ("3. Logical Identifier" and "4. URN Mapper").
  • Define the SAFE Packages’ structure in the Core Specification document following the proposed package directory structure and versioning ("1.Package directory structure" and "2. Versioning").
  • Define the mapper as a black box; this will be done in an Annex of the Core Specification document.


We hope we have clarified all previous issues.

GMV Team.


Re: Re: Re: Re: Re: SAFE Design

Dear GMV Team,

this version looks fine for me. Only one point left to clarify:
> 5. Other issues.
> With respect to the redundancy issue ...
I don't see the redundancy issue anymore if the mapper only maps the package URN to its physical location.
In my understanding e.g. for
urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD:metadata_1.0_safe-atm-spec
a mapping would only be created from
urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD to its physical location.
But I also see this as a mapper implementation issue. The important thing is that the core spec says where the package part of a URN ends and the ID part used in the manifest begins (what your proposal does).
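A tiny sketch of that boundary, assuming the package part is always the first four colon-separated fields (the second URN below is invented for the example):

```python
def mapper_key(urn):
    """Keep only urn:x-safe:<mission_identifier>:<package_name>."""
    return ":".join(urn.split(":")[:4])


file_urns = [
    "urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD:metadata_1.0_safe-atm-spec",
    "urn:x-safe:ENVISAT:EN01_ASA_WS_0P_MTD:metadata_1.0_safe-eop-spec",  # hypothetical
]

# Both file URNs collapse to a single mapper key, so no redundant entries are needed.
keys = {mapper_key(urn) for urn in file_urns}
print(keys)
```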

Thanks and kind regards,
Bernhard


Re: Re: Re: Re: Re: SAFE Design

This is fine for me.

As Bernhard said, the Core Specs should say where the package part of the URN ends and which part of the URN you use as ID in the Manifest of the RI packages. Then, I think in the Recommendation for Specializations you should say that each specialisation has to define the content of the " < mission_identifier >:< package_name > " part of the URNs, because this is what will be specialisation-specific. Maybe you can give some guidelines on how to assign these names.

Paulo



Re: SAFE Design

Fine for EUMETSAT.

Though, I have a few (minor) editorial comments to the description of the steps in your section 3 Logical Identifiers:
a) This applies not only to schemas and RI packages, but as well to anything externalized from any package (e.g. aux data wrt. EO data).
b) "because you already known the physical location". Depends... :-) I am picky here, but the real physical location might be still hidden through the API of the archive system.
--
Stephan


Re: Re: SAFE Design

Thank you very much again for all your comments clarifying the SAFE Design. Please, find attached a zip file including three Manifest example files following the SAFE Design agreed in this thread. These files are:

  • Manifest_example_eo_product.xml. Manifest file of a SAFE EO Product Package.
  • Manifest_example_ri_bin.xml. Manifest file of a SAFE BIN Representation Information Package. Note that the DFDL name has been replaced by BIN in the Package's name.
  • Manifest_example_ri_metadata.xml. Manifest file of a SAFE Metadata Representation Information Schemas Package.


Please note that the names used in the logical identifiers and files do not correspond to a real product. However, we think that these examples clarify all the ideas concluded in this thread.

All the comments are welcome. Thank you very much for all your effort.

GMV Team.




The original document is available at https://wiki.services.eoportal.org/tiki-view_forum_thread.php?comments_parentId=1512&topics_offset=1&display=&fullscreen=&PHPSESSID=