Forum: LTDP SAFE

Representation information using XML Schema

According to the SRR outcomes, an analysis was required to assess the feasibility of an alternative approach to the binary data representation. This new approach would be based on instances of a common general XML schema, instead of the binary data representation using a single SDF schema that SAFE currently uses.

At this level, the first analysis to be done (before providing a solution) should be to confirm whether schema validation for the representation language is really needed for SAFE. In that sense, the attached document “Representation Information in XML Trade-off” (PDGS-SAFE-GMV-TN-12-0066) provides some conclusions that do not recommend this approach.

This is a summary of the conclusions (please have a look at the document for further details):

- As the binary/text file within a SAFE product to be represented is not an XML file, there is no need to assume an XML standard validation mechanism.

- Existing languages (e.g. SDF, DFDL, …) needed to specify the representation information schemas are W3C compliant (they are well-formed and make correct use of the annotation field).

- Syntactic validation for the representation information schema can be implemented with specific existing tools for the representation schemas described for example in DFDL or SDF (there is no need to have a general common schema).

- Semantic validation would still be needed for the binary/text files stored within a SAFE product.

- A common general XML schema won’t provide relevant added value and would have less expressive power than the languages already available.

- The current approach implemented by SAFE eases the conversion of the information contained in a file (binary or text) into a corresponding XML document (simplifying the implementation of APIs similar to DOM and SAX).

This discussion topic and the analysis results presented in the attached document are intended to help reach a consensus before the PDR-C, because the final solution may imply a change in the format design that has to be considered for the SAFE Core Specification update.

All your comments will be appreciated.


Adrián Sanz (GMV)
LTDP SAFE Project Manager


Re: Representation information using XML Schema

Detailed feedback has been provided by Stephan Zinke (EUMETSAT Panel member) to be used as an item for discussion on this topic.

The attached document provides a different point of view on the trade-off written by GMV.



Re: Representation information using XML Schema

Thank you for your memorandums, Stephan. I would like to give you our point of view on your comments:


Stephan's Comment:
>The question was in my understanding the following: Is it possible, instead of using xml schemas for the definition, to use instances of schemas, i.e. xml documents, for the definition.
>
>The advantage would be to be able to verify the validity of those xml instances by applying a generic schema.
>
>Moreover, it was assumed, that the validation of binary/text files could be easier by using xml instances (documents) instead of schemas. Ultimately, as the binary/text files are not XML, standard xml validation mechanism cannot be applied.

GMV's Answer:
The following passage from the “BinX Developer’s Guide” (BinX being the precursor of the DFDL language) may shed some light on this:

“Experiments suggest that a fully decoded and tagged XML representation of a complex binary file could take as much as four times the space of the binary original, partly because all binary data would need be rendered into a textual representation, and partly because of the need for textual markup to denote each individual occurrence of a data element or structure. Because the BinX language description of a binary file is separate from the original rather than being embedded within it, it is possible to avoid this potentially enormous overhead. Many common or repeated elements can be defined once rather than repeatedly. Furthermore the original binary data can remain unchanged in the binary file, supporting any existing data access required.”

(http://www.edikt.org.uk/binx/docs/BinXDevGuide.pdf)
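To make the overhead described in the quote concrete, here is a minimal Python sketch that renders a small binary file as fully tagged XML and compares the sizes. The record layout (one 16-bit counter and three 32-bit floats) and the element names are assumptions made up for illustration; they are not taken from SAFE, SDF or BinX, and the resulting ratio depends entirely on the chosen layout and tag names.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical record layout (not from SAFE or BinX):
# one 16-bit unsigned counter followed by three 32-bit floats, big-endian.
RECORD = struct.Struct(">H3f")


def to_tagged_xml(blob: bytes) -> bytes:
    """Decode every record and wrap every single value in its own element."""
    root = ET.Element("dataset")
    for counter, x, y, z in RECORD.iter_unpack(blob):
        rec = ET.SubElement(root, "record")
        ET.SubElement(rec, "counter").text = str(counter)
        for name, value in (("x", x), ("y", y), ("z", z)):
            ET.SubElement(rec, name).text = repr(value)
    return ET.tostring(root, encoding="utf-8")


# Build 1000 records of sample data and compare the two representations.
binary = b"".join(RECORD.pack(i, i * 0.5, i * 1.5, i * 2.5) for i in range(1000))
tagged = to_tagged_xml(binary)
print(len(binary), len(tagged), round(len(tagged) / len(binary), 1))
```

Keeping the description separate from the data, as SDF and DFDL do, avoids paying this markup cost for every stored value.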

Without going into further analysis right now: is it really necessary to validate a data object stored in an AIP? Can’t we assume that this validation is performed on the producer side?

In our understanding, what is really needed is validation of the XML Schemas to ensure that the representation information is syntactically correct according to the language rules. In that sense, Schematron (standardized as ISO/IEC 19757-3:2006) can be used to validate the current SAFE schemas.


Stephan's Comment:
>Additionally, it was assumed that it might be easier to maintain a set of instances of XML documents rather than a set of xml schemas.
>
>Currently, for each new definition (product type), a new schema need to be defined, and the products are then instances of that definition.

GMV's Answer:
Is it really easier to maintain a combination of “general XML Schema + XML documents” rather than a set of “XML schemas” to represent binary data?

With your approach, a new definition (product type) would require a new XML document as well because, as we mentioned in our trade-off, the general XML schema would be very general. So we don’t see a clear advantage in this.


Adrián Sanz (GMV)
LTDP SAFE Project Manager


Re: Re: Representation information using XML Schema

Thanks, Adrian, for your feedback on my comments.
Some more comments below:

> GMV's Answer:
> Probably the following piece of text extracted from the “BinX Developer’s Guide“ which is the precursor of DFDL language , could throw some light on this:
>
> “Experiments suggest that a fully decoded and tagged XML representation of a complex binary file could take as much as four times the space of the binary original, partly because all binary data would need be rendered into a textual representation, and partly because of the need for textual markup to denote each individual occurrence of a data element or structure. Because the BinX language description of a binary file is separate from the original rather than being embedded within it, it is possible to avoid this potentially enormous overhead. Many common or repeated elements can be defined once rather than repeatedly. Furthermore the original binary data can remain unchanged in the binary file, supporting any existing data access required.“
>
> (http://www.edikt.org.uk/binx/docs/BinXDevGuide.pdf)
>

In my understanding it was never a question of whether the representation information is embedded or not.


> Without going in further analysis right now, is it really needed to validate a data object stored in an AIP? Can’t we assume that this validation is performed at the producer side?

In my understanding it is a SAFE requirement (of the toolbox) to be able to validate the content of the AIP.

> In our understanding, what it is really needed is a validation on the XML Schemas to assure that the representing information is syntactically correct according the language rules. In that sense, Schematron (as standard ISO/IEC 19757-3:2006 ) can be used to validate the current SAFE schemas.

Correct, that was my understanding as well. And I was under the assumption that it is easier to validate an XML document than an XML schema. But ultimately, schemas also follow rules and can be validated, as they are XML documents themselves.
The validation of an XML document can be done against a specific schema, while IMO a schema can be validated only against the specific schema rules (if using schema validation and not other methods, like Schematron).


> GMV's Answer:
> Is it really easier to maintain a combination of “general XML Schema+XML documents” rather than a set of “XML schemas” to represent Binary data?
>
> With your approach, a new definition (product type) would require a new XML document as well because as we mentioned in our trade-off, the general XML schema would be very general. So we don’t see a clear advantage on this.

This needs further discussion, IMO.


Nevertheless, the outcome of the other studies seems to suggest DFDL, which is based on schemas :-), so maybe a further discussion on this topic is superseded by the other...

Stephan Zinke, for EUMETSAT



Re: Representation information using XML Schema

>In my understanding it was never a question if the representation information is embedded or not.

I agree, this is not a question of whether or not the representation information is embedded; my answer was meant to draw your attention to the complexity of the XML document needed to represent a binary file.
In my opinion, such an XML document would be harder to maintain than a simple XML schema (e.g. DFDL), and there are other studies supporting this view, for example “Data Format Description Language: Lessons Learned” (Robert E. McGrath; NCSA; September 2011):

“It is conceptually possible to map almost any data structure to an equivalent XML structure, and to define an XML schema to define valid XML that can be translated to a given non-XML format.
However, developing these mappings and related software often is a very labor-intensive process, and maintaining a plethora of readers, each useful for a small set of cases, is difficult and may ultimately be unsustainable.

These efforts demonstrated the concepts, and the implementations were successful within limited uses.
From its inception, the DFDL working group sought to generalize and improve these efforts.”


>In my understanding it is a SAFE requirement (of the toolbox) to be able to validate the content of the AIP.

I was just wondering whether this requirement is really needed (probably yes, but I would like to hear other opinions).



>Correct, that was my understanding as well. And I was under the assumption that it is easier to validate an XML document rather than an XML schema. But ultimately, as well schemas follow rules and can be validated as they are XML documents themselves.
>The validation of an XML document can be done against a specific schema, while IMO a schema can be validated only against the specific schema rules (if using schema validation and not other methods, like schematron).

In my opinion, the benefit of having a general XML Schema plus a complex XML document, just to ensure standard W3C validation, comes at the cost of maintainability (these complex XML documents would have to be created for each product type).

With the schema approach (SDF/DFDL) it is possible to develop a general parser (or reuse the already implemented ones) to create an XML document from a schema. Thus the representation could also be validated using standard validation.
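As a rough illustration of such a general parser, the Python sketch below reads the field list from a schema and uses it to turn a binary record into an XML document. Everything in it is an assumption for illustration: the schema is a heavily simplified plain XSD (real SDF/DFDL schemas carry the binary layout in annotations, which are not modelled here), the element names are invented, and the mapping from XSD types to big-endian binary fields is hard-coded. It is not how DRB or any DFDL processor is implemented.

```python
import struct
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical, heavily simplified description: plain XSD element
# declarations only. Real SDF/DFDL schemas describe the binary layout
# in annotations, which this sketch does not model.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="header">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orbit" type="xs:unsignedShort"/>
        <xs:element name="latitude" type="xs:float"/>
        <xs:element name="longitude" type="xs:float"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Assumed mapping from XSD simple types (matched as literal strings,
# prefix included) to big-endian struct format codes.
CODES = {"xs:unsignedShort": "H", "xs:float": "f", "xs:int": "i"}


def parse_with_schema(schema_text: str, blob: bytes) -> ET.Element:
    """Generic reader: the field list comes from the schema, not from code."""
    schema = ET.fromstring(schema_text)
    top = schema.find(f"{XS}element")
    fields = top.findall(f".//{XS}sequence/{XS}element")
    layout = struct.Struct(">" + "".join(CODES[f.get("type")] for f in fields))
    root = ET.Element(top.get("name"))
    for field, value in zip(fields, layout.unpack(blob)):
        ET.SubElement(root, field.get("name")).text = str(value)
    return root


doc = parse_with_schema(SCHEMA, struct.pack(">Hff", 4242, 45.5, -3.25))
print(ET.tostring(doc, encoding="unicode"))
```

Supporting a new product type then means writing a new schema only; the reader stays unchanged, and the XML document it produces is an ordinary instance of that schema.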



>This needs further discussion, IMO.
>Nevertheless, the outcome of the other studies seems to suggest DFDL, which is based on schemas :-), so maybe a further discussion on this topic is superseded by the other...

I think this forum is the best place to continue the discussion, so feel free to include your point of view (well... this is not only for you :-) )

Thank you for your valuable comments, Stephan!


Adrián Sanz (GMV)
LTDP SAFE Project Manager



Re: Representation information using XML Schema

Stephan's Comment:
>>In my understanding it is a SAFE requirement (of the toolbox) to be able to validate the content of the AIP.

GMV's Answer:
>I just was wondering if this requirement is really needed (probably yes, but would like to hear other opinions)

Riccardo's Opinion:
I strongly agree with Stephan's comment, so I believe that the validation requirement is needed.

In agreement with the clear analysis made in PDGS-SAFE-GMV-TN-12-0066_Rep_Info_XML.pdf, I believe the Data Format Description Language (DFDL) is the best alternative for SAFE.
At this point it is not completely clear to me whether and how DFDL is compatible with SAFE's capability to validate the content of the AIP.

Looking forward to seeing more opinions ;-)



Re: Representation information using XML Schema

Thank you for your point of view, Riccardo.

At this point, I would like to point out that the current language (SDF) provides this validation capability through a specific tool (parser) called DRB (Data Request Broker). Considering that DFDL is very similar to SDF, it is possible to develop a tool that provides at least the same validation capabilities that DRB provides today.

So I think that validation is an implementation issue rather than a question of potential capability.


Adrián Sanz (GMV)
LTDP SAFE Project Manager



Re: Representation information using XML Schema

I agree with Stephan that the relevant question for the trade-off was not correctly understood initially. In fact, the discussion in this forum is more useful and focused on the real issue than the trade-off document as it is.

Picking up from my comment in the other thread, the idea for this trade-off came from a RID from Dominic Lowe and was "only" about whether instead of having new schemas for each new product type, we could have a very generic schema able to represent all possible product types (quite challenging and, I agree, of dubious added value because it would have to be really generic) and then instances of this generic XML schema for each new product type.

The validation aspect only comes into play after this, although it then has an impact on the decision. But the trade-off document should be reworked to focus on the decision to have N schemas or 1 common/generic schema and M XML instances with the representation information, and which of these approaches is better.

Maybe I'm wrong but I think in both cases you end up with solutions that you can only fully validate up to a certain point using regular XML Schema validation (not Schematron). In both cases you still have limitations and won't be able to fully validate. It could be that this is a non-issue, but again this just means that the fundamental trade-off is not concerned with the validation but rather with whether the representation information for a given product type is an XML schema (".xsd") or an XML instance (".xml").

On an unrelated note, there is a typo in the acronyms section. SAFE is "standard archive...", not "satellite archive...".

Paulo



Re: Representation information using XML Schema

I have one more comment on the XML vs. XSD:
One drawback of the XSD-annotations-schematron approach is - IMO - that the content of these definitions (i.e. the annotations) cannot be easily verified (if somebody knows this, please let me know). I.e., annotations will not be looked at when the validity of a schema is looked at (by generic schema-schema validation) - annotations provide in pure XSD additional information not further considered. Using those for different purposes is actually - IMO - a wild hack.
Thus, it might be a complex task to create the (correct) schemas themselves.
If I look at how Schematron works, it is also "weird" to embed the rules into the schema. Usually one would do this in separate Schematron rule files!
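For illustration, the difference between embedded and separate rules can be sketched in Python: the snippet below takes a schema with a Schematron pattern embedded under `xs:appinfo` and copies that pattern into a standalone Schematron document. The schema fragment, the element names and the rule itself are invented for this example and are not taken from the SAFE schemas; only the XML Schema and ISO Schematron namespaces are real.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
SCH = "http://purl.oclc.org/dsdl/schematron"  # ISO Schematron namespace

# Hypothetical schema fragment; element names and the rule are invented.
XSD = f"""
<xs:schema xmlns:xs="{XS}" xmlns:sch="{SCH}">
  <xs:element name="record">
    <xs:annotation>
      <xs:appinfo>
        <sch:pattern>
          <sch:rule context="record">
            <sch:assert test="count(field) &gt; 0">A record needs at least one field.</sch:assert>
          </sch:rule>
        </sch:pattern>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
</xs:schema>
"""


def extract_schematron(xsd_text: str) -> ET.Element:
    """Copy every sch:pattern found under xs:appinfo into a standalone sch:schema."""
    ET.register_namespace("sch", SCH)
    standalone = ET.Element(f"{{{SCH}}}schema")
    embedded = ET.fromstring(xsd_text).iterfind(f".//{{{XS}}}appinfo/{{{SCH}}}pattern")
    for pattern in embedded:
        standalone.append(pattern)
    return standalone


rules = extract_schematron(XSD)
print(ET.tostring(rules, encoding="unicode"))
```

The extracted document is what one would normally maintain as a separate rule file; whether the rules themselves are correct is still not something generic schema validation checks.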
It would probably be worth dedicating some more effort to the pros and cons of all this.
--
Stephan



Re: Representation information using XML Schema

Dear Adrian and Hector,

We agree with Stephan and Paulo that the real question for this trade-off is not well focused, and therefore the final approach is not clear.

In any case, regarding this issue, we have been discussing the possible implications of a common general schema, and it seems that the final file would become much more complicated than expected and would have little added value for SAFE. Moreover, for future missions it seems easier to have an XML schema for each product type rather than a very generic schema plus instances of this generic XML schema for each product type.

Another comment, in answer to one of your open questions: is it really necessary to validate a data object stored in an AIP? We think that it is important to ensure data integrity. Although the SAFE converters and I/O library shall be able to handle missing and corrupted elements, it would be useful to validate the AIP.

Maria




The original document is available at https://wiki.services.eoportal.org/tiki-view_forum_thread.php?comments_parentId=1163&topics_offset=8&topics_sort_mode=userName_asc&display=&fullscreen=&PHPSESSID=