
I am capturing lessons learned about data I/O from a data analyst's perspective, without the benefit of data engineering expertise (and being quite explicit about that shortcoming). To give context to the various alternatives, given the constraints within my shop, I've experimented briefly with XML import/export and read online about schemas. One thing I noticed about an open source utility for a 4th generation language environment is that it seems to use a default schema (I haven't specified one):

<?xml version="1.0" encoding="utf-8"?>
<y>
   <DataFrame1>
      <DataFrame1_Field1>[75;75;75;75;75;75;75;75;75;...;75;75]</DataFrame1_Field1>
      <DataFrame1_Field2>[2014;2014;2015;2015;2016;2016;...;2083;2084;2084;2085;2085;2086;2086]</DataFrame1_Field2>
      <DataFrame1_Field3>
         <item>ABC</item>
         <item>DEF</item>
      <...snip...>
         <item>00-00</item>
         <item>00-00</item>
         <item>00-00</item>
      </DataFrameP_FieldM>
      <DataFrameP_FieldN>[2;2;4;2;5;3;5;3;3;1;5;5;...;4;5;3;3;2;4;2;1;2;4]</DataFrameP_FieldN>
   </DataFrameQ>
   <DataFrameR>
      <DataFrameR_Field1>[75;75;75;75;75;75;...;75;75;75;75;75]</DataFrameR_Field1>
      <DataFrameR_Field2>[1;2;3;4;5;6;7;...;1638;1639;1640;1641;1642]</DataFrameR_Field2>
      <DataFrameR_Field3>[0;0;0;0;0;0.014925;0.223881;0.014925;...;0;0.059701;0;0;0;0;0;0;0.626866]</DataFrameR_Field3>
   </DataFrameR>
   <DataFrameS>
      <DataFrameS_Field1>[75;75;75;75;75;75;...;75;75;75;75;75;75;75]</DataFrameS_Field1>
      <DataFrameS_Field2>[1;1;1;1;1;1;1;...;1642;1642;1642;1642;1642]</DataFrameS_Field2>
      <DataFrameS_Field3>[0;0;0;0;0;0;0;0;...;7;0.7;0.7;0.8;0.8;0.8;0.9;0.9;1]</DataFrameS_Field3>
      <DataFrameS_Field4>[0;0.1;0.2;...;0;0.1;0.2;0;0.1;0]</DataFrameS_Field4>
      <DataFrameS_Field5>[1;0.9;0.8;...;0.3;0.2;0.1;0;0.2;0.1;0;0.1;0;0]</DataFrameS_Field5>
      <DataFrameS_Field6>[0;0;0;0;0;0;...1;1;1;1;1;1;1;1;1;1]</DataFrameS_Field6>
   </DataFrameS>
</y>

Interpreting the labels: All labels starting with the string "DataFrame..." are anonymizations I made in the code. Before anonymization, DataFrameX (where X is any alphanumeric character) was the name of a data frame object in my 4GL environment [1]. All labels containing the strings "DataFrame" and "Field" are also anonymizations. Before anonymization, they were the names of fields within data frames. The label <y> is just the object name of the collection of data frames in the 4GL environment.

The arrangement of the data all makes sense to me, knowing what I do about the data frames from which the data come. All the tagging makes sense. I assumed that the tags come from a generic default schema. However, my web searching has not revealed any indication that such a default schema exists, much less that one has been agreed upon or standardized. Is there such a generic default, or are these tags the invention of the export utility's author?

[1] The 4GL environment is Matlab, but my question is about XML practices & conventions rather than Matlab.

user2153235
1 Answer


There is no default XML schema for an arbitrary XML file. There are rules of well-formedness given by the W3C XML Recommendation, but those define XML itself rather than the vocabulary and grammar of any given XML schema.
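
As a rough sketch of that distinction (Python's standard `xml.etree.ElementTree` is my choice of tool here, and both sample strings are invented for illustration): a parser with no schema checks well-formedness only, so any consistently nested tag names are accepted.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents, invented for illustration only.
well_formed = "<y><DataFrame1><Field1>[75;75]</Field1></DataFrame1></y>"
malformed = "<y><DataFrame1><Field1>[75;75]</DataFrame1></Field1></y>"


def is_well_formed(text):
    """Return True if the text parses as XML; no schema is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(well_formed))  # correctly nested tags
print(is_well_formed(malformed))    # closing tags in the wrong order
```

Nothing in this check knows or cares what `y`, `DataFrame1`, or `Field1` mean; assigning meaning to those names is the job of a schema or of the application.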

Identifying an XSD when none is specified

  1. When schemaLocation is specified in the XML, see the XSD specified there. For more on schemaLocation, see How to link XML to XSD using schemaLocation or noNamespaceSchemaLocation?
  2. When only a namespace is used, see How to locate an XML Schema (XSD) by namespace?
  3. When the provider of the XML is available, ask or inspect the source/documentation.
  4. When relatively unique/informative element names are used, or if you know the sector/industry, search the web for the element names or the sector/industry plus "xml schema".

If none of the above works, go schema-less, or write your own schema to fit the data.
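
The first two checks in the list can be sketched in a few lines. This is only an illustration under assumptions: it uses Python's standard `xml.etree.ElementTree`, the helper name `schema_hints` is hypothetical, and the file name `frames.xsd` in the sample is invented.

```python
import xml.etree.ElementTree as ET

# Namespace that the xsi:schemaLocation attributes belong to.
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_hints(xml_text):
    """Collect whatever the root element says about its schema (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    tag = root.tag
    # ElementTree writes namespaced tags as "{namespace-uri}localname".
    namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else None
    return {
        "schemaLocation": root.get(f"{{{XSI}}}schemaLocation"),
        "noNamespaceSchemaLocation": root.get(f"{{{XSI}}}noNamespaceSchemaLocation"),
        "namespace": namespace,
    }


# Invented samples: one names an XSD, one (like the XML in the question) does not.
with_hint = (
    '<y xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="frames.xsd"><a/></y>'
)
bare = "<y><a/></y>"

print(schema_hints(with_hint))
print(schema_hints(bare))
```

When every value comes back as `None`, as it would for the XML in the question, the document itself gives no pointer to a schema and you are left with items 3 and 4.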


More on XML design

In the comments, @user2153235 asks:

Is there a prevailing practice (or even a universal, minimal "base" scheme that is defaulted to in the absence of an explicit schema) wherein the atomic element is "item", and any other tag represents an element that is either a string or a structure composed of subordinate elements?

Yes, there is a prevailing practice for designing good XML.

Answer to the question: No, there is no universal, minimal "base" schema – just the rules of well-formedness for XML itself.

The XML in your post is poorly designed:

  • Naming is terrible:
    • The root element is named y, yet the content is clearly not a simple y-coordinate or anything else that could reasonably be described as y.
    • DataFrame-based names have C character suffixes followed by _FieldN numeric suffixes. Unless the C character is meaningful in some domain, the abbreviation ought to be expanded. Hard-wired numerical suffixes on list members are better left implied by position so that the name can lexically signal type without having to be decomposed.
  • Substructure is left without markup: Generally, structure shouldn't be buried in micro-formats within strings; markup should be imposed so that the XML parser can be leveraged rather than having to implement micro-parsers within an application.
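
To illustrate the last point, here is a minimal sketch in Python (standard `xml.etree.ElementTree`). The two fragments are invented, modeled loosely on the XML in the question, and both function names are hypothetical. The first function is the kind of micro-parser an application must carry when a list is packed into a string; the second lets the XML parser do the splitting.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments modeled on the post; names and values are invented.
packed = "<Field1>[2014;2014;2015]</Field1>"
marked_up = "<Field1><item>2014</item><item>2014</item><item>2015</item></Field1>"


def parse_packed(element):
    """Application-level micro-parser for the '[a;b;c]' string format."""
    text = (element.text or "").strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"not a bracketed list: {text!r}")
    inner = text[1:-1]
    return [int(part) for part in inner.split(";") if part.strip()]


def parse_marked_up(element):
    """The XML parser has already separated the members; just convert them."""
    return [int(item.text) for item in element.findall("item")]


print(parse_packed(ET.fromstring(packed)))
print(parse_marked_up(ET.fromstring(marked_up)))
```

Both calls yield the same list, but only the second form can be described by a schema and handled by generic XML tools; the first depends on every consumer knowing the bracket-and-semicolon convention.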
user2153235
kjhughes
  • Thanks, kjhughes. The utility is at the [Matlab file exchange](http://www.mathworks.com/matlabcentral/fileexchange/12907-xml_io_tools) and the code for the `xml_write` function is [here](https://viewer.mathworks.com/?viewer=plain_code&url=https%3A%2F%2Fwww.mathworks.com%2Fmatlabcentral%2Fmlc-downloads%2Fdownloads%2Fsubmissions%2F12907%2Fversions%2F14%2Fcontents%2Fxml_write.m). – user2153235 Dec 13 '19 at 01:16
  • But I wasn't really trying to reverse engineer the function nor find a solution for when I don't have a schema. I was just wondering whether there is a convention (de facto or explicit) about the tags used for nested objects when there is no schema. Can I conclude from your answer that it is "no"? If so, I'm wondering if you can add that to your answer. Thanks. – user2153235 Dec 13 '19 at 01:16
  • Hmm, could you elaborate on what you mean by *convention (de facto or explicit) about the tags used for nested objects when there is no schema*? Are you talking about naming conventions? When to use elements vs attributes? Something else? – kjhughes Dec 13 '19 at 01:25
  • My understanding of schema comes from [this tutorial](http://www.brainbell.com/tutorials/XML/TOC_Using_XML_Schema.htm) and the constituent lessons that follow. It resembles the definition of a hierarchical "struct[ure]" in a strongly typed language, i.e., a collection (ordered or not) of predefined constituent data elements. When you actually supply the data that corresponds to the schema, it must follow the structure and the elemental names/types. The tags in the actual data (the labels in the angle brackets) must match the names of elements in the schema. – user2153235 Dec 13 '19 at 04:14
  • In my original post, the atomic element seems to be either an "item" or a square bracketed array of numbers. Is there a prevailing practice (or even a universal, minimal "base" scheme that is defaulted to in the absence of an explicit schema) wherein the atomic element is "item", and any other tag represents an element that is either a string or a structure composed of subordinate elements? (The "item" element seems to be alphanumeric). – user2153235 Dec 13 '19 at 04:15
  • I've updated the answer to address the follow-up questions in your comments. – kjhughes Dec 13 '19 at 13:27
  • The tag names are the object names within the 4GL environment. I now explain this in the question. You said there is a prevailing practice for good XML but no default schema. It seems that **your comment about going schema-less or writing one to fit the data is the answer to my question?** This means that the XML importer needs to know the exporter logic to properly interpret the content between the tags. In this case, they seem to be either strings that are missing the adorning quotes or Matlab expressions that need to be interpreted by the Matlab interpreter. – user2153235 Dec 13 '19 at 16:25
  • My understanding of your exact question keeps shifting with each of your updates and follow-up comments, so I'm not sure I could say what my answer is at this point. Unless you have a very specific follow-up to bring this to a conclusion now, I'm going to have to move on. Good luck. – kjhughes Dec 13 '19 at 16:36
  • I think you can move on, kjhughes. Your answer is in "No, there is no universal, minimal "base" schema...". My updates were to explain the source of the tags, because it might change your auxiliary comment about poor naming. I have to websearch a bit more about substructures to understand the final bullet, as I haven't quite understood what they are from quick googling. Thanks for your answer! – user2153235 Dec 13 '19 at 20:50