I am capturing lessons learned about data I/O from a data analyst's perspective, without the benefit of data engineering expertise (and I am being quite explicit about that shortcoming). To give context to the various alternatives, and taking into account the constraints within my shop, I've experimented briefly with XML import/export and read online about schemas. One thing I noticed about an open source utility for a 4th generation language environment is that it seems to use a default schema (I haven't specified one):
<?xml version="1.0" encoding="utf-8"?>
<y>
<DataFrame1>
<DataFrame1_Field1>[75;75;75;75;75;75;75;75;75;...;75;75]</DataFrame1_Field1>
<DataFrame1_Field2>[2014;2014;2015;2015;2016;2016;...;2083;2084;2084;2085;2085;2086;2086]</DataFrame1_Field2>
<DataFrame1_Field3>
<item>ABC</item>
<item>DEF</item>
<...snip...>
<item>00-00</item>
<item>00-00</item>
<item>00-00</item>
</DataFrameP_FieldM>
<DataFrameP_FieldN>[2;2;4;2;5;3;5;3;3;1;5;5;...;4;5;3;3;2;4;2;1;2;4]</DataFrameP_FieldN>
</DataFrameQ>
<DataFrameR>
<DataFrameR_Field1>[75;75;75;75;75;75;...;75;75;75;75;75]</DataFrameR_Field1>
<DataFrameR_Field2>[1;2;3;4;5;6;7;...;1638;1639;1640;1641;1642]</DataFrameR_Field2>
<DataFrameR_Field3>[0;0;0;0;0;0.014925;0.223881;0.014925;...;0;0.059701;0;0;0;0;0;0;0.626866]</DataFrameR_Field3>
</DataFrameR>
<DataFrameS>
<DataFrameS_Field1>[75;75;75;75;75;75;...;75;75;75;75;75;75;75]</DataFrameS_Field1>
<DataFrameS_Field2>[1;1;1;1;1;1;1;...;1642;1642;1642;1642;1642]</DataFrameS_Field2>
<DataFrameS_Field3>[0;0;0;0;0;0;0;0;...;7;0.7;0.7;0.8;0.8;0.8;0.9;0.9;1]</DataFrameS_Field3>
<DataFrameS_Field4>[0;0.1;0.2;...;0;0.1;0.2;0;0.1;0]</DataFrameS_Field4>
<DataFrameS_Field5>[1;0.9;0.8;...;0.3;0.2;0.1;0;0.2;0.1;0;0.1;0;0]</DataFrameS_Field5>
<DataFrameS_Field6>[0;0;0;0;0;0;...1;1;1;1;1;1;1;1;1;1]</DataFrameS_Field6>
</DataFrameS>
</y>
Interpreting the labels: All labels starting with the string "DataFrame" are anonymizations I made in the code. Before anonymization, DataFrameX (where X is any alphanumeric character) was the name of a data frame object in my 4GL environment [1]. All labels containing both the strings "DataFrame" and "Field" are also anonymizations. Before anonymization, they were the names of fields within data frames. The label <y> is just the object name of the collection of data frames in the 4GL environment.
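To make my reading of the layout concrete, here is a minimal Python sketch of how I understand the structure: a root element holding one child per data frame, each holding one child per field, where numeric fields appear as a bracketed, semicolon-separated string and text fields appear as a list of <item> elements. The sample document, its names and values, and the helpers parse_field and parse_export are all made up for illustration; the encoding rules are only what I infer from the output above, not a documented format.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the export (names and values invented for illustration).
SAMPLE = """<y>
  <DataFrameA>
    <DataFrameA_Field1>[75;75;75]</DataFrameA_Field1>
    <DataFrameA_Field2>
      <item>ABC</item>
      <item>DEF</item>
    </DataFrameA_Field2>
  </DataFrameA>
</y>
"""


def parse_field(elem):
    """Decode one field element according to the layout inferred from the export."""
    # Text fields: a sequence of <item> children, one per row.
    items = elem.findall("item")
    if items:
        return [item.text for item in items]
    # Numeric fields: a single string such as "[1;2;3]".
    text = (elem.text or "").strip()
    if text.startswith("[") and text.endswith("]"):
        return [float(tok) for tok in text[1:-1].split(";") if tok]
    # Anything else is returned unchanged (assumption: treated as a scalar string).
    return text


def parse_export(xml_bytes):
    """Return {data frame name: {field name: list of values}}."""
    root = ET.fromstring(xml_bytes)
    return {
        frame.tag: {field.tag: parse_field(field) for field in frame}
        for frame in root
    }


frames = parse_export(SAMPLE.encode("utf-8"))
print(frames)
```

Nothing in this sketch relies on a schema; it only works because I already know which encoding each field uses, which is part of why I am asking where the convention comes from.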
The arrangement of the data all makes sense to me, knowing what I do about the data frames from which the data come. All the tagging makes sense too. I assumed that the tags come from a generic default schema. However, my web searching has not revealed any indication that such a default schema exists, much less that one has been agreed upon or standardized. Is there such a generic default, or are these tags simply the choice of the export utility's author?
[1] The 4GL environment is Matlab, but my question is about XML practices & conventions rather than Matlab.