2

I have a challenging situation. ECHA publishes the schema for their IUCLID product as zipped collections of xsd files. I want to be able to import data into SQL and use those xsd files to build the SQL tables.

The trouble is while they publish a structure showing an ERD style relationship layout - https://poisoncentres.echa.europa.eu/documents/1789887/5577602/pcn_format_data_model_en.pdf/d667afb6-a36b-4596-48dc-3b2de436d3de?t=1635233313478 - the xsd does not have any sign of those relationships.. (though I might simply be missing them)

The xsd are found in here - https://poisoncentres.echa.europa.eu/documents/1789887/10223884/PCN+Format_v4.0.zip/7d7641c0-facc-898c-bffd-45b080bfdb75?t=1635502393893 and the general page is here - https://poisoncentres.echa.europa.eu/poison-centres-notification-format

I think my option is to do it the long way by converting xsd into SQL (pref Mysql dialect) via maybe a json schema and then hand stitching things together with the hope that there are no huge differences when they release a new version - but this feels icky :)

Does anyone have any thoughts on a better method?

EDIT: So I also noticed this https://poisoncentres.echa.europa.eu/documents/1789887/6428404/echa_example_1_dossier.i6z/98873135-5373-d2df-0cc2-9177a403cad0?t=1590667188569 which is an example PCN document.. this at least gives relationships between chunks of xml, but I'm not sure these will always be available on new versions of the schema.. and boy its painful to parse..

baradhili
  • 514
  • 1
  • 7
  • 27
  • Which language(s) are you able to use for reading the XSDs and emitting the equivalent SQL? – kimbert Nov 20 '21 at 16:28
  • wide open - I'd prefer not to use java or .net as I'm not that good at it.. – baradhili Nov 21 '21 at 04:03
  • I think you have a decision to make. Do you a) invest a lot of time to implement a technically rigorous solution that can handle anything that might occur in the XSD model (both now and in the future) or b) implement a solution that handles the current version of the XSDs, and tweak it in future if necessary Some libraries that load XSDs are focused only on the task of _validating XML documents_. But option a) might require a library that can load an XSD _and allow general-purpose exploration of the XSD model_. Such libraries do exist, but there are not many, and not in every language. – kimbert Nov 21 '21 at 20:38
  • Yes, I'm feeling my solution is to document the steps taken to build the structure from the existing XSD/examples and also commit the existing XSDs so its easy to find the changes.. and hopefully along the way potentially build tooling to convert later :) - I will probably document my steps here too.. – baradhili Nov 22 '21 at 00:41

1 Answers1

0

So I'm going to steadily update this as I progress..

The first challenge is to bring all these XSDs into a single file so we can toss what is not being used and throw it into things like xsd2xml for generation of sample XML files..

How could I merge an XSD schema with imports and includes into a single file? Covers this - but much of what is mentioned is gone.

This - The 'minOccurs' attribute cannot be present looks tantalizing, but as usual all gone.

This looks the thing - https://sourceforge.net/projects/graphvisu/ but only wants HTTP - doesn't like file

Managed to get xsdwalker running - which shows up a problem.. it seems that the set of xsds doesn't really fit cleanly into a hierarchy.. which means we need to run with the sample xmls instead

Update:

Stupid me - the format uses href:xlink to reference other portions - but this isn't an XML cross reference, its just a href - so I am manually replacing the "name": { "xlink:type": "simple", "xlink:href": "52a44784-64a6-4836-b8d6-b84315fd958e_f53d48a9-17ef-48f0-8d0e-76d03007bdfe.i6d", "content": "Ethane-1,2-diol" },

with the expanded i6c:Document format - this is not pretty at all, and once I've coded something then I'll be cherry picking out what is needed... also as a complete facepalm and abuse of the XML format - almost everything in the XML i6c document is coded values - even though the idea of wasting all that bandwidth with readable variable names - is to have readable variables as well.. FFS

I compromised and inserted a "contents' key holding the referenced file contents - see Searching an XML structure but modifying a node higher in the hierarchy

baradhili
  • 514
  • 1
  • 7
  • 27