0

I am currently using SSIS on a project where I need to verify the correct XML file structure. In particular, I have to check that there is no missing tag in the XML file and if so, I have to reassemble this line without tag. I'll give you an example to better understand.

<?xml version="1.0"?>
<catalog>
<DATA>0000000061E82D821590010000409525CD</DATA> 
<DATA>0000000061E82D8C163001000140AD0DF6</DATA> 
<DATA>0000000061E82D9616E301000240776CAB</DATA>
<DATA> 0000000061E82DA0178001000340C56B6</DATA> 
<DATA>0000000061E82DAA188001000440C0C7CB</DATA>
 0000000061E82DDAEA4001000540BB9A276
</catalog>

For example in the above XML there is a <DATA> tag missing. I have no influence on the creation of the XML. How could I notice that a <DATA> tag is missing (the number of data lines is not fixed), and subsequently retrieve that line where there is no tag ?

For example in the above xml there is a <DATA> tag missing. I have no influence on the creation of the XML.

The solution can be a suite of SSIS components or a c# script.

kjhughes
  • 106,133
  • 27
  • 181
  • 240
Dexter
  • 21
  • 3
  • [XML Schema validation](https://learn.microsoft.com/en-us/dotnet/standard/data/xml/xml-schema-xsd-validation-with-xmlschemaset). => or with [Linq 2 Xml](https://learn.microsoft.com/en-us/dotnet/standard/linq/validate-xsd). – Fildor Jan 30 '23 at 11:46
  • _"I have no influence on the creation of the XML."_ - You can use the linked tech to validate the XML. _But_ I'd strongly advise to _not_ try and "fix" broken XML. Just reject the input. Have the source fix its output. Even if that means _you need to push back_. And don't let things count like "Yeah but it's _obvious_ that it's just a missing ... " - No. The input is invalid. Period. – Fildor Jan 30 '23 at 11:52
  • @Fildor has a good point, but sometimes you're stuck with the input you're given and recalcitrance will just make trouble for you. In that case, if the schema is known, you could write a state-machine-based parser (or, possibly, even find one out there) that can kick out tags missing their closing tags. – Ann L. Jan 30 '23 at 11:58
  • *Actually*, if your XML has line breaks, and each tag is on a single line, couldn't you just filter out any lines that don't end in ">"? Or is the real XML too complex for that? – Ann L. Jan 30 '23 at 12:00
  • _"couldn't you just filter out any lines that don't end in ">"?"_ - I really think that would make things worse: Then you are silently dropping input data. – Fildor Jan 30 '23 at 12:01
  • I should have said "filter in" rather than "filter out!" I meant, as the questioner said, retrieve the lines that are missing closing tags. – Ann L. Jan 30 '23 at 12:02

1 Answers1

1

It is impossible to automatically correct invalid XML in the general case.

Terminology correction

For example in the above XML there is a <DATA> tag missing.

There is not a <DATA> tag missing. You probably mean that there are supposed to be begin and end DATA tags surrounding 0000000061E82DDAEA4001000540BB9A276. The difference is significant because if there were only a single tag missing, the "XML" would not be well-formed. If a schema says that a catalog element may only have DATA children, then the XML is not valid.

Don't try to automatically correct invalid XML

Best practice is to reject the input and force the sender/creator to fix the document. The entire raison d'être for a schema is to express the invariants that can be relied upon to process the data. Violating those invariants means all bets are off.

Don't be seduced by the superficial simplicity of peep-hole repair ideas

Every repair idea implies an assumption about the data that is not expressed in the schema, which would be bad because:

  • There should be a clearly and explicitly expressed definition of validity and
  • The assumptions
    • will likely not be expressed unambiguously.
    • may not be expressed at all.
    • may be incomplete or entirely incorrect.
    • will probably go unconfirmed because an errant producer that can/will not fix validity against a schema is unlikely to be able to assess the validity of an assumption over all data that it is, or could be, sending over all time.

See also

kjhughes
  • 106,133
  • 27
  • 181
  • 240