Update: The context is MuleSoft. Are there any libraries that could be used to handle scenarios like this?

I have an unusual requirement: an API implementation needs to accept malformed XML and correctly escape any special characters (such as & and <) that appear unescaped where they are not allowed, i.e. in attribute values or element content. They can occur anywhere in the document.

The goal is to prevent APIKit/schema validation errors up front, and to avoid failures in later DW transforms that expect valid XML.

A simplified example:

<CARS>
  <CAR>
    <MODEL ALIAS="City & Co">alpha city</MODEL>
    <YEAR>1992</YEAR>
    <MANAFACTURER>Penguin</MANAFACTURER>
    <OTHER>Made in UK & US</OTHER>
  </CAR>
  <CAR>
    <MODEL ALIAS="City & Co" MAKE="BMW">venturi city</MODEL>
    <YEAR>1994</YEAR>
    <MANAFACTURER>Penguin</MANAFACTURER>
    <OTHER>BHP > 1000</OTHER>
  </CAR>
</CARS>

Is there an easy way, in DW or with an external library, to parse XML like this and correctly escape special characters such as &, < and >?

user1905307
  • Some TagSoup or HTML parsers might work but I have no idea whether or how you can use them in your context. – Martin Honnen Aug 06 '21 at 17:28
  • For instance, https://xsltfiddle.liberty-development.net/3MP42Ns uses David Carlisle's XSLT 2 implementation of a tag soup parser to parse your markup into XML. To feed it to the XSLT there, I have wrapped it as a CDATA section inside an input element, but you could run with a named template and either pass in the file URI and use `unparsed-text`, or pass in the content as a string parameter. – Martin Honnen Aug 06 '21 at 17:34
  • It's not an uncommon problem but it's a very difficult one. People who generate invalid XML need to recognise that they are making life very difficult for their users - it's like giving people a lamp fitting that only works with non-standard light bulbs. – Michael Kay Aug 06 '21 at 18:38
  • I don't think it is fully fair to close the question because other solutions mentioned don't specifically target Mule which is in the scope of the question. – aled Aug 07 '21 at 00:14
  • Thanks for your input, all. This seems to suggest that a third-party library like TagSoup may need to be explored, but I'm not sure whether that's trivial with MuleSoft. Also, I agree, Aled: it would have been good to keep this open. This might be the first time I’ve faced this issue, but I’m sure it’s not uncommon, and it would be useful to hear others' thoughts. – user1905307 Aug 07 '21 at 06:15

1 Answer

Whatever is generating that is not generating XML, but a string formatted similarly to XML. No standards-compliant parser will parse invalid XML like the example provided. You can try to hack it with string manipulation in DataWeave, Groovy, Java, etc., but you cannot process it as XML until the special characters are correctly escaped, and it is difficult to cover all possible cases that way. It might be easier to enclose each value in a CDATA section.
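As a rough illustration of the string-manipulation route, here is a minimal Java sketch that escapes only bare ampersands before handing the text to a standard parser. The class and method names (`XmlAmpersandFixer`, `escapeBareAmpersands`, `parse`) are hypothetical and not part of Mule or any library. It assumes the input contains no CDATA sections or comments (an & inside those would be escaped wrongly), and it does nothing about a stray < in text or attributes, which is the genuinely hard case. Note that an unescaped > in element content, as in the question's second CAR, is already legal XML, so the & characters are the only thing breaking that particular sample.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: a sketch, not a complete repair tool.
final class XmlAmpersandFixer {

    // An '&' that is NOT the start of a named entity (&amp;), a decimal
    // character reference (&#38;) or a hex character reference (&#x26;).
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Replaces every bare '&' with '&amp;'; existing references are left alone.
    static String escapeBareAmpersands(String raw) {
        return BARE_AMP.matcher(raw).replaceAll("&amp;");
    }

    // Parses with the JDK's standard DOM parser; throws if still not well-formed.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String raw = "<CAR><MODEL ALIAS=\"City & Co\">alpha city</MODEL>"
                   + "<OTHER>Made in UK & US</OTHER></CAR>";
        Document doc = parse(escapeBareAmpersands(raw));
        // Read the repaired attribute back through the DOM.
        String alias = doc.getElementsByTagName("MODEL").item(0)
                          .getAttributes().getNamedItem("ALIAS").getNodeValue();
        System.out.println(alias);
    }
}
```

Text that looks like an entity reference but is not one (for example `&foo;` with no declaration) passes through unchanged and will still fail in the parser, which is one more reason this approach cannot cover every case.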

The real solution would be to generate valid XML at the source.

aled
  • Agreed, trying to hack this with string manipulation would be a nightmare. The ‘real’ XML I’m dealing with is far more complicated in structure than the simple example I shared above. I’ve said many times that not resolving this at the source is the wrong technical solution. – user1905307 Aug 07 '21 at 06:17