How can I write a pattern that ignores HTML within an element, rather than having the validator try to validate it?

<stuff>
   <data>
      this is some text <b>with the odd</b> bit of html<p>and unclosed tags
   </data>
</stuff>

This isn't valid, but I tried things like:

datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"
start = stuff

stuff = element stuff
{
   element data { * }
}
Adrian Cornish

2 Answers

You can't allow arbitrary unmodified HTML within XML. Either escape the individual special characters (see "What are the official XML reserved characters?") or enclose the HTML in a CDATA section (see "Is it possible to insert HTML content in XML document?").
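
A minimal sketch of the two options, assuming Python's standard library on the producing side and the fragment from the question as input. The variable names are illustrative only, and the `]]>` splitting is a common workaround rather than anything the answer prescribes:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html = "this is some text <b>with the odd</b> bit of html<p>and unclosed tags"

# Option 1: escape &, < and > so the markup becomes plain character data.
escaped_doc = "<stuff><data>" + escape(html) + "</data></stuff>"

# Option 2: wrap the fragment in a CDATA section. A literal "]]>" inside the
# fragment would end the section early, so split it across two sections.
cdata_doc = (
    "<stuff><data><![CDATA["
    + html.replace("]]>", "]]]]><![CDATA[>")
    + "]]></data></stuff>"
)

# Both documents are well-formed; the parser hands back the original string.
escaped_text = ET.fromstring(escaped_doc).find("data").text
cdata_text = ET.fromstring(cdata_doc).find("data").text
```

In both cases the content of data is plain text as far as the XML parser is concerned, so a schema along the lines of `element data { text }` should be enough.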

Rintze Zelle
  • Not my data, it is from an external source so escaping it isn't possible. – Adrian Cornish May 23 '16 at 23:45
  • Well, it's easy for any XML parser to trip over the HTML. I guess you could sanitize it on import with e.g. a regular expression-based find-and-replace before feeding the file to the XML parser. – Rintze Zelle May 24 '16 at 01:49
  • so far so good - the html is good and I've added the few tags that are used to the rnc - not ideal - but it is working. – Adrian Cornish May 24 '16 at 03:14
  • That still wouldn't work with unclosed tags, though, right? (see https://www.w3.org/TR/html5/syntax.html#optional-tags for a full list) – Rintze Zelle May 24 '16 at 03:20
  • No, it won't. So far, in the 15K lines of XML I've gone through, the HTML is well formed. I expect that somewhere it is not, though. It seems odd to me that I cannot absorb the value of an element 'as-is'. – Adrian Cornish May 24 '16 at 03:31
  • Well, XML is just much more strict than HTML. – Rintze Zelle May 24 '16 at 03:36
  • And again, the CDATA sections are specifically meant to allow you to store things like raw HTML. See also http://stackoverflow.com/questions/2784183/what-does-cdata-in-xml-mean – Rintze Zelle Aug 10 '16 at 13:31
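
The find-and-replace clean-up suggested in the comments above could be sketched roughly as follows. This is only an illustration: the helper `wrap_data_in_cdata` is hypothetical, and it assumes that data elements carry no attributes, are never nested, and that the HTML never contains the string `</data>` itself.

```python
import re
import xml.etree.ElementTree as ET

# Matches one <data>...</data> element; DOTALL lets the content span lines.
DATA_RE = re.compile(r"<data>(.*?)</data>", re.DOTALL)

def wrap_data_in_cdata(raw: str) -> str:
    """Wrap the content of every <data> element in a CDATA section."""
    def repl(match):
        # Split any literal "]]>" so it cannot terminate the section early.
        body = match.group(1).replace("]]>", "]]]]><![CDATA[>")
        return "<data><![CDATA[" + body + "]]></data>"
    return DATA_RE.sub(repl, raw)

raw = """<stuff>
   <data>
      this is some text <b>with the odd</b> bit of html<p>and unclosed tags
   </data>
</stuff>"""

# After the rewrite the file parses, and the HTML comes back as plain text.
cleaned = wrap_data_in_cdata(raw)
data_text = ET.fromstring(cleaned).find("data").text
```

One caveat: entity references that were already escaped in the source (such as `&amp;`) end up as literal text once they sit inside a CDATA section, so this only suits input where the HTML is truly raw.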

You won't be able to validate an XML document that contains non-well-formed HTML: a document that is not well formed is not an XML document at all. But if the input you're getting is in fact XML, then you can certainly define data to allow any well-formed HTML elements, or any well-formed XML.
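
To see the well-formedness point in practice, here is a small sketch using Python's standard-library parser. The helper `is_well_formed` and the two sample strings are made up for illustration (shortened from the question's example), and no schema is involved yet:

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Modelled on the sample from the question: <p> is never closed.
unclosed = "<stuff><data>text <b>bold</b> html<p>and unclosed tags</data></stuff>"
# The same content with the <p> element closed.
closed = "<stuff><data>text <b>bold</b> html<p>and closed tags</p></data></stuff>"

unclosed_ok = is_well_formed(unclosed)
closed_ok = is_well_formed(closed)
```

Only the second string gets past the parser, so only the second one could ever reach a RELAX NG validator.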

Allowing any well-formed XML is the simplest. We define a pattern that means "any well-formed XML here": any elements encountered are validated using the same pattern, recursively:

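# Text, any attribute, or any element whose content matches this same pattern.
wellformed-xml = (text
                 | attribute * { text }
                 | element * { wellformed-xml }
                 )*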

Now define the data element to use that pattern:

stuff = element stuff {
            element data { wellformed-xml }
        }

If you really want to ensure that it's just HTML, you'll want a name class more restrictive than "*". I've populated it with b, i, p, span, and div, and leave it as an exercise for you to add the other elements you want.

start = stuff
stuff =
  element stuff {
    element data { wellformed-html }
  }

wellformed-html =
  (text
   | element b | div | i | p | span { wellformed-html }
   )*

If you want to be able to support XHTML input as well, you'll want to use a namespace reference; again, an exercise for the reader.

C. M. Sperberg-McQueen