My code is in Scala.js, but I think the gist of it should be easy to understand from a JavaScript perspective:
    def htmlToXHTML(input: String)
                   (implicit parser: DOMParser, serializer: XMLSerializer): String = {
      val doc = parser.parseFromString(input, "text/html")
      val body = getElementByXpath("/html/body", doc).singleNodeValue
      val bodyXmlString = serializer.serializeToString(body)
      val xmldoc = parser.parseFromString(bodyXmlString, "application/xml")
      val xmlDocElems: NodeList = xmldoc.getElementsByTagName("*")
      xmlDocElems.foreach {
        case elem: Element =>
          elem.removeAttribute("xmlns")
          println(s"Found element $elem with html: ${elem.outerHTML}")
        case node => println(s"Warning: found unexpected non-element node: $node.")
      }
      xmldoc.firstElementChild.innerHTML
    }
getElementByXpath is used above, so I'm including it for completeness (https://stackoverflow.com/a/14284815/3096687):
    def getElementByXpath(xpath: String, doc: Document): XPathResult =
      doc.evaluate(
        xpath, doc, null.asInstanceOf[XPathNSResolver],
        XPathResult.FIRST_ORDERED_NODE_TYPE, null
      )
In short, this function parses an HTML string into an HTML document, serializes the body to XML, reparses that string as XML, then loops over every element in the document (foreach) and removes its xmlns attribute. However, the resulting innerHTML still has xmlns attributes on the elements. The first println (aka console.log) shows that we are finding the elements in question, yet their xmlns attributes are not being removed.
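To narrow this down, here is a minimal sketch of a check I could run in the browser. It is only a diagnostic, not a fix: inspectXmlns and XmlnsCheck are hypothetical names that are not part of the code above, and the import assumes a scala-js-dom version where these types live under org.scalajs.dom.raw and parseFromString accepts a plain String MIME type. It prints what the DOM reports about an element before and after removeAttribute("xmlns"), next to the serialized output.

```scala
import org.scalajs.dom.raw.{DOMParser, Element}

object XmlnsCheck {
  // Hypothetical helper (not in the original code): prints what the DOM
  // reports about an element before and after removeAttribute("xmlns").
  def inspectXmlns(elem: Element): Unit = {
    def report(label: String): Unit = {
      val hasXmlns = elem.hasAttribute("xmlns")
      val ns = elem.namespaceURI
      println(s"$label: hasAttribute=$hasXmlns, namespaceURI=$ns")
    }
    report("before")
    elem.removeAttribute("xmlns")
    report("after")
    // Serialized form of the element after the removal attempt.
    val serialized = elem.outerHTML
    println(s"serialized: $serialized")
  }

  // Must run in a browser (or another environment that provides DOMParser).
  def main(args: Array[String]): Unit = {
    val parser = new DOMParser()
    // Same kind of input that the second parseFromString call receives.
    val xmldoc = parser.parseFromString(
      """<body xmlns="http://www.w3.org/1999/xhtml"><p>hi</p></body>""",
      "application/xml"
    )
    inspectXmlns(xmldoc.documentElement)
  }
}
```

If hasAttribute turns out to be false after the call while the serialized string still contains xmlns, that would suggest the serializer is writing the declaration from the element's namespaceURI rather than from a stored attribute, which would be a different cause from a DTD default.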
The problem may derive from default values specified in a DTD:
If a default value for the attribute is defined in a DTD, a new attribute immediately appears with the default value