I’m parsing some XML (the document.xml payload of an MS Word .docx, if that matters), for which I need scrupulously correct character offsets. I’m using the NSXMLDocument family from Cocoa (OS X) for tree parsing. I’ve solved most of the problems, except that the parser reports runs of space characters as single spaces.
The atom for runs of text in this document is <w:t/>. In some cases, there is a single-space run:
<w:t xml:space="preserve"> </w:t>
The space was suppressed until I supplied the NSXMLDocumentTidyXML option when I instantiated the top-level XML object:
let xmlDocument = try? NSXMLDocument(data: fileData, options: NSXMLDocumentTidyXML)
Great, but it doesn’t solve everything. Consider this in the XML content:
<w:t>available to be digitized and posted.  But while there</w:t>
You may notice there are two spaces after the period. The NSXMLElement representing the <w:t/>, and any element containing it, insists that there is only one space after the period, as reported by theElement.stringValue! and by the debugging representation of the node:
<w:t>available to be digitized and posted. But while there</w:t>
I could live with this, but my count has to be consistent with the renderers in Pages, Word, and NSAttributedString, all of which preserve runs of spaces as such.
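To make the discrepancy concrete, the space runs can be counted in plain strings, independently of any XML parser. The following is only a minimal sketch: the helper name spaceRunLengths is made up for illustration, the two sample strings are the raw and parsed text of the run quoted above, and it is written in current Swift syntax rather than the older syntax used elsewhere in this post.

```swift
// Hypothetical helper (not part of Cocoa): returns the length of every
// run of U+0020 space characters in `text`, in order of appearance.
func spaceRunLengths(in text: String) -> [Int] {
    var runs: [Int] = []
    var current = 0
    for character in text {
        if character == " " {
            current += 1              // still inside a run of spaces
        } else if current > 0 {
            runs.append(current)      // a run just ended
            current = 0
        }
    }
    if current > 0 { runs.append(current) }  // run at the very end
    return runs
}

// Text as it appears in document.xml (two spaces after the period).
let raw = "available to be digitized and posted.  But while there"
// Text as reported by the parsed element (one space after the period).
let parsed = "available to be digitized and posted. But while there"

print(spaceRunLengths(in: raw))     // [1, 1, 1, 1, 1, 2, 1, 1]
print(spaceRunLengths(in: parsed))  // [1, 1, 1, 1, 1, 1, 1, 1]
```

Any run longer than 1 that appears in the raw text but not in the parsed text shifts every later character offset by the difference.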
I’ve tried brute-forcing the <w:t/> elements by imposing xml:space="preserve" on all of them:
let spacePreserveAttribute = NSXMLNode.attributeWithName("xml:space", stringValue: "preserve") as! NSXMLNode
// ...
if let tElements = try? graf.nodesForXPath("descendant::w:t") as! [NSXMLElement] {
    for t in tElements {
        // Append a fresh copy of the attribute to each <w:t/> element.
        var tAttrs: [NSXMLNode] = t.attributes ?? []
        tAttrs.append(
            spacePreserveAttribute.copy() as! NSXMLNode
        )
        t.attributes = tAttrs
    }
}
I’m prepared to believe this is bad code, but one way or another, it has no effect on the problem. Presumably it is too late by then, because the whitespace was already collapsed when the document was parsed.
How do I get the Cocoa XML tree parser to stop collapsing runs of spaces into a single space? Surely this is a solved problem; the world has not resigned itself to having its spaces collapsed.
I would like to avoid a third-party XML parser if at all possible.