Background
I'm converting straight quotes into curled quotes and apostrophes within an XHTML document. Given a document with straight quotes (" and '), some pre-processing converts the straight quotes to their curled, semantic equivalents (“, ”, ‘, ’, and &apos;). Typically, the curled character ’ is used for both closing single quotes (’) and apostrophes (&apos;), but this loses the semantic meaning, which I'd like to keep by using the entity instead, for subsequent translation to TeX (e.g., \quote{outer \quote{we’re inside quotes} outer}). Thus:
Markdown -> XHTML (straight) -> XHTML (curled) -> TeX
The code uses Java's built-in document object model (DOM) classes.
Problem
Calling Node's setTextContent method double-encodes any ampersand, resulting in:
“I reckon, I&amp;apos;m &amp;apos;bout dat.”
“Elizabeth Davenport;” she said ‘Elizabeth’ to be dignified, “and really my father owns the place.”
Rather than:
“I reckon, I&apos;m &apos;bout dat.”
“Elizabeth Davenport;” she said ‘Elizabeth’ to be dignified, “and really my father owns the place.”
Disabling and enabling output escaping by setting the processing instruction didn't seem to work.
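A minimal, self-contained reproduction of the symptom might look like the sketch below. The class name DoubleEncodingDemo and its two helper methods are hypothetical and not part of the code in this question; the sketch assumes the document is serialized with the JDK's default identity Transformer. It stores the literal text I&apos;m in an element and serializes it, at which point the ampersand is escaped.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical demo class; names are illustrative only.
final class DoubleEncodingDemo {
  /** Serializes a node to a string, omitting the XML declaration. */
  static String serialize( final Node node ) {
    try {
      final var transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );

      final var writer = new StringWriter();
      transformer.transform( new DOMSource( node ), new StreamResult( writer ) );
      return writer.toString();
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  /** Wraps the text in a p element of a new document, then serializes it. */
  static String roundTrip( final String text ) {
    try {
      final Document doc = DocumentBuilderFactory
        .newInstance().newDocumentBuilder().newDocument();
      final var p = doc.createElement( "p" );

      // The DOM treats the argument as character data, not markup, so the
      // ampersand is an ordinary character here.
      p.setTextContent( text );
      doc.appendChild( p );

      // On output, the serializer escapes that ampersand.
      return serialize( doc );
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  public static void main( final String[] args ) {
    System.out.println( roundTrip( "I&apos;m" ) );
  }
}
```

The escaping happens when the tree is written out: setTextContent only stores the characters it is given, and the serializer then has to escape the ampersand to keep the output well-formed.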
Code
Here's the code to walk a tree:
public static void walk(
  final Document document, final String xpath,
  final Consumer<Node> consumer ) {
  assert document != null;
  assert consumer != null;

  try {
    final var expr = lookupXPathExpression( xpath );
    final var nodes = (NodeList) expr.evaluate( document, NODESET );

    if( nodes != null ) {
      // Apply the consumer to each node matched by the XPath expression.
      for( int i = 0, len = nodes.getLength(); i < len; i++ ) {
        consumer.accept( nodes.item( i ) );
      }
    }
  } catch( final Exception ex ) {
    clue( ex );
  }
}
Here's the code that replaces the quotes with curled equivalents:
walk(
  xhtml,
  "//*[normalize-space( text() ) != '']",
  node -> node.setTextContent( sConverter.apply( node.getTextContent() ) )
);
Where xhtml is the Document and sConverter curls quotes.
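For anyone who wants to run the traversal on its own, here is a rough, self-contained sketch of the same idea using only the standard javax.xml.xpath API. Everything in it is an assumption made for illustration: the class name WalkDemo, the inlined XPath compilation standing in for lookupXPathExpression, the rethrown exception standing in for clue, and a trivial stand-in converter that only replaces straight apostrophes (the real sConverter does far more).

```java
import java.util.function.Consumer;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo class; helper names are illustrative only.
final class WalkDemo {
  /** Applies the consumer to every node matched by the XPath expression. */
  static void walk(
    final Document document, final String xpath,
    final Consumer<Node> consumer ) {
    try {
      final var expr = XPathFactory.newInstance().newXPath().compile( xpath );
      final var nodes = (NodeList) expr.evaluate( document, XPathConstants.NODESET );

      for( int i = 0, len = nodes.getLength(); i < len; i++ ) {
        consumer.accept( nodes.item( i ) );
      }
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  /** Puts the text in a p element, converts it, and returns the new text. */
  static String convertedText( final String text ) {
    try {
      final Document xhtml = DocumentBuilderFactory
        .newInstance().newDocumentBuilder().newDocument();
      final var p = xhtml.createElement( "p" );
      p.setTextContent( text );
      xhtml.appendChild( p );

      // Stand-in for sConverter: replaces straight apostrophes only.
      final UnaryOperator<String> converter = s -> s.replace( "'", "&apos;" );

      walk(
        xhtml,
        "//*[normalize-space( text() ) != '']",
        node -> node.setTextContent( converter.apply( node.getTextContent() ) )
      );

      return p.getTextContent();
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  public static void main( final String[] args ) {
    System.out.println( convertedText( "I'm 'bout dat." ) );
  }
}
```

After the walk, the text node holds the literal characters &apos; rather than an entity reference node, so from the DOM's point of view the ampersand is plain character data.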
Question
How would you instruct the DOM to accept &apos; and friends without re-encoding the ampersand?
Related
Semi-related questions: