Prevent re-encoding ampersands using Node's setTextContent method

Question

Background

I am converting straight quotes into curled quotes and apostrophes within an XHTML document. Given a document with straight quotes (" and '), some pre-processing converts them to their curled, semantic equivalents (&ldquo;, &rdquo;, &lsquo;, &rsquo;, and &apos;). Typically, the curled character ’ is used for both closing single quotes (&rsquo;) and apostrophes (&apos;), but this loses the semantic distinction, which I'd like to keep by using the entity instead, for subsequent translation to TeX (e.g., \quote{outer \quote{we’re inside quotes} outer}). Thus:

Markdown -> XHTML (straight) -> XHTML (curled) -> TeX

The code is using Java's built-in document object model (DOM) classes.

Problem

Calling Node's setTextContent method double-encodes any ampersand, resulting in:

&amp;ldquo;I reckon, I&amp;apos;m &amp;apos;bout dat.&amp;rdquo;
&amp;ldquo;Elizabeth Davenport;&amp;rdquo; she said &amp;lsquo;Elizabeth&amp;rsquo; to be dignified, &amp;ldquo;and really my father owns the place.&amp;rdquo;

Rather than:

&ldquo;I reckon, I&apos;m &apos;bout dat.&rdquo;
&ldquo;Elizabeth Davenport;&rdquo; she said &lsquo;Elizabeth&rsquo; to be dignified, &ldquo;and really my father owns the place.&rdquo;

Disabling and re-enabling output escaping by setting the processing instruction didn't seem to work.
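
A minimal sketch that reproduces the behaviour, assuming the document is serialized with the JDK's default Transformer (the class and method names are illustrative and not part of the original code). setTextContent stores its argument as literal character data, so the serializer has to escape every & to keep the output well-formed:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class DoubleEncodingDemo {
  /** Serializes a node to a string, omitting the XML declaration. */
  static String serialize( final Node node ) throws Exception {
    final var transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );
    final var writer = new StringWriter();
    transformer.transform( new DOMSource( node ), new StreamResult( writer ) );
    return writer.toString();
  }

  /** Builds a one-element document and sets text that looks like an entity. */
  static String demo() throws Exception {
    final Document doc = DocumentBuilderFactory
      .newInstance().newDocumentBuilder().newDocument();
    final var p = doc.createElement( "p" );
    doc.appendChild( p );

    // The DOM sees seven ordinary characters here, not an entity reference.
    p.setTextContent( "I&apos;m" );

    return serialize( doc );
  }

  public static void main( final String[] args ) throws Exception {
    // The ampersand is escaped on output, giving "&amp;apos;".
    System.out.println( demo() );
  }
}
```

The text node holds the characters &, a, p, o, s, ; exactly as passed in; the escaping happens only when the tree is written out.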

Code

Here's the code to walk a tree:

  public static void walk(
    final Document document, final String xpath,
    final Consumer<Node> consumer ) {
    assert document != null;
    assert consumer != null;

    try {
      final var expr = lookupXPathExpression( xpath );
      final var nodes = (NodeList) expr.evaluate( document, NODESET );

      if( nodes != null ) {
        for( int i = 0, len = nodes.getLength(); i < len; i++ ) {
          consumer.accept( nodes.item( i ) );
        }
      }
    } catch( final Exception ex ) {
      clue( ex );
    }
  }

Here's the code that replaces the quotes with curled equivalents:

walk(
  xhtml,
  "//*[normalize-space( text() ) != '']",
  node -> node.setTextContent( sConverter.apply( node.getTextContent() ) )
);

Where xhtml is the Document and sConverter curls quotes.
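
A side note on the snippet above, offered as a hedged sketch rather than a fix for the ampersand problem: setTextContent on an element replaces all of its children, so nested markup such as an em element inside a paragraph would be flattened. Selecting text nodes and calling setNodeValue avoids that. The walk below is a self-contained stand-in for the original (lookupXPathExpression and clue are not reproduced), and the converter is a naive placeholder for sConverter:

```java
import java.io.StringReader;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WalkDemo {
  /** Applies the consumer to every node matched by the XPath expression. */
  static void walk(
    final Document document, final String xpath,
    final Consumer<Node> consumer ) throws Exception {
    final var expr = XPathFactory.newInstance().newXPath().compile( xpath );
    final var nodes = (NodeList) expr.evaluate( document, XPathConstants.NODESET );

    for( int i = 0, len = nodes.getLength(); i < len; i++ ) {
      consumer.accept( nodes.item( i ) );
    }
  }

  static Document demo() throws Exception {
    final var xml = "<p>\"Hi,\" she <em>said</em>.</p>";
    final var doc = DocumentBuilderFactory
      .newInstance().newDocumentBuilder()
      .parse( new InputSource( new StringReader( xml ) ) );

    // Placeholder for sConverter: turns every straight double quote into
    // U+201D. A real converter would pick opening and closing marks.
    final UnaryOperator<String> converter = s -> s.replace( '"', '\u201D' );

    // Rewrite each non-blank text node in place; elements are untouched.
    walk(
      doc,
      "//text()[normalize-space(.) != '']",
      node -> node.setNodeValue( converter.apply( node.getNodeValue() ) )
    );

    return doc;
  }

  public static void main( final String[] args ) throws Exception {
    final var doc = demo();
    // The em element is still present after the conversion.
    System.out.println( doc.getElementsByTagName( "em" ).getLength() );
  }
}
```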

Question

How would you instruct the DOM to accept &apos; and friends without re-encoding the ampersand?

This was my first thought as well, but I believe he wants to attach some semantic meaning to `\u2019`; meaning, sometimes it represents the start of a nested quotation, and sometimes it’s just an apostrophe. – VGR Jun 28 '21 at 20:34
Represents the end of a nested quotation, I meant. – VGR Jun 28 '21 at 20:41

U+2019 is [preferred](http://www.unicode.org/reports/tr8/tr8-1.html) for the apostrophe. “U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to represent a punctuation mark, as in "We’ve been here before." In the latter case, U+2019 is also referred to as a punctuation apostrophe.” (The standard would be improved by defining two separate characters, though.) – Dave Jarvis Jun 29 '21 at 02:07
@DaveJarvis I'm sorry, I don't see the point of that comment. If you want to use `’` aka `\u2019` aka "right single quotation mark", then do that. The "apostrophe" is `'` aka `\u0027`. – Andreas Jun 29 '21 at 02:19
The statement `&apos;` should be `'` is a little misleading because, according to the Unicode specification, `'` is not the preferred character. Rather, it's U+2019. Consequently, if the replacement character was coded to spec, it'd be rather difficult to disambiguate on the TeX-side. – Dave Jarvis Jun 29 '21 at 02:24
@DaveJarvis Both XML 1.0 and HTML 5 *define* the `&apos;` [entity](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML) as `'` aka `U+0027`. There is nothing "preferred" about that, it's a *definition*. If the XHTML document contains `&apos;`, the XML parser should parse that as `'`, unless expansion of entity references is disabled. – Andreas Jun 29 '21 at 02:50

I gave up trying to tease the `&apos;` through and gave up on the idea of allowing TeX to wrap the characters in `\quote{...}`. It _really sucks_ that `\u2019` is used for both curled closing quotes _and_ curled apostrophes. – Dave Jarvis Jun 29 '21 at 17:29

@DaveJarvis Agreed. They are different characters, conceptually, with very different meanings. Usually Unicode is good about such differentiations… – VGR Jun 29 '21 at 21:30

score 1 · Answer 2 · answered Jun 28 '21 at 20:36

XML processors are free to treat characters and character entities as interchangeable, so trying to use character entities to indicate semantic meaning is destined to fail.
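
As a small illustration of that interchangeability (a sketch using the JDK's default parser; the class name is made up for the example), a document written with character references and one written with the literal characters produce identical text once parsed, so nothing downstream can tell which form the author typed:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class EntityEquivalenceDemo {
  /** Parses the XML string and returns the root element's text content. */
  static String textOf( final String xml ) throws Exception {
    final var builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    final var doc = builder.parse( new InputSource( new StringReader( xml ) ) );
    return doc.getDocumentElement().getTextContent();
  }

  public static void main( final String[] args ) throws Exception {
    // &apos; is predefined by XML; &#x2019; is a numeric character reference.
    final var viaReferences = textOf( "<text>Don&apos;t, don&#x2019;t</text>" );
    final var viaLiterals = textOf( "<text>Don't, don\u2019t</text>" );

    // Both parse to the same string, so this prints true.
    System.out.println( viaReferences.equals( viaLiterals ) );
  }
}
```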

I would use markup instead. I suspect custom processing instructions would be a good way to “stealthily” add semantic meaning:

<text>"She told me, 'Don't forget the bread.'"</text>

would get turned into:

<text><?q?>“She told me, <?q?>‘Don’t forget the bread.<?q?>’<?q?>”</text>

Where the <?q?> processing instruction is a signal that the following codepoint has semantic meaning as a quotation mark.

Of course, you can have more than one custom processing instruction if you want:

<text><?quote-start?>“She told me, <?quote-start?>‘Don’t forget the bread.<?quote-end?>’<?quote-end?>”</text>
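
A rough sketch of how such markers could be inserted with the standard DOM API, assuming the quote converter can report the offsets of true quotation marks (the markQuoteAt helper and the way offsets are found here are hypothetical). Text.splitText and Document.createProcessingInstruction do the work:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

public class QuoteMarkerDemo {
  /** Inserts a q processing instruction before the character at offset. */
  static void markQuoteAt( final Text text, final int offset ) {
    final Document doc = text.getOwnerDocument();
    // Split so the quotation mark begins its own text node (no split at 0).
    final Text quote = offset == 0 ? text : text.splitText( offset );
    quote.getParentNode().insertBefore(
      doc.createProcessingInstruction( "q", "" ), quote );
  }

  static Element demo() throws Exception {
    final Document doc = DocumentBuilderFactory
      .newInstance().newDocumentBuilder().newDocument();
    final Element element = doc.createElement( "text" );
    doc.appendChild( element );

    final String s =
      "\u201CShe told me, \u2018Don\u2019t forget the bread.\u2019\u201D";
    final Text text = doc.createTextNode( s );
    element.appendChild( text );

    // Offsets of the four quotation marks, highest first so that earlier
    // offsets stay valid after each split. The apostrophe is left unmarked.
    final int[] offsets = {
      s.lastIndexOf( '\u201D' ), s.lastIndexOf( '\u2019' ),
      s.indexOf( '\u2018' ), s.indexOf( '\u201C' )
    };

    for( final int offset : offsets ) {
      markQuoteAt( text, offset );
    }

    return element;
  }

  public static void main( final String[] args ) throws Exception {
    final Element element = demo();
    // List the children: processing instructions alternate with text runs.
    for( Node n = element.getFirstChild(); n != null; n = n.getNextSibling() ) {
      System.out.println( n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE
        ? "<?" + n.getNodeName() + "?>" : n.getNodeValue() );
    }
  }
}
```

Because getTextContent on an element skips processing instructions, code that only reads the text still sees the original sentence; only consumers that look for the markers are affected.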

For what it’s worth, XHTML defines its own <quote> element to handle this exact case.

(Regular HTML has a <q> element which is semantically similar, but which also tells browsers to automatically render the quotation marks, which means an HTML document which uses <q> must not include quotation marks of its own.)

Unfortunately, this would be more effort on the TeX side, which was already developed to parse the entities. It's a good idea, though. – Dave Jarvis Jun 29 '21 at 17:26
