Background

I'm converting straight quotes into curled quotes and apostrophes within an XHTML document. Given a document with straight quotes (" and '), some pre-processing converts them to their curled, semantic equivalents (&ldquo;, &rdquo;, &lsquo;, &rsquo;, and &apos;). Typically, the same curled character (’) is used for both closing single quotes (&rsquo;) and apostrophes (&apos;), but this loses the semantic distinction, which I'd like to keep by using the entity instead, for subsequent translation to TeX (e.g., \quote{outer \quote{we’re inside quotes} outer}). Thus:

Markdown -> XHTML (straight) -> XHTML (curled) -> TeX

The code uses Java's built-in Document Object Model (DOM) classes.

Problem

Calling Node's setTextContent method double-encodes any ampersand, resulting in:

&amp;ldquo;I reckon, I&amp;apos;m &amp;apos;bout dat.&amp;rdquo;
&amp;ldquo;Elizabeth Davenport;&amp;rdquo; she said &amp;lsquo;Elizabeth&amp;rsquo; to be dignified, &amp;ldquo;and really my father owns the place.&amp;rdquo;

Rather than:

&ldquo;I reckon, I&apos;m &apos;bout dat.&rdquo;
&ldquo;Elizabeth Davenport;&rdquo; she said &lsquo;Elizabeth&rsquo; to be dignified, &ldquo;and really my father owns the place.&rdquo;

Disabling and re-enabling output escaping by setting the processing instruction didn't seem to work.
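
A minimal reproduction of the double-encoding, as a sketch. It assumes the document is serialized with the JDK's default Transformer (the real serialization code isn't shown here), and the class and method names are made up for illustration:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical demo class; not part of the original code.
class DoubleEncodingDemo {
  /** Builds a p element with entity-laden text, then serializes it. */
  static String demo() {
    try {
      final var doc = DocumentBuilderFactory
        .newInstance().newDocumentBuilder().newDocument();
      final var p = doc.createElement( "p" );
      doc.appendChild( p );

      // The DOM stores this as literal characters: '&', 'a', 'p', 'o', ...
      p.setTextContent( "I&apos;m" );

      // Assumption: serialization uses the JDK's default Transformer.
      final var transformer =
        TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );
      final var writer = new StringWriter();
      transformer.transform( new DOMSource( doc ), new StreamResult( writer ) );
      return writer.toString();
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  public static void main( final String[] args ) {
    // The literal ampersand is escaped on output, yielding &amp;apos;
    System.out.println( demo() );
  }
}
```

The text passed to setTextContent is treated as character data, never as markup, so the serializer has to escape the ampersand to keep the output well-formed.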

Code

Here's the code to walk a tree:

  public static void walk(
    final Document document, final String xpath,
    final Consumer<Node> consumer ) {
    assert document != null;
    assert consumer != null;

    try {
      final var expr = lookupXPathExpression( xpath );
      final var nodes = (NodeList) expr.evaluate( document, NODESET );

      if( nodes != null ) {
        for( int i = 0, len = nodes.getLength(); i < len; i++ ) {
          consumer.accept( nodes.item( i ) );
        }
      }
    } catch( final Exception ex ) {
      clue( ex );
    }
  }

Here's the code that replaces the quotes with curled equivalents:

walk(
  xhtml,
  "//*[normalize-space( text() ) != '']",
  node -> node.setTextContent( sConverter.apply( node.getTextContent() ) )
);

Where xhtml is the Document and sConverter curls quotes.

Question

How would you instruct the DOM to accept &apos; and friends without re-encoding the ampersand?

Dave Jarvis

2 Answers

Change the pre-processing to replace straight quotes with Unicode characters, not with invalid XML entities. Those entities are defined by HTML and are not valid XML.

  • &ldquo; should be “ or \u201C if written as a Java literal
  • &rdquo; should be ” or \u201D if written as a Java literal
  • &lsquo; should be ‘ or \u2018 if written as a Java literal
  • &rsquo; should be ’ or \u2019 if written as a Java literal
  • &apos; should be '
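
The mapping above could be applied with something like the following sketch. The class and method names are hypothetical, and it assumes the pre-processing step currently emits the five named entities as plain text:

```java
// Hypothetical helper; not part of the original code.
class EntityToUnicode {
  /** Replaces the five named entities with their Unicode characters. */
  static String convert( final String s ) {
    return s
      .replace( "&ldquo;", "\u201C" )  // left double quotation mark
      .replace( "&rdquo;", "\u201D" )  // right double quotation mark
      .replace( "&lsquo;", "\u2018" )  // left single quotation mark
      .replace( "&rsquo;", "\u2019" )  // right single quotation mark
      .replace( "&apos;", "'" );       // plain apostrophe, U+0027
  }

  public static void main( final String[] args ) {
    System.out.println( convert( "&ldquo;I reckon, I&apos;m &apos;bout dat.&rdquo;" ) );
  }
}
```

Because the result contains no ampersands, passing it to setTextContent leaves nothing for the serializer to escape.
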
Andreas
  • This was my first thought as well, but I believe he wants to attach some semantic meaning to `\u2019`; meaning, sometimes it represents the start of a nested quotation, and sometimes it’s just an apostrophe. – VGR Jun 28 '21 at 20:34
  • Represents the end of a nested quotation, I meant. – VGR Jun 28 '21 at 20:41
  • U+2019 is [preferred](http://www.unicode.org/reports/tr8/tr8-1.html) for the apostrophe. “U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to represent a punctuation mark, as in "We’ve been here before." In the latter case, U+2019 is also referred to as a punctuation apostrophe.” (The standard would be improved by defining two separate characters, though.) – Dave Jarvis Jun 29 '21 at 02:07
  • @DaveJarvis I'm sorry, I don't see the point of that comment. If you want to use `’` aka `\u2019` aka "right single quotation mark", then do that. The "apostrophe" is `'` aka `\u0027`. – Andreas Jun 29 '21 at 02:19
  • The statement `&apos; should be '` is a little misleading because, according to the Unicode specification, `'` is not the preferred character. Rather, it's U+2019. Consequently, if the replacement character were coded to spec, it'd be rather difficult to disambiguate on the TeX side. – Dave Jarvis Jun 29 '21 at 02:24
  • @DaveJarvis Both XML 1.0 and HTML 5 *define* the `&apos;` [entity](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML) as `'` aka `U+0027`. There is nothing "preferred" about that, it's a *definition*. If the XHTML document contains `&apos;`, the XML parser should parse that as `'`, unless expansion of entity references is disabled. – Andreas Jun 29 '21 at 02:50
  • I gave up trying to tease the `&apos;` through and gave up on the idea of allowing TeX to wrap the characters in `\quote{...}`. It _really sucks_ that `\u2019` is used for both curled closing quotes _and_ curled apostrophes. – Dave Jarvis Jun 29 '21 at 17:29
  • @DaveJarvis Agreed. They are different characters, conceptually, with very different meanings. Usually Unicode is good about such differentiations… – VGR Jun 29 '21 at 21:30

XML processors are free to treat characters and character entities as interchangeable, so trying to use character entities to indicate semantic meaning is destined to fail.
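
A small sketch of that interchangeability, using the JDK's standard DocumentBuilder (the class and method names here are invented for the example). Once parsed, the entity reference, the numeric reference, and the literal character are indistinguishable:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original code.
class EntityEquivalenceDemo {
  /** Parses an XML string and returns the root element's text content. */
  static String textOf( final String xml ) {
    try {
      final var builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
      final var doc = builder.parse( new InputSource( new StringReader( xml ) ) );
      return doc.getDocumentElement().getTextContent();
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  public static void main( final String[] args ) {
    final var entity = textOf( "<p>I&apos;m</p>" );
    final var numeric = textOf( "<p>I&#39;m</p>" );
    final var literal = textOf( "<p>I'm</p>" );

    // All three spellings produce the same three characters: I'm
    System.out.println( entity.equals( literal ) && numeric.equals( literal ) );
  }
}
```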

I would use markup instead. I suspect custom processing instructions would be a good way to “stealthily” add semantic meaning:

<text>"She told me, 'Don't forget the bread.'"</text>

would get turned into:

<text><?q?>“She told me, <?q?>‘Don’t forget the bread.<?q?>’<?q?>”</text>

Where the <?q?> processing instruction is a signal that the following codepoint has semantic meaning as a quotation mark.

Of course, you can have more than one custom processing instruction if you want:

<text><?quote-start?>“She told me, <?quote-start?>‘Don't forget the bread.<?quote-end?>’<?quote-end?>”</text>

For what it’s worth, the XHTML 2.0 draft defines its own <quote> element to handle this exact case.

(Regular HTML has a <q> element which is semantically similar, but which also tells browsers to automatically render the quotation marks, which means an HTML document which uses <q> must not include quotation marks of its own.)
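
As a rough sketch of how the processing-instruction approach might be consumed on the way to TeX: the code below reads the <?quote-start?> and <?quote-end?> markers from the example above and emits \quote{...}. The class, the method names, and the rule of dropping the single character after each marker are assumptions made for illustration, not an existing API:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original code.
class QuoteMarkerDemo {
  /** Converts the direct children of an element into TeX-like text. */
  static String toTex( final Node parent ) {
    final var out = new StringBuilder();
    final var children = parent.getChildNodes();
    boolean skipQuoteChar = false;

    for( int i = 0, len = children.getLength(); i < len; i++ ) {
      final var child = children.item( i );

      if( child.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE ) {
        final var target = ((ProcessingInstruction) child).getTarget();

        if( "quote-start".equals( target ) ) {
          out.append( "\\quote{" );
          skipQuoteChar = true;
        } else if( "quote-end".equals( target ) ) {
          out.append( "}" );
          skipQuoteChar = true;
        }
      } else if( child.getNodeType() == Node.TEXT_NODE ) {
        final var text = child.getNodeValue();

        // Assumption: the marked quotation mark is a single Java char.
        out.append( skipQuoteChar && !text.isEmpty()
                      ? text.substring( 1 ) : text );
        skipQuoteChar = false;
      }
    }

    return out.toString();
  }

  /** Parses the marked-up example and converts it. */
  static String demo() {
    final var xml = "<text><?quote-start?>\u201CShe told me, "
      + "<?quote-start?>\u2018Don\u2019t forget the bread."
      + "<?quote-end?>\u2019<?quote-end?>\u201D</text>";

    try {
      final var doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse( new InputSource( new StringReader( xml ) ) );
      return toTex( doc.getDocumentElement() );
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  public static void main( final String[] args ) {
    System.out.println( demo() );
  }
}
```

The apostrophe in “Don’t” has no marker in front of it, so it passes through untouched while the four marked quotation marks become \quote{ and } delimiters.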

VGR
  • Unfortunately, this would be more effort on the TeX side, which was already developed to parse the entities. It's a good idea, though. – Dave Jarvis Jun 29 '21 at 17:26