
I was under the impression that UTF-8 was the answer to everything :0

Problem: I'm using Play's idiomatic form handling to take input from a web page (a basic HTML textarea field) into a MySQL database through the Anorm abstraction layer (so everything is properly escaped), then reading that data back from the database to build an email, sent as HTML through the JavaMail API, that contains non-ASCII characters (accented characters such as é, for example). (I'd post more, but I suspect we might get strange artifacts here as well -- I'll try that in a comment below perhaps.)

I can use a moderate set of characters and create a TEXT email (edited via Atom and placed into the stream directly at the code level), and it comes through as an email with all the characters I've chosen intact.

I have not yet systematically worked through the characters; I was just using a relatively random sampling as an initial test.

When I place the same set of characters into a text field and try to save them to the database, I can only save about 1 in 5 of them, or fewer.

The errors look like this:

SQLException: Incorrect string value: '\xC4\x93\x0D\x0A\x0D\x0A...' for column 'content' at row 1

I suspect I'm about to learn a ton of new information about either Play and/or UTF-8 or HTML or some part of the chain where this is going off the rails.

My question then is this: Is there an idiomatic Play example of how to handle UTF-8 end to end, through Anorm and into JavaMail?

(I think I kinda expected it to be "built-in" but then I expected a LOT more to be baked into the core product as well...)

I want/need both a TEXT and an HTML path for the email portion. (I can write BOTH and they work fine -- the problem is moving alternate characters through the channels as indicated above.)

Techmag
  • The error being at JDBC level, it seems to me to have nothing to do specifically with Anorm/Play. – cchantep Jul 14 '15 at 23:02
  • Except for the fact that I've followed the Play playbook to build the app so why should we get an error on text entered in a field going to a database esp once it is processed (sanitized) by the Anorm driver? I'm not "blaming" play - I'm just asking where under the hood do I start looking to address this issue? – Techmag Jul 15 '15 at 13:14
  • The error is raised by your JDBC driver not understanding the text value in your DB. It has nothing to do with "upper" application features for me. – cchantep Jul 15 '15 at 14:24
  • I get that - but if the playbook says build it this way and I do and it doesn't work, then either I'm reading the playbook wrong OR the playbook is wrong. I'm just trying to figure out how I proceed next. I don't care which is wrong; I just need to know in order to make intelligent choices. – Techmag Jul 15 '15 at 14:56

2 Answers


I'm currently seeing if this might be an answer:

https://objectpartners.com/2013/04/24/html-encoding-utf-8-characters/

However presently hitting this roadblock...

How to turn off specific Implicit's in Scala that prevent code from compiling due to overloaded methods?

Techmag
  • This is NOT the answer -- it might be on the right track but it is not working (for me) yet -- what came out the other end didn't live long thankfully... – Techmag Jul 15 '15 at 18:43

This appears to be a hopeful candidate -- I am researching it now end to end.

import org.apache.commons.lang3._

// Convenience wrapper: encode the whole string and return the result.
def htmlEncode(input: String): String = htmlEncode_sb(input).toString

// Appends the HTML-encoded form of `input` to `stringBuilder` and returns the builder.
def htmlEncode_sb(input: String, stringBuilder: StringBuilder = new StringBuilder()): StringBuilder = {
  stringBuilder.synchronized {
    for ((c, i) <- input.zipWithIndex) {
      if (CharUtils.isAscii(c)) {
        // Encode common HTML equivalent characters
        stringBuilder.append(StringEscapeUtils.escapeHtml4(c.toString()))
      } else {
        // Why isn't this done in escapeHtml4()?
        // Non-ASCII characters become numeric character references.
        stringBuilder.append(s"""&#${Character.codePointAt(input, i)};""")
      }
    }
    stringBuilder
  }
}

In order to get it to work inside Play, you'll need this in the library dependencies of your build.sbt file:

  "org.apache.commons" % "commons-lang3" % "3.4",

This blog post led me to write that code: https://objectpartners.com/2013/04/24/html-encoding-utf-8-characters/

Update: Confirmed that it does work end to end.

The path tested: web page input as a TextArea inside a form, saved to a MySQL database (escaped by Anorm), reread from the database and displayed inside a TextArea on a web page, with the extended characters (visually) appearing precisely as input.

You'll need to call @Html(htmlContentString) inside the Twirl template to re-render this as the original HTML, but the browser (Safari 8.0.7) displayed exactly what I gave it after a round trip to and from the database.

One caveat -- it creates machine-readable HTML, not human-readable HTML. It would be nice if it didn't encode angle brackets and such, so it looks more like the HTML that we expect. I'm sure a pattern match block will be added next to exclude just that :)
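
A minimal sketch of that idea, under the assumption that the stored text is trusted markup: ASCII characters (angle brackets included) pass through untouched, and only non-ASCII characters are turned into numeric character references. The names `NonAsciiEncodeSketch` and `encodeNonAsciiOnly` are made up for illustration and are not part of Play or commons-lang3. The sketch uses only the standard library.

```scala
object NonAsciiEncodeSketch {
  // Hypothetical helper: leaves ASCII (including < > & ") as-is and
  // replaces every non-ASCII code point with a numeric character reference.
  def encodeNonAsciiOnly(input: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < input.length) {
      val cp = input.codePointAt(i)          // full code point at this index
      if (cp < 128) sb.append(cp.toChar)     // ASCII: copy unchanged
      else sb.append("&#").append(cp).append(';')
      i += Character.charCount(cp)           // advance 1 or 2 chars
    }
    sb.toString
  }
}
```

Because the angle brackets are no longer escaped, anything a user typed is rendered as live markup once it goes through @Html(...), so this variant only makes sense for input that has already been sanitised or that comes from a trusted author.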

Techmag
  • Also confirmed no "creepage" on multiple back-to-back round trips, e.g. saving and then redisplaying what was placed on screen on the previous test, many times over. – Techmag Jul 15 '15 at 20:23
  • This doesn't seem to handle surrogate pairs (e.g. emojis) correctly, try it with `"\uD83D\uDE00"` as the input. See [this answer](http://stackoverflow.com/a/37040891/305973) for more details. – robinst May 05 '16 at 01:53
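
The surrogate-pair problem raised in the last comment comes from iterating over chars: for `"\uD83D\uDE00"` the loop visits both halves of the pair and emits a reference for each. A possible sketch of a code-point-aware variant is below. It is not the code from the answer: `HtmlEncodeSketch`, `htmlEncodeCodePoints` and `escapeAscii` are hypothetical names, and `escapeAscii` is a hand-written stand-in for `StringEscapeUtils.escapeHtml4` (covering only `&`, `<`, `>` and `"`) so that the example needs nothing beyond the standard library.

```scala
object HtmlEncodeSketch {
  // Stand-in for escapeHtml4 on a single ASCII character (assumption:
  // only these four characters need escaping for this use case).
  private def escapeAscii(c: Char): String = c match {
    case '&' => "&amp;"
    case '<' => "&lt;"
    case '>' => "&gt;"
    case '"' => "&quot;"
    case _   => c.toString
  }

  // Walks the string by code point rather than by char, so a surrogate
  // pair yields one numeric character reference instead of two.
  def htmlEncodeCodePoints(input: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < input.length) {
      val cp = input.codePointAt(i)
      if (cp < 128) sb.append(escapeAscii(cp.toChar))
      else sb.append("&#").append(cp).append(';')
      i += Character.charCount(cp) // 2 for a surrogate pair, otherwise 1
    }
    sb.toString
  }
}
```

With this sketch, `HtmlEncodeSketch.htmlEncodeCodePoints("\uD83D\uDE00")` gives `&#128512;`, and the `ē` from the error message in the question (bytes `C4 93`, U+0113) becomes `&#275;`.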