How can I set the character encoding in RTF for text that is encoded as UTF-8?

I studied similar questions, but did not find a good solution. So, I hope you can help.

The content is stored in a SQLite database. SQLite stores text only in a Unicode encoding (UTF-8 or UTF-16), so I have to stick with UTF-8.

The accented character (e") is displayed correctly in a SQLite database browser.

The required target program, which can only read RTF, displays the characters in a strange way.
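One plausible explanation for the strange display, offered as an assumption rather than a diagnosis of this particular program: the UTF-8 bytes are written into the RTF file unchanged, and the reader interprets each byte as a single character of an ANSI code page. The minimal sketch below reproduces that effect in Java. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Misread {
    /** Decodes the UTF-8 bytes of {@code text} as if they were Windows-1252 text. */
    static String misreadAsCp1252(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the two bytes C3 A9 in UTF-8.
        // Read byte by byte as Windows-1252, those become U+00C3 and U+00A9.
        String garbled = misreadAsCp1252("\u00e9");
        System.out.println(garbled.length()); // prints 2: one character turned into two
    }
}
```

If the target program shows two odd characters where one accented letter is expected, this kind of byte-level misreading is a likely cause, and escaping the characters in RTF syntax avoids it.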

I tried for example:

{\rtf1\ansi\ansicpg0\uc0...
{\rtf1\ansi\ansicpg1252\uc0...
{\rtf1\ansi\ansicpg65001\uc0...

One option is to map the special characters to their RTF character equivalents, as shown in this table.
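Such a mapping does not have to be typed out by hand. As a rough sketch, assuming the RTF header declares \ansicpg1252, the \'xx escape for a character is just its byte value in Windows-1252, which Java can compute. The names RtfHexEscaper and toHexEscapes are hypothetical, and replacing unmappable characters with '?' is a simplification of this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class RtfHexEscaper {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Escapes non-ASCII characters as RTF \'xx sequences (Windows-1252 byte values). */
    static String toHexEscapes(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);          // RTF syntax characters need a backslash
            } else if (c < 128) {
                sb.append(c);                       // plain ASCII passes through
            } else if (encoder.canEncode(c)) {
                int b = String.valueOf(c).getBytes(CP1252)[0] & 0xff;
                sb.append(String.format("\\'%02x", b));
            } else {
                sb.append('?');                     // not in the code page: placeholder only
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexEscapes("caf\u00e9")); // prints caf\'e9
    }
}
```

This only covers characters that exist in the chosen code page. Anything else needs a different mechanism.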

tm1701

2 Answers

The site you mentioned links to Unicode in RTF:

If the character is between 255 and 32,768, express it as \uc1\unumber*. For example, 可, character number 21,487, is \uc1\u21487* in RTF.

If the character is between 32,768 and 65,535, subtract 65,536 from it, and use the resulting negative number. For example, 道 is character 36,947, so we subtract 65,536 to get -28,589 and we have \uc1\u-28589* in RTF.

If the character is over 65,535, then we can’t express it in RTF.

Looks like RTF doesn't know UTF-8 at all, only Unicode in general. Other answers for Java and C# just use the \u directly.
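The quoted rule can be sketched in Java roughly as follows. This is an illustration under assumptions, not a reference implementation: the names RtfUnicodeEscaper and toUnicodeEscapes are invented, '?' is chosen arbitrarily as the fallback character that the quote writes as *, and every character from 128 upwards is escaped rather than only those above 255. Casting a char to short yields the same value as subtracting 65,536 for the upper half of the range.

```java
final class RtfUnicodeEscaper {
    /** Escapes non-ASCII UTF-16 code units as \uc1\uN? sequences. */
    static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);   // RTF syntax characters need a backslash
            } else if (c < 128) {
                sb.append(c);                // plain ASCII passes through
            } else {
                // (short) turns 32768..65535 into the negative numbers RTF expects.
                sb.append("\\uc1\\u").append((int) (short) c).append('?');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+53EF is 21487 and U+9053 is 36947, the two examples from the quote.
        System.out.println(toUnicodeEscapes("\u53ef\u9053"));
        // prints \uc1\u21487?\uc1\u-28589?
    }
}
```

Because a Java String is already UTF-16, text read from the database only has to be decoded from UTF-8 into a String before calling this. Characters above 65,535 arrive as two code units and are escaped one unit at a time. Whether a given RTF reader reassembles them is not something this sketch guarantees.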

AmigoJack

I read in many places that RTF doesn't have a UTF-8 standard solution.

So, I created my own converter after scanning half the internet. If you have a standard/better solution, please let me know!

After studying this book, I created a converter based on these character mappings. Both are great resources.

This solved my problem. I would have preferred to reuse an existing solution for this kind of feature, but alas, I was not able to find one.

The converter could be something like:

public static String convertHtmlToRtf(String html) {
    String tmp = html.replaceAll("\\R", " ")
            .replaceAll("\\\\", "\\\\\\\\")
            .replaceAll("\\{", "\\\\{")
            .replaceAll("}", "\\\\}");
    tmp = tmp.replaceAll("<a\\s+target=\"_blank\"\\s+href=[\"']([^\"']+?)[\"']\\s*>([^<]+?)</a>",
            "{\\\\field{\\\\*\\\\fldinst HYPERLINK \"$1\"}{\\\\fldrslt \\\\plain \\\\f2\\\\b\\\\fs20\\\\cf2 $2}}");
    tmp = tmp.replaceAll("<a\\s+href=[\"']([^\"']+?)[\"']\\s*>([^<]+?)</a>",
            "{\\\\field{\\\\*\\\\fldinst HYPERLINK \"$1\"}{\\\\fldrslt \\\\plain \\\\f2\\\\b\\\\fs20\\\\cf2 $2}}");

    tmp = tmp.replaceAll("<h3>", "\\\\line{\\\\b\\\\fs30{");
    tmp = tmp.replaceAll("</h3>", "}}\\\\line\\\\line ");
    tmp = tmp.replaceAll("<b>", "{\\\\b{");
    tmp = tmp.replaceAll("</b>", "}}");
    tmp = tmp.replaceAll("<strong>", "{\\\\b{");
    tmp = tmp.replaceAll("</strong>", "}}");
    tmp = tmp.replaceAll("<i>", "{\\\\i{");
    tmp = tmp.replaceAll("</i>", "}}");
    tmp = tmp.replaceAll("&amp;", "&");
    tmp = tmp.replaceAll("&quot;", "\"");
    tmp = tmp.replaceAll("&copy;", "{\\\\'a9}");
    tmp = tmp.replaceAll("&lt;", "<");
    tmp = tmp.replaceAll("&gt;", ">");
    tmp = tmp.replaceAll("<br/?><br/?>", "{\\\\pard \\\\par}\\\\line ");
    tmp = tmp.replaceAll("<br/?>", "\\\\line ");
    tmp = tmp.replaceAll("<BR>", "\\\\line ");
    tmp = tmp.replaceAll("<p[^>]*?>", "{\\\\pard ");
    tmp = tmp.replaceAll("</p>", " \\\\par}\\\\line ");
    tmp = convertSpecialCharsToRtfCodes(tmp);
    return "{\\rtf1\\ansi\\ansicpg0\\uc0\\deff0\\deflang0\\deflangfe0\\fs20{\\fonttbl{\\f0\\fnil Tahoma;}{\\f1\\fnil Tahoma;}{\\f2\\fnil\\fcharset0 Tahoma;}}{\\colortbl;\\red0\\green0\\blue0;\\red0\\green0\\blue255;\\red0\\green255\\blue0;\\red255\\green0\\blue0;}" + tmp + "}";
}

private static String convertSpecialCharsToRtfCodes(String input) {
    char[] chars = input.toCharArray();
    // StringBuilder suffices: the buffer never leaves this method, so no synchronization is needed.
    StringBuilder sb = new StringBuilder();
    int length = chars.length;
    for (int i = 0; i < length; i++) {
        switch (chars[i]) {
            case '’':
                sb.append("{\\'92}");
                break;
            case '`':
                sb.append("{\\'60}");
                break;
            case '€':
                sb.append("{\\'80}");
                break;
            case '…':
                sb.append("{\\'85}");
                break;
            case '‘':
                sb.append("{\\'91}");
                break;
            case '̕':
                sb.append("{\\'92}");
                break;
            case '“':
                sb.append("{\\'93}");
                break;
            case '”':
                sb.append("{\\'94}");
                break;
            case '•':
                sb.append("{\\'95}");
                break;
            case '–':
            case '‒':
                sb.append("{\\'96}");
                break;
            case '—':
                sb.append("{\\'97}");
                break;
            case '©':
                sb.append("{\\'a9}");
                break;
            case '«':
                sb.append("{\\'ab}");
                break;
            case '±':
                sb.append("{\\'b1}");
                break;
            case '„':
                sb.append("\"");
                break;
            case '´':
                sb.append("{\\'b4}");
                break;
            case '¸':
                sb.append("{\\'b8}");
                break;
            case '»':
                sb.append("{\\'bb}");
                break;
            case '½':
                sb.append("{\\'bd}");
                break;
            case 'Ä':
                sb.append("{\\'c4}");
                break;
            case 'È':
                sb.append("{\\'c8}");
                break;
            case 'É':
                sb.append("{\\'c9}");
                break;
            case 'Ë':
                sb.append("{\\'cb}");
                break;
            case 'Ï':
                sb.append("{\\'cf}");
                break;
            case 'Í':
                sb.append("{\\'cd}");
                break;
            case 'Ó':
                sb.append("{\\'d3}");
                break;
            case 'Ö':
                sb.append("{\\'d6}");
                break;
            case 'Ü':
                sb.append("{\\'dc}");
                break;
            case 'Ú':
                sb.append("{\\'da}");
                break;
            case 'ß':
            case 'β':
                sb.append("{\\'df}");
                break;
            case 'à':
                sb.append("{\\'e0}");
                break;
            case 'á':
                sb.append("{\\'e1}");
                break;
            case 'ä':
                sb.append("{\\'e4}");
                break;
            case 'è':
                sb.append("{\\'e8}");
                break;
            case 'é':
                sb.append("{\\'e9}");
                break;
            case 'ê':
                sb.append("{\\'ea}");
                break;
            case 'ë':
                sb.append("{\\'eb}");
                break;
            case 'ï':
                sb.append("{\\'ef}");
                break;
            case 'í':
                sb.append("{\\'ed}");
                break;
            case 'ò':
                sb.append("{\\'f2}");
                break;
            case 'ó':
                sb.append("{\\'f3}");
                break;
            case 'ö':
                sb.append("{\\'f6}");
                break;
            case 'ú':
                sb.append("{\\'fa}");
                break;
            case 'ü':
                sb.append("{\\'fc}");
                break;
            default:
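                if (chars[i] != ' ' && Character.isSpaceChar(chars[i])) {
                    // Any other Unicode space (for example a no-break space) becomes a plain space.
                    System.out.print(".");
                    //sb.append("{\\~}");
                    sb.append(" ");
                } else if (chars[i] == 8218) {
                    // U+201A (single low-9 quotation mark) looks like a comma.
                    System.out.println("Strange comma ... ");
                    sb.append(",");
                } else if (chars[i] > 132) {
                    // No hex mapping above: log it and write an RTF \\uN escape instead of the raw
                    // character. The header sets \\uc0, so no fallback character follows; the
                    // trailing space only terminates the control word.
                    System.err.println("No RTF hex mapping for '" + chars[i] + "', number=" + (int) chars[i]);
                    sb.append("\\u").append((int) (short) chars[i]).append(' ');
                } else {
                    sb.append(chars[i]);
                }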
        }
    }
    return sb.toString();
}
tm1701
  • I see no UTF-8 in any of the linked sites, only Unicode. Which is in my answer already. – AmigoJack Apr 03 '21 at 18:48
  • @AmigoJack - Yes, only Unicode. That's why I gave +1. Why the downvote now? – tm1701 Apr 04 '21 at 11:41
  • Because you asked for UTF-8, not Unicode. And what is in this answer was already in my answer. You neither posted code for your converter, nor did your question state that you were only interested in existing solutions. For these reasons this answer has no value and shouldn't be accepted either. – AmigoJack Apr 04 '21 at 14:29
  • Your answer showed me a direction, that's why I up-voted it. From your answer I still had to work out how to move forward; that was not clear to me. We agree to disagree, no prob. – tm1701 Apr 04 '21 at 18:32
  • Added an example of a converter. – tm1701 Apr 04 '21 at 18:42
  • Now this is much more helpful to readers of this Q&A. – AmigoJack Apr 04 '21 at 20:37