How to get a substring of a given length for special chars like Chinese

Question

For example, I can get 80 chars with ${description?substring(0, 80)} if description is in English, but for Chinese chars I only get about 10 chars, and there is always a garbage char at the end.

How can I get 80 chars for any language?

That only happens when you have non-BMP characters, but AFAIK all the *commonly* used Chinese characters are inside the BMP. How frequent is this problem? I mean, Java doesn't support those characters well either, which is suspicious. — ddekany, Nov 12 '14 at 10:14

score 3 · Accepted Answer · edited May 23 '17 at 12:31

FreeMarker relies on String#substring for the actual substring calculation, which works on UTF-16 char indexes and doesn't work well with Chinese characters. Instead, one should use Unicode code points. Based on this post and FreeMarker's own substring built-in, I hacked together a FreeMarker TemplateMethodModelEx implementation that operates on code points:

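import java.util.List;

import freemarker.template.SimpleScalar;
import freemarker.template.TemplateMethodModelEx;
import freemarker.template.TemplateModelException;
import freemarker.template.TemplateNumberModel;
import freemarker.template.TemplateScalarModel;

public class CodePointSubstring implements TemplateMethodModelEx {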

    @Override
    public Object exec(List args) throws TemplateModelException {
        int argCount = args.size(), left = 0, right = 0;
        String s = "";
        if (argCount != 3) {
            throw new TemplateModelException(
                    "Error: Expecting 1 string and 2 numerical arguments here");
        }
        try {
            TemplateScalarModel tsm = (TemplateScalarModel) args.get(0);
            s = tsm.getAsString();
        } catch (ClassCastException cce) {
            String mess = "Error: Expecting string argument here";
            throw new TemplateModelException(mess);
        }

        try {
            TemplateNumberModel tnm = (TemplateNumberModel) args.get(1);
            left = tnm.getAsNumber().intValue();

            tnm = (TemplateNumberModel) args.get(2);
            right = tnm.getAsNumber().intValue();

        } catch (ClassCastException cce) {
            String mess = "Error: Expecting numerical argument here";
            throw new TemplateModelException(mess);
        }
        return new SimpleScalar(getSubstring(s, left, right));
    }

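    // start and end are code point indexes (end exclusive), not UTF-16 char offsets
    private String getSubstring(String s, int start, int end) {
        int[] codePoints = new int[Math.max(0, end - start)];
        int length = s.length();
        int i = 0;      // number of code points copied so far
        int index = 0;  // index of the current code point within s
        for (int offset = 0; offset < length && i < codePoints.length; index++) {
            int codepoint = s.codePointAt(offset);
            if (index >= start) {
                codePoints[i] = codepoint;
                i++;
            }
            // advance by 1 char for BMP code points, 2 for surrogate pairs
            offset += Character.charCount(codepoint);
        }
        return new String(codePoints, 0, i);
    }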
}

You can put an instance of it into your data model root, e.g.

SimpleHash root = new SimpleHash();
root.put("substring", new CodePointSubstring());
template.process(root, ...);

and use the custom substring method in FTL:

${substring(description, 0, 80)}

I tested it with non-Chinese characters, which still worked, but so far I haven't tried it with Chinese characters. Maybe you want to give it a try.
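
For anyone who wants to check the code point logic outside of FreeMarker, here is a minimal sketch that uses only the standard String#codePointCount and String#offsetByCodePoints methods. The class name CodePointSubstringSketch and the choice to clamp out-of-range bounds (rather than throw) are my own assumptions, not something FreeMarker provides:

```java
final class CodePointSubstringSketch {

    // Returns the code points in [start, end) of s. Both bounds are clamped
    // to the valid range, so asking for 80 code points of a short string
    // simply returns the whole string.
    static String substring(String s, int start, int end) {
        int total = s.codePointCount(0, s.length());
        int from = Math.max(0, Math.min(start, total));
        int to = Math.max(from, Math.min(end, total));
        // Translate code point indexes into UTF-16 char offsets.
        int fromIdx = s.offsetByCodePoints(0, from);
        int toIdx = s.offsetByCodePoints(fromIdx, to - from);
        return s.substring(fromIdx, toIdx);
    }

    public static void main(String[] args) {
        // U+20000 is a non-BMP CJK ideograph, stored as a surrogate pair.
        String s = "a" + new String(Character.toChars(0x20000)) + "b";
        System.out.println(s.length());                      // UTF-16 chars
        System.out.println(s.codePointCount(0, s.length())); // code points
        System.out.println(substring(s, 0, 2).length());
    }
}
```

With a string like the one in main, plain s.substring(0, 2) would cut the surrogate pair in half, which would explain a garbage character at the end; the sketch always cuts on code point boundaries.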

You say above that Java's string methods don't work well with Chinese characters. Actually the affected characters are only the non-BMP characters (which BTW also include some uncommon mathematical symbols and such), but aren't those uncommon in Chinese? (China easily has the biggest FreeMarker user base, leaving the US behind, so I'm surprised that I have never heard about this issue.) — ddekany, Nov 12 '14 at 10:10
