Problem

How can I get only the first 5 characters of a string when its encoding may be "UTF-8", "UTF-16" or "ASCII"?

Note: some of the test inputs contain emoji.

Code

    public String truncate(String input) {
        if (input.codePointCount(0, input.length()) > 5)
        {
            return input.substring(0, input.offsetByCodePoints(0, 5));
        }

        return input;
    }

For example:

Input: Bärteppich

Expected Output: BГ¤rte also means Bärte

Actual Output: Bärt

Input: brühe

Expected Output: brГјhe also means brühe

Actual Output: brГјh

  • Why do you **intentionally** want to get [Mojibake](https://en.wikipedia.org/wiki/Mojibake) out of Strings? `BГ¤rte` doesn't "also mean" `Bärte`, it means you don't handle encoding correctly. – Kayaman Jul 01 '20 at 14:14
  • I assume the odd rendering is due to printing and not a corrupt String. Does a char-by-char print of the input and output show the expected values? E.g. try adding this before and after: `System.out.println("input "+Arrays.toString(input.toCharArray()));` – DuncG Jul 01 '20 at 14:41
  • @DuncG even the question implies that they're reading different encodings but treating them as the same (possibly the platform default), which is the whole root of their problem. – Kayaman Jul 01 '20 at 14:47
  • @Kayaman It looks likely, yes, but just because the chars don't print to the terminal does not imply the input is corrupt, which is why I asked. E.g. it's possible to read Japanese database text into UK-based machines as String and write it 100% safely, but there was no way I could print that text to the terminal on my machine, as it would show as garbled text on a machine with an ASCII default. – DuncG Jul 01 '20 at 14:58
  • @DuncG I know. I also know from the dozens of encoding-related questions I've gone through and tried to answer that people just don't understand encoding. They refuse to accept that their input data is broken and that the code should be fixed where the data is read in; instead they want hacky code to work on the broken data. – Kayaman Jul 01 '20 at 15:05
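
The garbled outputs in the question are consistent with UTF-8 bytes having been decoded as windows-1251. That is an inference from the characters shown, not something the asker confirmed. Below is a minimal sketch of that effect, and of decoding with an explicit charset instead. The class name `MojibakeDemo` and its methods are hypothetical, and the availability of the `windows-1251` charset is an assumption about the JDK in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Assumption: windows-1251 is the charset that was wrongly applied.
    // It ships with common JDK builds but is not guaranteed by the Java spec.
    static final Charset WRONG = Charset.forName("windows-1251");

    // Simulates the bug: encode as UTF-8, then decode with the wrong charset.
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), WRONG);
    }

    // The fix belongs here: name the real charset when the bytes are read.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "br\u00FChe";          // "bruehe" spelled with u-umlaut, 5 chars
        String garbled = misdecode(original);    // the umlaut becomes two chars
        System.out.println(original.length());   // 5
        System.out.println(garbled.length());    // 6
        // Cutting the garbled text after 5 chars drops the final 'e'.
        System.out.println(garbled.substring(0, 5).length());
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(utf8).equals(original)); // true
    }
}
```

With the input decoded correctly, the umlaut is a single `char`, so a five-character cut of that word keeps all of it.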

2 Answers

First, for all purposes a Java String is always UTF-16, although since Java 9 it may be something else internally.

To achieve what you want ("Get only the first five characters from the input String!"), it should look like this:

    public String truncate( String input )
    {
        var retValue = (input != null) && (input.length() > 5)
            ? input.substring( 0, 5 )
            : input;

        return retValue;
    }

There should be no need to play around with codepoints for this particular task.

Unfortunately, this is not fully correct.

It works for the String s = "Dies ist ein langer String";.

It does not work for s = "12345678";.
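
A small sketch of one way `substring( 0, 5 )` can go wrong: if the fifth `char` is the first half of a surrogate pair, the result ends in an unpaired surrogate. The sample string and the class name `SubstringSplitDemo` are assumptions made for illustration, not the input used in this answer.

```java
class SubstringSplitDemo {
    // True if the last char of s is a high surrogate with nothing after it.
    static boolean endsWithLoneHighSurrogate(String s) {
        return !s.isEmpty() && Character.isHighSurrogate(s.charAt(s.length() - 1));
    }

    public static void main(String[] args) {
        // Four digits, then U+1F600 (stored as two chars), then "xyz".
        String s = "1234\uD83D\uDE00xyz";
        String cut = s.substring(0, 5);
        System.out.println(cut.length());                   // 5
        System.out.println(endsWithLoneHighSurrogate(cut)); // true
    }
}
```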

Unfortunately, String.offsetByCodePoints() is of no help here; when using the original code from the question, like this:

    public String truncate( String input )
    {
        int x = 5;
        if( input.codePointCount( 0, input.length() ) > 5 )
        {
            return input.substring( 0, input.offsetByCodePoints( 0, x ) );
        }

        return input;
    }

the correct value for x depends on the contents of the String.

That's because some emoji (a flag, for instance) count as two code points, while others are just one, and both take more than one char.
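
To make the counting concrete, here is a hedged sketch comparing `length()` with `codePointCount()` for two sample emoji. The regional-indicator pair is the one quoted in the comments; the single-code-point emoji (U+1F600) and the class name `CodePointCountDemo` are assumptions chosen for illustration.

```java
class CodePointCountDemo {
    // Number of code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String flag = "\uD83C\uDDE6\uD83C\uDDE8"; // two regional indicators
        String face = "\uD83D\uDE00";             // one supplementary code point
        System.out.println(flag.length());     // 4
        System.out.println(codePoints(flag));  // 2
        System.out.println(face.length());     // 2
        System.out.println(codePoints(face));  // 1
    }
}
```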

So this one failed, too:

    public String truncate( String input )
    {
        var retValue = input;
        if( input.codePointCount( 0, input.length() ) > 5 )
        {
            int [] codepoints = input.codePoints().limit( 5 ).toArray();
            retValue = new String( codepoints, 0, 5 );
        }
        return retValue;
    }

And here I am stuck …
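
One possible way forward, which is not part of the attempts above, is to count user-perceived characters with `java.text.BreakIterator` rather than chars or code points. This is only a sketch: the class `GraphemeTruncate` and its `max` parameter are hypothetical, and how multi-code-point emoji such as flags are grouped appears to depend on the JDK version, so that part should be verified on the target runtime.

```java
import java.text.BreakIterator;
import java.util.Locale;

class GraphemeTruncate {
    // Keeps at most max user-perceived characters, as BreakIterator sees them.
    static String truncate(String input, int max) {
        if (input == null) {
            return null;
        }
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(input);
        int end = it.first();
        for (int i = 0; i < max; i++) {
            int next = it.next();
            if (next == BreakIterator.DONE) {
                return input; // fewer than max characters, nothing to cut
            }
            end = next;
        }
        return input.substring(0, end);
    }

    public static void main(String[] args) {
        // The 'e' followed by a combining acute accent stays together.
        System.out.println(truncate("abcde\u0301xy", 5).length()); // 6
        // A surrogate pair is never split in the middle.
        System.out.println(truncate("1234\uD83D\uDE00xyz", 5).length()); // 6
    }
}
```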

tquadrat
  • It's UTF-16 only [pre-Java9](https://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16). – Kayaman Jul 01 '20 at 14:49
  • @Kayaman: You are right, but for the purpose of this question, it is transparent whether it is UTF-16 or UTF-8 or ISO-8859-1 – not to mention that the conversion from one to the other is transparent, too. – tquadrat Jul 01 '20 at 14:54
  • @Kayaman I disagree. The String class, like every good OO class, encapsulates its state, so internal representation is irrelevant. As far as any program is concerned, a String is comprised of UTF-16 `char` values, or 32-bit Unicode codepoints. – VGR Jul 01 '20 at 15:50
  • @Kayaman Ah, I overlooked that tquadrat wrote that internally it’s UTF-16 until you pointed that out. Fair enough. – VGR Jul 01 '20 at 15:57
  • @tquadrat Your answer is nicer now, and your truncate() returns the same result as the one in my answer for input `s = "1234\uD83C\uDDE6\uD83C\uDDE8"`: both make `truncate(s)` return `"1234\uD83C\uDDE6"`, which is a String of length 6 with only the first of the pair. It looks like String `codePointCount()` is only intended for the 2-char definitions of char+surrogate and not for cases like this flag, which uses 4 chars. In my day a byte was a char! – DuncG Jul 02 '20 at 08:54

If the string is valid and contains supplementary code points, shouldn't the count passed to offsetByCodePoints be 5, not 6, so that the string is split after the fifth code point?

    public String truncate(String input) {
        if (input.codePointCount(0, input.length()) > 5)
        {
            input = input.substring(0, input.offsetByCodePoints(0, 5));
        }

        return input;
    }
DuncG
  • See https://stackoverflow.com/questions/8521226/what-java-function-offsetbycodepoints-really-takes-as-an-argument for an explanation of `String.offsetByCodePoints()`. `String.length()` counts the characters in the String, and `String.substring()` returns the characters – and that is the requirement for the function. No need for playing around with codepoints here! – tquadrat Jul 01 '20 at 16:51
  • I will admit to never having used an emoji ever! But my understanding is that this contains some: `String s = "1234\uD83C\uDDE6\uD83C\uDDE8"` and `s.offsetByCodePoints(0, 5)` returns 6, so if this string is to be truncated at the 5th code point then the call is `s.substring(0, s.offsetByCodePoints(0, 5))` and that string is then 6 chars long. Now I have more reason to avoid them. – DuncG Jul 01 '20 at 17:03
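
The numbers in the last comment can be checked with a short sketch. The class name `OffsetDemo` and the helper `firstCodePoints` are hypothetical; the sample string is the one quoted in the comment.

```java
class OffsetDemo {
    // Returns the prefix of s that holds its first n code points.
    // Assumes s contains at least n code points.
    static String firstCodePoints(String s, int n) {
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        String s = "1234\uD83C\uDDE6\uD83C\uDDE8";
        System.out.println(s.offsetByCodePoints(0, 5)); // 6
        String cut = firstCodePoints(s, 5);
        System.out.println(cut.length());                        // 6
        System.out.println(cut.codePointCount(0, cut.length())); // 5
    }
}
```

The result holds five code points but only the first half of the regional-indicator pair, which is why truncating by code points still splits the flag.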