0

I have to count the characters in a given String, and I store the counts in a Map<Character, Long>. The code does not work with some special symbols like "two hearts". When I write such a symbol as a character literal, I get the compiler error "Too many characters in character literal" or similar. Why does this happen and how to fix it ?

Here is some rough code to demonstrate the problem. This is not the full code.

import java.util.HashMap;
import java.util.Map;

public class Demo {
    public static void main(String[] args) {
        String twoHeartsStr = "💕";
        Map<Character, Long> output = new HashMap<>();
        output.put(twoHeartsStr.charAt(0), 1L);

        // Compiler error:
        // IntelliJ IDE compiler : Too many characters in character literal.
        // java: unclosed character literal.
        Map<Character, Long> expectedOutput = Map.of('💕', 1L);
        System.out.println("Maps are equal : " + output.equals(expectedOutput));
    }
}

EDIT: Updated solution after getting answers to this question.

import java.util.HashMap;
import java.util.Map;

public class Demo {
    public static void main(String[] args) {
        String twoHeartsStr = "💕"; // Try #, alphabet, number etc.
        Map<String, Long> output = new HashMap<>();
        int codePoint = twoHeartsStr.codePointAt(0);
        String charValue = String.valueOf(Character.toChars(codePoint)); // Length = 2 for twoHearts.
        output.put(charValue, 1L);

        Map<String, Long> expectedOutput = Map.of("💕", 1L);
        System.out.println("Maps are equal : " + output.equals(expectedOutput)); // true.
    }
}
Anshul Sharma
MasterJoe
  • Did you really put an emoji into code? – Ecto Jul 30 '20 at 00:53
  • @Ecto - yes. I expect to get strings with such emojis. That will break the character counter code. I want to prevent it from breaking. – MasterJoe Jul 30 '20 at 00:54
  • @Ecto It is perfectly allowed; Java programs are written in Unicode according to the Java Language Specification. https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1 – kaya3 Jul 30 '20 at 00:55
  • Use `String` instead. – Unmitigated Jul 30 '20 at 00:55
  • @hev1 - ok, but String.charAt(index) returns a char. Which method can return a String representation of a character instead of char ? – MasterJoe Jul 30 '20 at 00:58
  • The `char` type is obsolete, unable to represent even half of the over 140,000 characters defined in Unicode. Use Unicode [code point](https://en.wikipedia.org/wiki/Code_point) integer numbers instead. Read [*The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)*](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) – Basil Bourque Jul 30 '20 at 01:09
  • @BasilBourque - thanks. Btw, the article is from 2003. I wonder if it is up to date. I did a few quick searches on the page but could not find any mention of surrogate pairs or pair. I'll have to read to see if it also addresses the issue I mentioned in this question. – MasterJoe Jul 30 '20 at 01:48

2 Answers

4

By Java's definition, "💕" is not one character; it is two:

>>> "💕".length()
2 (int)

So '💕' is a syntax error, because char is a 16-bit integer type, and the Unicode symbol is not represented by just one 16-bit integer value.

The solution to your problem is to use strings instead.
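
As a rough sketch of what "use strings" can look like in practice, the snippet below extracts the first Unicode symbol of a string as a String rather than a char. It relies on the standard `String.offsetByCodePoints` and `String.substring` methods; the class name `FirstSymbolDemo` and the helper `firstSymbol` are made up for illustration, and the helper assumes a non-empty input.

```java
final class FirstSymbolDemo {
    // Hypothetical helper: returns the first Unicode symbol (code point) of s
    // as a String. Assumes s is non-empty; offsetByCodePoints throws otherwise.
    static String firstSymbol(String s) {
        return s.substring(0, s.offsetByCodePoints(0, 1));
    }

    public static void main(String[] args) {
        // "\uD83D\uDC95" is U+1F495 TWO HEARTS written as its two UTF-16 code units.
        String twoHearts = "\uD83D\uDC95";

        System.out.println(twoHearts.length());                              // 2 char values
        System.out.println(twoHearts.codePointCount(0, twoHearts.length())); // 1 code point
        System.out.println(firstSymbol(twoHearts).equals(twoHearts));        // true
        System.out.println(firstSymbol("abc"));                              // a
    }
}
```

The returned String can then be used directly as a key in a Map<String, Long>.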

kaya3
  • Thanks. But, charAt gives a char. Which String method will give me a String instead of char, i.e. something like public String charAtPlus(index). – MasterJoe Jul 30 '20 at 01:02
  • You can use the `substring` method to return the string between two (character) indices; `s.substring(0, s.offsetByCodePoints(0, 1))` should return a string containing the first Unicode symbol of `s`. – kaya3 Jul 30 '20 at 01:04
  • No, incorrect, that emoji character is one character, not two. The problem is that the `char` type is obsolete, and incapable of representing that single character. – Basil Bourque Jul 30 '20 at 01:06
  • @BasilBourque Did you misread the part where I said "by Java's definition"? The [documentation](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#length--) for the `String.length` method says it returns *"the length of the sequence of characters represented by this object."* – kaya3 Jul 30 '20 at 01:07
1

The code does not work with some special symbols like "two hearts"... Why does this happen

The Java char type is a 16-bit value. In the early days of Unicode, this was sufficient to store all the code-point values, but that quickly changed. The Unicode code space now allows for over a million code points, and those beyond U+FFFF (the "supplementary characters") must be represented in UTF-16 with a "surrogate pair" of two 16-bit values.

From the documentation:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

Moving on:

twoHeartsStr.charAt(0)

This will give you the first half of the surrogate pair, which is not a valid character on its own despite being a valid char value (char is fundamentally an integer type rather than a textual type).
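
To see this concretely, here is a small sketch that inspects the two char values of the two-hearts string using standard methods of `Character` and `String`. The class name `SurrogateDemo` and the helper `startsSurrogatePair` are illustrative names, not part of any library.

```java
final class SurrogateDemo {
    // Hypothetical helper: true if the char at index is a high surrogate
    // that is immediately followed by a low surrogate.
    static boolean startsSurrogatePair(String s, int index) {
        return Character.isHighSurrogate(s.charAt(index))
                && index + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(index + 1));
    }

    public static void main(String[] args) {
        // U+1F495 TWO HEARTS, written as its two UTF-16 code units.
        String twoHearts = "\uD83D\uDC95";

        char first = twoHearts.charAt(0);
        System.out.println(Integer.toHexString(first));                    // d83d
        System.out.println(Character.isHighSurrogate(first));              // true
        System.out.println(startsSurrogatePair(twoHearts, 0));             // true

        // codePointAt combines both halves into the real code point.
        System.out.println(Integer.toHexString(twoHearts.codePointAt(0))); // 1f495
    }
}
```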

...and how to fix it ?

You can use 32-bit integers (i.e., int or Integer) to represent the values, and the codePointAt method to extract them from the string. Note, however, that when you iterate over the string, you'll still need to skip over the indices corresponding to the second halves of the pairs.

You still won't be able to store the "supplementary characters" in a char, so you won't be able to write them in char literals. So to look up the two-hearts character in the resulting histogram (or to populate your reference data for testing), you'll want to get the integer code-point value from a string with that symbol.
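
A minimal sketch of that approach, assuming a Map<Integer, Long> keyed by code point is acceptable for your counter (the class name `CodePointHistogram` and the method `count` are made up for this example):

```java
import java.util.HashMap;
import java.util.Map;

final class CodePointHistogram {
    // Counts occurrences of each Unicode code point in s.
    static Map<Integer, Long> count(String s) {
        Map<Integer, Long> counts = new HashMap<>();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            counts.merge(codePoint, 1L, Long::sum);
            // Advance by 1 for a BMP character, by 2 for a surrogate pair,
            // so the second half of a pair is never visited on its own.
            i += Character.charCount(codePoint);
        }
        return counts;
    }

    public static void main(String[] args) {
        // U+1F495 TWO HEARTS, written as its two UTF-16 code units.
        String twoHearts = "\uD83D\uDC95";
        Map<Integer, Long> histogram = count("a" + twoHearts + "b" + twoHearts + "a");

        // Look up the symbol via its code point, taken from a string.
        int key = twoHearts.codePointAt(0);
        System.out.println(histogram.get(key));       // 2
        // Cast to int so the key is boxed as Integer, not Character.
        System.out.println(histogram.get((int) 'a')); // 2

        // Convert a code point back to a String for display.
        System.out.println(new String(Character.toChars(key)).equals(twoHearts)); // true
    }
}
```

On Java 8 and later, `s.codePoints()` returns an IntStream of code points and handles the skipping for you, if you prefer streams to the explicit loop.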

Karl Knechtel
  • Also keep in mind because of this that the `length` method of strings does *not* count the characters, but the `char`s needed to represent them. This sort of thing - the clunky, un-modern string handling - is one of the many reasons I abandoned Java. – Karl Knechtel Jul 30 '20 at 01:05
  • So, which language do you prefer just for the string handling features ? – MasterJoe Jul 30 '20 at 05:06