I'm a beginner at Java and I'm trying to understand and explain to myself how this for loop is working. The instructions say it's converting the numeric Unicode equivalent for each letter in each word by using loops.

Based on my understanding, the for loop goes through the entire word using .length(), and the current position is stored as int i, which gets carried down into the parentheses of charAt. charAt returns each character in the word, and then the (int) cast converts it into an int that is stored as finalInt.

So my question is: where does the Unicode number come from? How does it know that it's Unicode?

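String word1 = "Hello"; // sample value so the snippet compiles
int finalInt;

for (int i = 0; i < word1.length(); i++) {
    // charAt(i) returns a char; the cast exposes its numeric value
    finalInt = (int) word1.charAt(i);
}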
  • `char` is a numerical type that happens to be representable as a character. Converting to `int` just exposes the underlying numerical value. – shmosel May 28 '19 at 21:18
  • Ahh, I think that adds a bit of clarification! I will have to remember that! Thanks! – JackieTowns May 28 '19 at 21:43

3 Answers

Java Character is based on Unicode

Character information is based on the Unicode Standard, version 6.0.0.

https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html

Besides, char and int can be converted to each other. Please refer to: Convert int to char in java
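To make the conversion concrete, here is a minimal sketch. The class and method names (`CharIntConversion`, `toNumber`, `toChar`) are made up for illustration and are not part of any Java API.

```java
class CharIntConversion {
    // Widening a char to int exposes its numeric (UTF-16 code unit) value.
    static int toNumber(char c) {
        return c; // implicit widening; an explicit (int) cast is equivalent
    }

    // Narrowing an int back to char needs an explicit cast.
    static char toChar(int value) {
        return (char) value;
    }

    public static void main(String[] args) {
        System.out.println(toNumber('A')); // prints 65
        System.out.println(toChar(97));    // prints a
    }
}
```

Nothing is looked up in a table at runtime here: the char already holds the number, and the cast only changes the type the compiler sees.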

Qingfei Yuan
  • Thanks so much for the reference material. I will look them over. I've been trying to find the research that covers this but apparently I didn't know enough to know what I was looking for. Thanks so much!! Do you know if there is a list of what each character represents in Unicode? I've come across a lot of lists but I don't know which one is correct. – JackieTowns May 28 '19 at 21:42
  • @JackieTowns Wikipedia has a [list of Unicode characters](https://en.m.wikipedia.org/wiki/List_of_Unicode_characters), but you may find other web sites more accessible. The official latest list is kept at the site for the Unicode consortium: http://www.Unicode.org/. There are about 138,000 characters defined so far, and growing. If you are a Mac user, download the [UnicodeChecker](https://earthlingsoft.net/UnicodeChecker/) app. – Basil Bourque May 28 '19 at 22:41

Check the ASCII table - http://www.asciitable.com/
Your code is transforming a char (last column) into its numerical representation (first column).
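As a rough illustration of how one character maps to the numeric columns of such a table, the sketch below prints the decimal, hexadecimal, and octal forms of the same value. The names `AsciiColumns` and `describe` are hypothetical helpers, not library code.

```java
class AsciiColumns {
    // Formats one character together with its value in three number bases.
    static String describe(char c) {
        int value = c; // the same number, just typed as int
        return c + " dec=" + value
                + " hex=" + Integer.toHexString(value)
                + " oct=" + Integer.toOctalString(value);
    }

    public static void main(String[] args) {
        String word = "Hi"; // sample input, chosen for this sketch
        for (int i = 0; i < word.length(); i++) {
            System.out.println(describe(word.charAt(i)));
        }
        // prints:
        // H dec=72 hex=48 oct=110
        // i dec=105 hex=69 oct=151
    }
}
```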

AdrianM
  • Wow!! This is extremely helpful!! I've been looking for this kind of thing!! Can you explain why the numbers are listed as DEC? Does that mean decimal or something? And why is it listed like that if you don't mind? And I know I might be repeating but where does unicode come into all of this? I can't seem to link up Unicode and ASCII.... – JackieTowns May 28 '19 at 22:13
  • @JackieTowns Yes, decimal (base 10), hexadecimal (base 16), octal (base 8), and HTML character entity. And Unicode is a superset of ASCII. See Wikipedia for more info on all that. – Basil Bourque May 28 '19 at 22:21
  • @JackieTowns And read this: [*The absolute minimum every developer absolutely positively must know about Unicode and character sets (no excuses!)*](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) – Basil Bourque May 28 '19 at 22:26
  • @BasilBourque Thanks so much for all the reference material!! – JackieTowns May 28 '19 at 22:30
  • That reference is for those who talk about ASCII or ANSI and could be confused or unsure how to properly use them or how to learn Unicode. Java users need only read the Java documentation to learn that a `char` is a UTF-16 code unit and that UTF-16 is one of several character encodings of the Unicode character set. We all know that we're not using ASCII, right? – Tom Blodget May 28 '19 at 23:10
  • enough to explain the concept I assume ;) – AdrianM May 28 '19 at 23:14

Using Java, how does this charAt() turn a string into an int?

The Java String models a string as an array of char (not int) values. So charAt is just indexing the (conceptual) array. So you can say that the string is integer values ... representing characters.

(Under the hood, different versions of Java actually use a variety of implementation approaches. In some versions, the actual representation is not a char[]. But that is all hidden from sight ... and you can safely ignore it.)

So my question is: where does the Unicode number come from?

It comes from the code that created the String; i.e. the code that called new String(...).

  • If the String is constructed from a char[], it is assumed that the characters in the array are UTF-16 codeunits in a sequence that is a valid UTF-16 representation.

  • If the String is constructed from a byte[], the byte sequence is decoded from some specified or implied encoding. If you supply an encoding (e.g. Charset) that will be used. Otherwise the application's default encoding is used. Either way, the decoder is responsible for producing valid Unicode.

Sometimes these things break. For instance, if your application provides a byte[] encoded in one encoding and tells the String constructor it is a different encoding, you are liable to get nonsense Unicode in the String. This is often called mojibake.
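A small sketch of both cases, using the `String(byte[], Charset)` constructor. The class name `DecodingDemo` and its helper methods are invented for this example; the sample character is é (U+00E9), written as an escape so the source file's own encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

class DecodingDemo {
    // Decodes the bytes with the encoding they were actually written in.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decodes the same bytes with the wrong encoding, which yields mojibake.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Lists the char values of a String in hex, separated by spaces.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "\u00E9".getBytes(StandardCharsets.UTF_8); // two bytes: C3 A9
        System.out.println(hex(decodeUtf8(bytes)));   // prints e9
        System.out.println(hex(decodeLatin1(bytes))); // prints c3 a9
    }
}
```

With the correct charset the two bytes become the single char U+00E9. With the wrong one, each byte becomes its own char, and the String faithfully stores both.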

How does it know that it's unicode?

String is designed to be Unicode based.

The code that needs to know is the code that is forming the strings from other things. The String class just assumes that its content is meaningful. (At one level ... it doesn't care. You can populate a String with malformed UTF-16 or total nonsense. The String will faithfully record and reproduce the nonsense.)


Having said that, there is an important mistake in your code.

The charAt method does not return a Unicode codepoint. A String is primarily modeled as a sequence of UTF-16 codeunits, and charAt returns those.

Unicode codepoints are actually numbers in the range 0x0 to 0x10FFFF. That doesn't fit into a char ... which is limited to 0x0 to 0xFFFF.

UTF-16 encodes Unicode codepoints into 16 bit codeunits. So, the value returned by charAt represents either an entire Unicode codepoint (for codepoints in the range 0x0 to 0xFFFF) or the top or bottom part of a codepoint (for codepoints larger than 0xFFFF).
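To see the two halves, the sketch below applies the same charAt loop to a single codepoint above 0xFFFF. The sample is U+1F600 (an emoji), written as its two UTF-16 escapes; `SurrogateDemo` and `codeUnits` are names made up for this illustration.

```java
class SurrogateDemo {
    // One codepoint, U+1F600, stored as two UTF-16 codeunits.
    static final String GRINNING_FACE = "\uD83D\uDE00";

    // Collects what charAt returns at every index, as int values.
    static int[] codeUnits(String s) {
        int[] units = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            units[i] = s.charAt(i);
        }
        return units;
    }

    public static void main(String[] args) {
        for (int unit : codeUnits(GRINNING_FACE)) {
            System.out.println(Integer.toHexString(unit) + " "
                    + Character.isSurrogate((char) unit));
        }
        // prints:
        // d83d true
        // de00 true
    }
}
```

Neither d83d nor de00 is the codepoint 1f600; each is only one half of the pair that encodes it.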

If you want String to return (complete) Unicode codepoints, you need to use String.codePointAt. But it is important to read the javadocs carefully to understand how the method should be used. (It may be simpler to use the String.codePoints() method.)

At any rate, what this means is that your code is NOT assigning a Unicode codepoint to finalInt in all cases. It works for Unicode characters in the BMP (code plane zero) but not the higher code planes. It will break for the Unicode codepoints for Emojis, for example.
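One possible way to rewrite the loop so that it yields whole codepoints is sketched below, once with String.codePoints() (available since Java 8) and once with codePointAt plus Character.charCount. The class name, the method names, and the sample word are assumptions for illustration only.

```java
class CodePointLoop {
    // Simplest form: let the String produce its codepoints as a stream.
    static int[] codePointsOf(String word) {
        return word.codePoints().toArray();
    }

    // Index-based form: advance by 1 or 2 chars depending on the codepoint.
    static int[] codePointsByIndex(String word) {
        int[] result = new int[word.codePointCount(0, word.length())];
        int n = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // 2 for codepoints above 0xFFFF
        }
        return result;
    }

    public static void main(String[] args) {
        String word1 = "Hi\uD83D\uDE00"; // "Hi" followed by U+1F600
        for (int finalInt : codePointsOf(word1)) {
            System.out.printf("U+%04X%n", finalInt);
        }
        // prints:
        // U+0048
        // U+0069
        // U+1F600
    }
}
```

Note that the index-based loop has no i++ in its header: the step size depends on the codepoint just read.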

Stephen C