> Using Java, how does this `charAt()` turn a string into an int?
The Java `String` models a string as an array of `char` (not `int`) values. So `charAt` is just indexing the (conceptual) array, and you can say that the string is integer values ... representing characters.
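For instance (a minimal sketch; the class name is mine), a `char` is already a 16-bit integer type, so the value `charAt` returns widens to an `int` without a cast:

```java
public class CharAtDemo {
    public static void main(String[] args) {
        String s = "ABC";
        char c = s.charAt(0); // 'A' : indexes the (conceptual) char array
        int n = c;            // implicit widening conversion: 65
        System.out.println(c + " -> " + n); // prints: A -> 65
    }
}
```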
(Under the hood, different versions of Java actually use a variety of implementation approaches. In some versions, the actual representation is not a `char[]`. But that is all hidden from sight ... and you can safely ignore it.)
> So my question is, where does the Unicode number come from?
It comes from the code that created the `String`; i.e. the code that called `new String(...)`.

If the `String` is constructed from a `char[]`, it is assumed that the characters in the array are UTF-16 code units in a sequence that is a valid UTF-16 representation.
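A small sketch of that (class name is made up): the `char[]` elements are taken verbatim as UTF-16 code units, surrogate pairs included:

```java
public class FromCharArrayDemo {
    public static void main(String[] args) {
        // '\uD83D' '\uDE00' is the surrogate pair for U+1F600 (a grinning-face emoji)
        char[] units = { 'H', 'i', ' ', '\uD83D', '\uDE00' };
        String s = new String(units);
        System.out.println(s.length());                      // 5 code units
        System.out.println(s.codePointCount(0, s.length())); // 4 codepoints
    }
}
```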
If the `String` is constructed from a `byte[]`, the byte sequence is decoded using some specified or implied encoding. If you supply an encoding (e.g. a `Charset`), that will be used. Otherwise the platform's default encoding is used. Either way, the decoder is responsible for producing valid Unicode.
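For example (a minimal sketch, class name mine), decoding a UTF-8 byte sequence with an explicit `Charset`:

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 }; // UTF-8 encoding of "é" (U+00E9)
        String s = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s); // é
    }
}
```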
Sometimes these things break. For instance, if your application provides a `byte[]` encoded in one encoding but tells the `String` constructor it is a different encoding, you are liable to get nonsense Unicode in the `String`. This is often called mojibake.
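A quick way to see mojibake for yourself (illustrative class name): encode with one charset and decode with another:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "café".getBytes(StandardCharsets.UTF_8);
        // Wrong: the bytes are UTF-8, but we claim they are Latin-1.
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // cafÃ© ... mojibake
    }
}
```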
> How does it know that it's Unicode?
`String` is designed to be Unicode based.

The code that needs to know is the code that is forming the strings from other things. The `String` class just assumes that its content is meaningful. (At one level ... it doesn't care. You can populate a `String` with malformed UTF-16 or total nonsense. The `String` will faithfully record and reproduce the nonsense.)
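You can demonstrate that with a lone surrogate, which is malformed UTF-16 on its own (sketch, class name mine):

```java
public class LoneSurrogateDemo {
    public static void main(String[] args) {
        // A lone high surrogate is not valid UTF-16, but String stores it anyway.
        String bad = new String(new char[] { '\uD83D' });
        System.out.println(bad.length());        // 1
        System.out.println((int) bad.charAt(0)); // 55357 (0xD83D) ... faithfully reproduced
    }
}
```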
Having said that, there is an important mistake in your code.

The `charAt` method does not return a Unicode codepoint. A `String` is primarily modeled as a sequence of UTF-16 code units, and `charAt` returns those.

Unicode codepoints are actually numbers in the range 0x0 to 0x10FFFF. That doesn't fit into a `char` ... which is limited to 0x0 to 0xFFFF.

UTF-16 encodes Unicode codepoints into 16-bit code units. So, the value returned by `charAt` represents either an entire Unicode codepoint (for codepoints in the range 0x0 to 0xFFFF) or the top or bottom half of a codepoint, i.e. a surrogate (for codepoints larger than 0xFFFF).
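Here is that behavior in a short (made-up) demo: for a BMP character `charAt` gives the whole codepoint, but for an emoji it gives one surrogate at a time:

```java
public class CodeUnitDemo {
    public static void main(String[] args) {
        String bmp = "A";              // U+0041 fits in one code unit
        String emoji = "\uD83D\uDE00"; // U+1F600 needs two code units
        System.out.printf("%04X%n", (int) bmp.charAt(0));   // 0041 : whole codepoint
        System.out.printf("%04X%n", (int) emoji.charAt(0)); // D83D : high surrogate only
        System.out.printf("%04X%n", (int) emoji.charAt(1)); // DE00 : low surrogate only
    }
}
```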
If you want a `String` to give you (complete) Unicode codepoints, you need to use `String.codePointAt`. But it is important to read the javadocs carefully to understand how the method should be used: it indexes by code unit, not by codepoint. (It may be simpler to use the `String.codePoints()` method.)
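A sketch of both approaches (class name is mine). Note that `codePointAt` indexes by code unit, so you advance the index by `Character.charCount(cp)`:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String s = "A\uD83D\uDE00B"; // A, U+1F600, B : 4 code units, 3 codepoints
        // codePointAt indexes by code unit, so step by charCount, not by 1:
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);  // U+0041, U+1F600, U+0042
            i += Character.charCount(cp);
        }
        // Or, more simply, stream the codepoints:
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```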
At any rate, what this means is that your code is NOT assigning a Unicode codepoint to `finalInt` in all cases. It works for Unicode characters in the BMP (plane 0) but not for the higher planes. It will break for the Unicode codepoints for emojis, for example.
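Assuming your code does something like `int finalInt = str.charAt(i)` (I am guessing at the exact form), the fix is to ask for the codepoint instead:

```java
public class FinalIntDemo {
    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00";       // U+1F600

        int broken = emoji.charAt(0);        // 0xD83D : just the high surrogate
        int finalInt = emoji.codePointAt(0); // 0x1F600 : the whole codepoint

        System.out.printf("%X vs %X%n", broken, finalInt); // D83D vs 1F600
    }
}
```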