
What am I confused about

Hello, I am currently very confused about the state of charsets and how to best handle them in Android. Sources also seem to provide conflicting information.

1) According to this post, the Java JVM defaults to UTF-16
What is the Java's internal represention for String? Modified UTF-8? UTF-16?

2) According to this post, the Java JVM (under Android) defaults to UTF-8
Android default character encoding
https://developer.android.com/reference/java/nio/charset/Charset.html#defaultCharset()

What I am currently working with

1) I have an Android application that is minSdkVersion 17.

2) It has no settings in its gradle file/manifest file/any file that specifies any preferences regarding charset or encoding.

3) It uses AppCompatEditText and AppCompatTextView from com.android.support:appcompat-v7 some of them constrained with the xml property android:maxLength="140"

4) Code that uses String myText = myView.getText().toString(); to get contents.

5) Code that uses myView.setText(myText); to set contents.

6) Code that uses myText.length(); to measure contents (i.e. to update a "characters remaining" view to match android:maxLength="140").

Basically, remaining.setText(String.valueOf(140 - myText.length())); hooked up to a TextWatcher.onTextChanged event listener.
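For reference, the counter logic above boils down to something like the following plain-Java sketch (the TextWatcher wiring is left out, and the MAX_LENGTH constant is assumed to mirror the android:maxLength value):

```java
public class RemainingCounter {
    static final int MAX_LENGTH = 140; // assumed to mirror android:maxLength="140"

    // Returns how many characters the user may still type,
    // counted the same way String.length() counts.
    static int remaining(String myText) {
        return MAX_LENGTH - myText.length();
    }

    public static void main(String[] args) {
        System.out.println(remaining("hello")); // 135
    }
}
```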

What I need help with

1) A way to standardize both Java and Android on the same charset for my application (forcing them to use either UTF-16 or UTF-8). I don't want to deal with weird corner cases that might arise from working with two different charsets, or from the user setting the default to something different (can they even do that? IDK WHAT IS GOING ON AAAAAAAAAH).

2) A way to standardize the behavior of android:maxLength="140" and remaining.setText(String.valueOf(140 - myText.length()));. I don't know what Android does under the hood with android:maxLength, and I need to ensure that the remaining count never comes back negative, because maybe android:maxLength doesn't count with String.length(); but with some other code-point measuring system.
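To make the worry concrete: String.length() counts UTF-16 code units, so a single visible character outside the Basic Multilingual Plane counts as 2. A minimal sketch (the emoji literal is just an example input):

```java
public class CodeUnitDemo {
    public static void main(String[] args) {
        // U+1F600 GRINNING FACE: one visible character, one code point,
        // but encoded as a surrogate pair of two UTF-16 code units.
        String emoji = "\uD83D\uDE00";

        System.out.println(emoji.length());                          // 2 code units
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
    }
}
```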

3) A way to properly encode and decode String myText data to and from HTTPS to backend Django servers.
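For the HTTPS leg, the usual approach is to pick an explicit charset at the byte boundary instead of relying on the platform default. A minimal round-trip sketch (the Django side would need to decode with the same charset; the sample string is arbitrary):

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        String myText = "h\u00E9llo \uD83D\uDE00";

        // Encode explicitly as UTF-8 before sending over the wire.
        byte[] wire = myText.getBytes(StandardCharsets.UTF_8);

        // Decode explicitly as UTF-8 on the way back in.
        String decoded = new String(wire, StandardCharsets.UTF_8);

        System.out.println(decoded.equals(myText)); // true
    }
}
```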

Sorry if the question is bad or vague. I'm really drowning in charset hell right now. I just need some sort of straightforward checklist to make things "work" in Android...

AlanSTACK

1 Answer


The Android JVM is still a Java JVM, and so it has to follow the Java spec, and that spec says that char is 2 bytes and String uses UTF-16 for its public interface, regardless of the internal representation of the character data in memory. This is stated in Peter's answer to What is the Java's internal represention for String? Modified UTF-8? UTF-16? that you linked to.

Charsets don't apply to String-only operations, like accessing characters, assigning String values, etc. Everything inside the Java app is UTF-16. Charsets only apply to serialization operations that convert between String and bytes, such as file I/O and network I/O: anywhere String data enters or leaves the Java app.
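In other words, the same in-memory String produces different bytes depending on which Charset is chosen at the boundary. A small sketch:

```java
import java.nio.charset.StandardCharsets;

public class CharsetBoundary {
    public static void main(String[] args) {
        // U+20AC EURO SIGN: one UTF-16 code unit in memory, regardless of charset.
        String s = "\u20AC";

        System.out.println(s.length());                                   // 1 code unit
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3 bytes serialized
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2 bytes serialized
    }
}
```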

Remy Lebeau
  • is `String.length();` returning the number of Unicode extended grapheme clusters (i.e. what a human would interpret as a character) or the number of `UTF-16` code points? What about `android:maxLength`? – AlanSTACK Jan 11 '18 at 21:13
  • Neither. They represent the number of `char` elements in a `String`, ie the number of UTF-16 **code units**, not **code points** (`String` has methods for working with codepoints, like `codePointAt`, `codePointBefore`, `codePointCount`, `offsetByCodePoint`, etc). – Remy Lebeau Jan 11 '18 at 21:27
  • So `String.length();` returns number of `UTF-16 code units` and `android:maxLength="140"` bounds the `UTF-16 code units` to less than or equal to 140? So 140 16 bit code units? – AlanSTACK Jan 11 '18 at 21:52
  • @AlanSTACK: yes. Anything that refers to "characters" is in terms of UTF-16 code units, not codepoints or grapheme clusters (unless stated otherwise) – Remy Lebeau Jan 11 '18 at 22:11
  • Last clarification. So Java uses `UTF-16` internally. Even though `defaultCharset()` returns `UTF-8`. And it only uses `UTF-8` when it needs to serialize it - like writing JSON to disk for example. Am I correct? – AlanSTACK Jan 11 '18 at 23:12
  • @AlanSTACK: Java's `String` uses a UTF-16 interface (though it might not actually be UTF-16 in memory, if it can be compacted to use Latin-1 without any data loss). `Charset` is used for things like serialization, not normal `String` operations. – Remy Lebeau Jan 12 '18 at 01:54