This example is explained in the book Hadoop: The Definitive Guide.
The Text class stores data using UTF-8 encoding. Because of that, indexing into a Text is based on the byte offsets of the UTF-8 encoded characters (unlike a Java String, where indexing is by UTF-16 code units, i.e. char positions).
You can read this answer to understand the difference between Text and String in Hadoop:
Difference between Text and String in Hadoop
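To get a quick feel for that difference, here is a minimal sketch (the string "A\u00DFB" and the class name TextVsString are just illustrative choices, not from the book): String reports positions in UTF-16 code units, while Text reports UTF-8 byte offsets.

    import org.apache.hadoop.io.Text;

    public class TextVsString {
        public static void main(String[] args) {
            // 'A' is 1 byte in UTF-8, 'ß' (\u00DF) is 2 bytes, 'B' is 1 byte.
            String s = "A\u00DFB";
            Text t = new Text(s);

            System.out.println(s.length());      // 3 -> number of UTF-16 code units (chars)
            System.out.println(t.getLength());   // 4 -> number of UTF-8 bytes (1 + 2 + 1)
            System.out.println(s.indexOf("B"));  // 2 -> char index
            System.out.println(t.find("B"));     // 3 -> byte offset (1 byte for 'A' + 2 bytes for 'ß')
        }
    }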
The text: "\u0041\u00DF\u6771\uD801\uDC00", is interpreted as follows:
- \u0041 ==> Its Latin letter "A". Its UTF-8 code units:
41
(1 byte)
- \u00DF ==> Its Latin letter "Sharp S". Its UTF-8 code units:
c3 9f
(2 bytes)
- \u6771 ==> A unified Han ideograph (Chinese). Its UTF-8 code units:
e6 9d b1
(3 bytes)
- \uD801\uDC00 ==> Deseret letter (https://en.wikipedia.org/wiki/Deseret_alphabet). Its UTF-8 code units:
f0 90 90 80
(4 bytes)
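You can verify these code units with plain Java, no Hadoop required (the class name Utf8Dump is just a placeholder):

    import java.nio.charset.StandardCharsets;

    public class Utf8Dump {
        public static void main(String[] args) {
            String s = "\u0041\u00DF\u6771\uD801\uDC00";
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

            // Prints: 41 c3 9f e6 9d b1 f0 90 90 80
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8) {
                hex.append(String.format("%02x ", b));
            }
            System.out.println(hex.toString().trim());
            System.out.println(utf8.length); // 10
        }
    }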
Following are the byte offsets when it is stored in a Text (which is UTF-8 encoded):
- Offset of "\u0041" ==> 0
- Offset of "\u00DF" ==> 1 (the previous character occupies 1 byte: 41)
- Offset of "\u6771" ==> 3 (the previous character occupies 2 bytes: c3 9f, so 1 + 2 = 3)
- Offset of "\uD801\uDC00" ==> 6 (the previous character occupies 3 bytes: e6 9d b1, so 1 + 2 + 3 = 6)

Finally, the last character (DESERET CAPITAL LETTER LONG I) occupies 4 bytes: f0 90 90 80.
So the total length is 1 + 2 + 3 + 4 = 10 bytes.
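A short sketch that reproduces these offsets with org.apache.hadoop.io.Text, using print statements instead of assertions (the class name TextOffsets is just for illustration):

    import org.apache.hadoop.io.Text;

    public class TextOffsets {
        public static void main(String[] args) {
            Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");

            System.out.println(t.getLength());           // 10 -> total UTF-8 bytes, not characters
            System.out.println(t.find("\u0041"));        // 0
            System.out.println(t.find("\u00DF"));        // 1
            System.out.println(t.find("\u6771"));        // 3
            System.out.println(t.find("\uD801\uDC00"));  // 6
        }
    }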
When you do t.find("\uD801"), you get -1, because \uD801 is an unpaired surrogate: on its own it is not a character and has no valid UTF-8 encoding, so it cannot appear anywhere in the UTF-8 encoded bytes.
"\uD801\uDC00", on the other hand, is a single character (DESERET CAPITAL LETTER LONG I) encoded as one four-byte sequence, so when you query the offset of "\uD801\uDC00" you get the proper answer of 6.