This example is explained in the book Hadoop: The Definitive Guide.
The Text class stores data using UTF-8 encoding. Because of that, indexing into a Text is based on the byte offsets of the UTF-8 encoded characters (unlike a Java String, where indexing is by UTF-16 code units, i.e. char positions).
You can read this answer to understand the difference between Text and String in Hadoop:
Difference between Text and String in Hadoop
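To get a quick feel for that difference, here is a minimal sketch (the string "A\u00DFB" and the class name TextVsString are just illustrative choices, not from the book): String reports positions in UTF-16 code units, while Text reports UTF-8 byte offsets.

    import org.apache.hadoop.io.Text;

    public class TextVsString {
        public static void main(String[] args) {
            // 'A' is 1 byte in UTF-8, 'ß' (\u00DF) is 2 bytes, 'B' is 1 byte.
            String s = "A\u00DFB";
            Text t = new Text(s);

            System.out.println(s.length());      // 3 -> number of UTF-16 code units (chars)
            System.out.println(t.getLength());   // 4 -> number of UTF-8 bytes (1 + 2 + 1)
            System.out.println(s.indexOf("B"));  // 2 -> char index
            System.out.println(t.find("B"));     // 3 -> byte offset (1 byte for 'A' + 2 bytes for 'ß')
        }
    }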
The text: "\u0041\u00DF\u6771\uD801\uDC00", is interpreted as follows:
- \u0041 ==> Its Latin letter "A". Its UTF-8 code units:
41
(1 byte)
- \u00DF ==> Its Latin letter "Sharp S". Its UTF-8 code units:
c3 9f
(2 bytes)
- \u6771 ==> A unified Han ideograph (Chinese). Its UTF-8 code units:
e6 9d b1
(3 bytes)
- \uD801\uDC00 ==> Deseret letter (https://en.wikipedia.org/wiki/Deseret_alphabet). Its UTF-8 code units:
f0 90 90 80
(4 bytes)
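You can verify these code units with plain Java, no Hadoop required (the class name Utf8Dump is just a placeholder):

    import java.nio.charset.StandardCharsets;

    public class Utf8Dump {
        public static void main(String[] args) {
            String s = "\u0041\u00DF\u6771\uD801\uDC00";
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

            // Prints: 41 c3 9f e6 9d b1 f0 90 90 80
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8) {
                hex.append(String.format("%02x ", b));
            }
            System.out.println(hex.toString().trim());
            System.out.println(utf8.length); // 10
        }
    }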
Following are the byte offsets when it is stored in a Text (which is UTF-8 encoded):
- Offset of "\u0041" ==> 0
- Offset of "\u00DF" ==> 1 (the previous character occupies 1 byte: 41)
- Offset of "\u6771" ==> 3 (the previous character occupies 2 bytes: c3 9f, so 1 + 2 = 3)
- Offset of "\uD801\uDC00" ==> 6 (the previous character occupies 3 bytes: e6 9d b1, so 1 + 2 + 3 = 6)

Finally, the last character (DESERET CAPITAL LETTER LONG I) occupies 4 bytes: f0 90 90 80.
So the total length is 1 + 2 + 3 + 4 = 10 bytes.
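A short sketch that reproduces these offsets with org.apache.hadoop.io.Text, using print statements instead of assertions (the class name TextOffsets is just for illustration):

    import org.apache.hadoop.io.Text;

    public class TextOffsets {
        public static void main(String[] args) {
            Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");

            System.out.println(t.getLength());           // 10 -> total UTF-8 bytes, not characters
            System.out.println(t.find("\u0041"));        // 0
            System.out.println(t.find("\u00DF"));        // 1
            System.out.println(t.find("\u6771"));        // 3
            System.out.println(t.find("\uD801\uDC00"));  // 6
        }
    }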
When you do t.find("\uD801"), you get -1, because \uD801 is an unpaired surrogate: on its own it is not a character and has no valid UTF-8 encoding, so it cannot appear anywhere in the UTF-8 encoded bytes.
"\uD801\uDC00", on the other hand, is a single character (DESERET CAPITAL LETTER LONG I) encoded as one four-byte sequence, so when you query the offset of "\uD801\uDC00" you get the proper answer of 6.