I have used this answer to "manually" convert Unicode code points to UTF-8 code units. The problem is that I need the resulting UTF-8 to be stored in a byte array. How can I do that, using shift operations wherever possible, to go from the hexadecimal code point to UTF-8?

The code I already have is the following:

 public static void main(String[] args)
   throws UnsupportedEncodingException, CharacterCodingException {

   String st = "ñ";

   for (int i = 0; i < st.length(); i++) {
      int unicode = st.charAt(i);
      codepointToUTF8(unicode);
   }
 }

 public static byte[] codepointToUTF8(int codepoint) {
    byte[] hb = codepointToHexa(codepoint);
    byte[] binaryUtf8 = null;

    if (codepoint <= 0x7F) {
      binaryUtf8 = parseRange(hb, 8);
    } else if (codepoint <= 0x7FF) {
      binaryUtf8 = parseRange(hb, 16);
    } else if (codepoint <= 0xFFFF) {
      binaryUtf8 = parseRange(hb, 24);
    } else if (codepoint <= 0x1FFFFF) {
      binaryUtf8 = parseRange(hb, 32);
    }

    byte[] utf8Codeunits = new byte[hexStr.length()];
    for (int i = 0; i < hexStr.length(); i++) {
      utf8Codeunits[i] = (byte) hexStr.charAt(i);
      System.out.println(utf8Codeunits[i]); // prints 99 51 98 49,
      // which is the same as c3b1, the UTF-8 for ñ
    }

    return binaryUtf8;
  }


  public static byte[] codepointToHexa(int codepoint) {
    int n = codepoint;
    int m;

    List<Byte> list = new ArrayList<>();
    while (n >= 16) {
      m = n % 16;
      n = n / 16;
      list.add((byte) m);
    }
    list.add((byte) n);
    byte[] bytes = new byte[list.size()];
    for (int i = list.size() - 1; i >= 0; i--) {
      bytes[list.size() - i - 1] = list.get(i);
    }

    return bytes;
  }

  private static byte[] parseRange(byte[] hb, int length) {

    byte[] binarybyte = new byte[length];
    boolean[] filled = new boolean[length];

    int index = 0;
    if (length == 8) {
      binarybyte[0] = 0;
      filled[0] = true;
    } else {
      int cont = 0;
      while (cont < length / 8) {
        filled[index] = true;
        binarybyte[index++] = 1;
        cont++;
      }
      binarybyte[index] = 0;
      filled[index] = true;
      index = 8;
      while (index < length) {
        filled[index] = true;
        binarybyte[index++] = 1;
        binarybyte[index] = 0;
        filled[index] = true;
        index += 7;
      }
    }

    byte[] hbbinary = convertHexaArrayToBinaryArray(hb);
    int hbindex = hbbinary.length - 1;

    for (int i = length - 1; i >= 0; i--) {
      if (!filled[i] && hbindex >= 0) {
        // we fill it and advance the iterator
        binarybyte[i] = hbbinary[hbindex];
        hbindex--;
        filled[i] = true;
      } else if (!filled[i]) {
        binarybyte[i] = 0;
        filled[i] = true;
      }
    }
    return binarybyte;
  }

 private static byte[] convertHexaArrayToBinaryArray(byte[] hb) {

    byte[] binaryArray = new byte[hb.length * 4];
    String aux = "";
    for (int i = 0; i < hb.length; i++) {

      aux = Integer.toBinaryString(hb[i]);
      int length = aux.length();
      // toBinaryString doesn't return a 4 bit string, so we fill it with 0s
      // if length is not a multiple of 4
      while (length % 4 != 0) {
        length++;
        aux = "0" + aux;
      }

      for (int j = 0; j < aux.length(); j++) {
        binaryArray[i * 4 + j] = (byte) (aux.charAt(j) - '0');
      }
    }
  
    return binaryArray;
  }

I don't know how to handle bytes properly, so I'm aware that the things I did are probably wrong.

Joachim Sauer
randombee
  • Is this homework? You can verify the results using `String.getBytes("UTF-8")`. And Wikipedia will show the bit patterns 10xxxxxx and such. Masking and shifting being no magic. – Joop Eggen Jul 13 '16 at 10:49
  • No, it's not homework. I need the converter for a personal project and I want it to be efficient. I know the bit patterns, since they are in the link I quoted. But I have no idea what to shift (or when to do it) with what I have to get the desired result. – randombee Jul 13 '16 at 11:00
  • ... yeah, it's homework. Wanting to be more "efficient" than proven, tested and readily available JRE methods is kinda ... redundant and smells extremely like an exercise. Probably a college exam - IT students get told that reinventing the wheel is the cool thing to do, nowadays ... which is horrible for their career but it does wonders for useless in-depth knowledge of implementation details. – specializt Jul 13 '16 at 11:47
  • I finished college 2 years ago. You're both wrong, I'm sorry. I just want to learn how things work by coding them myself. Otherwise, I would not learn anything about encodings, just use the existing libraries. But it seems like trying to learn is now regarded as cheating. – randombee Jul 13 '16 at 12:02
  • no, the word "cheating" doesn't even make sense contextually - reinventing the wheel literally serves not a single purpose, there is no knowledge to be gained, only experience in useless topics. If you want to acquire knowledge you will need to read papers, test out reference implementations, assemble large groups of libraries and use them properly **or** start a career in IT research. The wheel won't teach you anything, it just does its job -- and you won't be able to understand the physics behind wheels just because you disassembled one or two but you *could* start inventing a rectangular wheel – specializt Jul 13 '16 at 12:05

1 Answer

UTF-8 packs the bits of a Unicode code point into the following byte patterns:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
... (up to 6 bytes in the original design; RFC 3629 limits UTF-8 to 4 bytes)

Here the x positions hold the bits of the code point, and the rightmost bit is the least significant one.
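For the four lengths that RFC 3629 permits, these patterns can also be written out directly with shifts and masks, one branch per range. The following is only a minimal sketch alongside the loop version below: the class name `Utf8Shift` and the method `encode` are made up for illustration, and rejecting surrogates and values above U+10FFFF is an assumption about the desired behaviour.

```java
public class Utf8Shift {
    /** Hypothetical helper: encodes one code point as UTF-8 with shifts and masks. */
    static byte[] encode(int cp) {
        // Assumption: only Unicode scalar values are accepted.
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("Not a Unicode scalar value: " + cp);
        }
        if (cp < 0x80) {
            // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        // U+00F1 (ñ) falls in the two-byte range.
        for (byte b : encode(0xF1)) {
            System.out.printf("%02x ", b & 0xFF); // prints "c3 b1 "
        }
        System.out.println();
    }
}
```

Each branch shifts the code point right by a multiple of 6 to bring the next group of bits into the low positions, masks with `0x3F` to keep six of them, and ORs in the `10` continuation prefix; only the lead byte uses a different prefix.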

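import java.io.ByteArrayOutputStream;
import java.util.stream.IntStream;

static byte[] utf8(IntStream codePoints) {
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // Scratch buffer; 6 bytes hold 31 payload bits, enough for any non-negative int.
    final byte[] cpBytes = new byte[6];
    codePoints.forEach((cp) -> {
        if (cp < 0) {
            throw new IllegalStateException("No negative code point allowed");
        } else if (cp < 0x80) {
            // 7-bit ASCII is written as a single byte.
            baos.write(cp);
        } else {
            int bi = 0;
            int lastPrefix = 0xC0; // lead-byte prefix for a 2-byte sequence: 110xxxxx
            int lastMask = 0x1F;   // payload bits available in that lead byte
            for (;;) {
                // Emit a continuation byte 10xxxxxx from the lowest 6 bits.
                int b = 0x80 | (cp & 0x3F);
                cpBytes[bi] = (byte)b;
                ++bi;
                cp >>= 6;
                // If the remaining bits fit into the lead byte, finish.
                if ((cp & ~lastMask) == 0) {
                    cpBytes[bi] = (byte) (lastPrefix | cp);
                    ++bi;
                    break;
                }
                // Otherwise the sequence grows by one byte:
                // prefix 110 -> 1110 -> 11110 ..., one payload bit less.
                lastPrefix = 0x80 | (lastPrefix >> 1);
                lastMask >>= 1;
            }
            // The bytes were produced last-to-first, so write them in reverse.
            while (bi > 0) {
                --bi;
                baos.write(cpBytes[bi]);
            }
        }
    });
    return baos.toByteArray();
}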

Apart from 7-bit ASCII, which is written as a single byte, the encoding can be done in a loop.
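As suggested in the comments on the question, a loop of this kind can be checked against the JDK's own encoder. The sketch below is one way to do that, under some assumptions: `Utf8LoopCheck` and `encode` are hypothetical names, the loop is a compact restatement of the prefix/mask idea rather than the exact code above, and the sample string is arbitrary. It walks the string with `codePointAt` instead of `charAt`, because `charAt` returns UTF-16 code units and would split characters above U+FFFF into two surrogates.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8LoopCheck {
    /** Hypothetical helper: loop-based UTF-8 encoding of a whole string. */
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] tmp = new byte[6];
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // full code point, not a UTF-16 unit
            i += Character.charCount(cp);    // advance by 1 or 2 chars
            if (cp < 0x80) {
                out.write(cp);
                continue;
            }
            int n = 0;
            int prefix = 0xC0;               // 110xxxxx
            int mask = 0x1F;                 // payload bits of the lead byte
            while (true) {
                tmp[n++] = (byte) (0x80 | (cp & 0x3F)); // continuation byte
                cp >>= 6;
                if ((cp & ~mask) == 0) {                // rest fits in lead byte
                    tmp[n++] = (byte) (prefix | cp);
                    break;
                }
                prefix = 0x80 | (prefix >> 1);          // 110 -> 1110 -> 11110
                mask >>= 1;
            }
            while (n > 0) {
                out.write(tmp[--n]);                    // reverse order
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // ñ, €, U+1F600 (as a surrogate pair), then plain ASCII
        String sample = "\u00F1\u20AC\uD83D\uDE00abc";
        byte[] mine = encode(sample);
        byte[] jdk = sample.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(mine, jdk)); // true
    }
}
```

The comparison only makes sense for well-formed strings: for an unpaired surrogate the JDK encoder substitutes a replacement byte, while this loop would encode the surrogate value as is.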

Joop Eggen
  • So basically, until the last iteration, we make sure to use only the last 6 bits of the codepoint by making the & with 0x3F, then change the first bit to 1 to make the prefix 10 and remove those 6 bits by shifting to the right. In the last iteration, we do the same with the last prefix, which changes from 11000000 to 11100000 to 11110000... in every iteration to make sure we're using the appropriate prefix. Very useful, thank you! – randombee Jul 13 '16 at 13:22
  • Yes, in the multibyte sequence all continuation bytes are 10xxxxxx. – Joop Eggen Jul 13 '16 at 13:26
  • Note that *standard* UTF-8 can *technically* use up to 6 bytes to encode codepoints up to `U+7FFFFFFF`, but *legally* can only use up to 4 bytes (Java's *Modified* UTF-8 can go up to 6 bytes). [RFC 3629](https://tools.ietf.org/html/rfc3629) restricts the highest codepoint that UTF-8 can legally handle to `U+10FFFF`, which is the highest codepoint that UTF-16 can physically encode, and is the highest codepoint that Unicode currently defines. – Remy Lebeau Jul 14 '16 at 03:47
  • @RemyLebeau very fine comment. Note that I exclude negative ints ("U+80000000" upwards) and reserve only 6 bytes. Another rule is that the _shortest_ byte sequence should be used, which the loop condition above ensures. Another modification in Java's _Modified_ UTF-8 concerns '\u0000' in strings: as C/C++ has problems with such byte arrays as C strings, it also encodes this char as 0xC0, 0x80 (in DataOutputStream). – Joop Eggen Jul 14 '16 at 06:09
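The two Modified UTF-8 differences mentioned in these comments can be observed with `DataOutputStream.writeUTF`. This is a small illustrative sketch; the class `ModifiedUtf8Demo` and its helpers `modified` and `hex` are invented names, and the two leading bytes that `writeUTF` emits as a length prefix are dropped so that only the encoded characters are shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    /** Hypothetical helper: Modified UTF-8 bytes of s, without writeUTF's length prefix. */
    static byte[] modified(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length); // skip the 2-byte length
    }

    /** Hypothetical helper: space-separated lowercase hex dump. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String emoji = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        // NUL: one zero byte in standard UTF-8, two bytes in Modified UTF-8.
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));   // 00
        System.out.println(hex(modified(nul)));                          // c0 80

        // Supplementary character: 4 bytes in standard UTF-8,
        // 6 bytes (each surrogate encoded separately) in Modified UTF-8.
        System.out.println(hex(emoji.getBytes(StandardCharsets.UTF_8))); // f0 9f 98 80
        System.out.println(hex(modified(emoji)));                        // ed a0 bd ed b8 80
    }
}
```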