
I have a situation where I need to know the size of a String/encoding pair, in bytes, but cannot use the getBytes() method because 1) the String is very large and duplicating the String in a byte[] array would use a large amount of memory, but more to the point 2) getBytes() allocates a byte[] array based on the length of the String * the maximum possible bytes per character. So if I have a String with 1.5B characters and UTF-16 encoding, getBytes() will try to allocate a 3GB array and fail, since arrays are limited to 2^31 - X elements (X is Java version specific).
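To make the arithmetic concrete, here's a sketch (the worst-case factor comes from the encoder itself; see brettw's comment below about StringCoding.encode()):

import java.nio.charset.Charset;

// getBytes() sizes its scratch array as length() * maxBytesPerChar() for the
// target charset before trimming, so a very large String blows past the
// ~2^31-1 array length limit long before the encoded result would.
Charset utf16 = Charset.forName("UTF-16");
float worstPerChar = utf16.newEncoder().maxBytesPerChar();
long worstCaseBytes = (long) (1_500_000_000L * worstPerChar);
System.out.println(worstCaseBytes);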

So - is there some way to calculate the byte size of a String/encoding pair directly from the String object?

UPDATE:

Here's a working implementation of jtahlborn's answer:

private class CountingOutputStream extends OutputStream {
    // long, so totals past 2GB don't overflow
    long total;

    @Override
    public void write(int b) {
        total++;
    }

    @Override
    public void write(byte[] b) {
        total += b.length;
    }

    @Override
    public void write(byte[] b, int offset, int len) {
        total += len;
    }
}
elhefe
  • The length in bytes depends on your target encoding. For example, "test".getBytes("UTF-8") is 4 bytes, but "test".getBytes("UTF-16") is 10 bytes (yes, 10, try it). So you need to clarify your question a bit. – brettw Nov 08 '13 at 07:02
  • I would add that it is also dependent on the code points ("characters") you are encoding. For example, in UTF-16, certain code points use 1 code unit, others use 2 (a code unit is 16 bits long). UTF-8 can take anywhere from 1 to 4 bytes per character. – Francis Nov 08 '13 at 07:17
  • @brettw Sorry if I'm being dense, but yes, your comment is the point of the question: given a String and an encoding, how many bytes does encoding the String require? Rereading the question, that seems pretty clear to me - do you have any suggestions for rewording it? – elhefe Nov 08 '13 at 07:30
  • @Francis the comment above applies to your comment as well, to the best of my ability to tell. – elhefe Nov 08 '13 at 07:31
  • `getBytes` does not create an array bigger than it needs to be. It creates an array of the correct size for the given string. It does not create an array of length "length of the String * the maximum possible bytes per character". And `string.length()` does not return the number of characters in a string, it returns the number of code units. For UTF-16, a code unit is 16 bits, and the number of code units per character is either 1 or 2, depending on the character. Therefore, either I don't understand the second point in your question, or your assumption is not correct. – Francis Nov 08 '13 at 07:51
  • @Francis actually not quite correct. `String.getBytes()` calls `StringCoding.encode()` which allocates a *maximal* array that is [length * maximum bytes per character] for the charset (6 in UTF-8). Only after encoding does it trim the array. – brettw Nov 08 '13 at 08:32
  • @elhefe well that is what was not clear for me from the question. You are talking about the way `getBytes` is implemented, I understood that your point was that array *returned* by the method was of the maximum theoretical size. – Francis Nov 08 '13 at 08:41
  • Do you need this for the general case or just a subset (for example UTF-8 and UTF-16)? Because in the later case the code is relatively easy to write. – Joachim Sauer Nov 08 '13 at 09:07
  • @Francis As brettw says, the getBytes() method does indeed allocate a `byte[]` array that can be much larger than the length of the `String`, and therefore cause OOM errors due to attempting to allocate an array with > 2^31-X elements. By 'code unit' do you mean 'code point'? Assuming that's the case, a code point is either 16 or 32 bits and there's one code point per string character, and one or two primitive `char`s per code point / string character. `String.length()` returns the number of primitive `char`s in the string, not the code point count. – elhefe Nov 08 '13 at 19:31
  • @Francis the point of part 2) was that it's impossible to use getBytes() at all on large strings because the method fails. – elhefe Nov 08 '13 at 19:49
  • @JoachimSauer Out of curiosity, how would limiting the charsets to a finite subset make the code easier to write? – elhefe Nov 08 '13 at 19:50
  • @elhefe: it would make it possible to simply implement the algorithm yourself, independently of the `Charset` implementations supported by the JVM. – Joachim Sauer Nov 08 '13 at 20:54
  • @Francis never mind about the code point vs code unit stuff, I get you now. – elhefe Nov 08 '13 at 21:27
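Expanding on Joachim's suggestion in the comments: for one fixed encoding such as UTF-8, the byte count follows directly from each code point's width, so no encoding (and no allocation) is needed at all. A minimal sketch, assuming the String contains no unpaired surrogates:

// Sums the known UTF-8 widths per code point; nothing is materialized.
static long utf8Length(String s) {
    long total = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (cp < 0x80) total += 1;        // ASCII
        else if (cp < 0x800) total += 2;
        else if (cp < 0x10000) total += 3;
        else total += 4;                  // supplementary plane
        i += Character.charCount(cp);
    }
    return total;
}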

5 Answers


Simple, just write it to a dummy output stream:

class CountingOutputStream extends OutputStream {
  private int _total;

  @Override public void write(int b) {
    ++_total;
  }

  @Override public void write(byte[] b) {
    _total += b.length;
  }

  @Override public void write(byte[] b, int offset, int len) {
    _total += len;
  }

  public int getTotalSize() {
    return _total;
  }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string; to avoid that, use:
for(int i = 0; i < myString.length(); i+=8096) {
  int end = Math.min(myString.length(), i+8096);
  writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());

It's not only simple, but probably just as fast as the other "complex" answers.

jtahlborn
  • The COS class doesn't compile, but I added a working implementation to the original question. – elhefe Nov 10 '13 at 03:23
  • @elhefe - your version may compile, but it is incorrect. You don't want to use the offset in the calculation. – jtahlborn Nov 10 '13 at 04:11
  • Whoops, fixed. Apparently only the write(byte[]) method was used by my tests. – elhefe Nov 10 '13 at 04:32
  • @jtahlborn for `int`, you increment only once. Did you assume only a single digit can be passed to this method? I think it should be the number of digits that are to be added to `_total`. Can you please clarify? – mtk Sep 17 '14 at 11:02
  • @mtk - `write(int)` writes a single byte, yes. – jtahlborn Sep 17 '14 at 12:42
  • The size of a 4GB string would not fit in an int. How would you adjust? Would you just change the type of `_total` from int to long, or do you think it requires changes in the overridden methods? – Amin Suzani Oct 16 '15 at 17:57
  • @AminSuzani - changing `_total` to a `long` would be sufficient. – jtahlborn Oct 16 '15 at 18:53
  • I am not exactly sure what is being saved here. The String is still going to be duplicated into a char array by the OutputStreamWriter (via the StreamEncoder.write(String str, int off, int len) method) before it tries to do the byte conversion. – Gareth Jan 17 '16 at 02:34
  • @Gareth - this solution gives you a relatively simple, relatively efficient way to solve the OP's problem. The _most_ efficient solution (as far as I can figure) would involve writing your own character encoder. If you have a better solution, you can add your own answer (I'll vote for it)! – jtahlborn Jan 17 '16 at 21:29
  • But it doesn't solve the OP's problem. You are just replacing the byte[] array allocation the OP was trying to get rid of with another (the char[] array) which will likely turn out around the same size. Of course, if I had a solution I would post it :). – Gareth Jan 19 '16 at 18:13
  • @Gareth - ew, I had assumed that the OutputStreamWriter implementation was not that, uh, "simple". You're right, looking at the code, you'll end up with a copy of the original String. You basically need to chunk the write call to the OutputStreamWriter. – jtahlborn Jan 19 '16 at 18:57
  • @Gareth - updated my answer to avoid that situation. – jtahlborn Jan 19 '16 at 19:01
  • @jtahlborn - I think your updated answer could munge the string if the start or end character index falls in the middle of a surrogate pair. – elhefe Feb 11 '16 at 20:26
  • @elhefe - no, it won't. the underlying encoder should handle that correctly. – jtahlborn Feb 11 '16 at 20:56
  • @jtahlborn - just tested to be sure and you are correct. My apologies. – elhefe Feb 11 '16 at 23:06
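A quick way to check the surrogate concern from the last few comments (a sketch, reusing the CountingOutputStream from the update to the question; U+1F600 is a two-char surrogate pair):

import java.io.OutputStreamWriter;
import java.io.Writer;

// Split a surrogate pair across two write() calls; the writer's internal
// encoder buffers the dangling high surrogate, so the count stays correct.
String s = new StringBuilder().appendCodePoint(0x1F600).toString();
CountingOutputStream cos = new CountingOutputStream();
Writer w = new OutputStreamWriter(cos, "UTF-8");
w.write(s, 0, 1); // high surrogate only
w.write(s, 1, 1); // low surrogate
w.flush();
System.out.println(cos.total); // 4, same as s.getBytes("UTF-8").length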

The same using apache-commons libraries:

import java.io.IOException;
import java.nio.charset.Charset;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.output.CountingOutputStream;
import org.apache.commons.io.output.NullOutputStream;

public static long stringLength(String string, Charset charset) {

    try (NullOutputStream nul = new NullOutputStream();
         CountingOutputStream count = new CountingOutputStream(nul)) {

        IOUtils.write(string, count, charset.name());
        count.flush();
        return count.getByteCount(); // long-based count; getCount() returns only an int
    } catch (IOException e) {
        throw new IllegalStateException("Unexpected I/O.", e);
    }
}
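A hypothetical call site (`myString` is a placeholder):

long size = stringLength(myString, StandardCharsets.UTF_8);

One caveat worth checking: depending on the commons-io version, IOUtils.write(String, OutputStream, String) may simply call string.getBytes(encoding) internally, which would re-introduce the very allocation the question is trying to avoid.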
30thh

Guava has an implementation of this for UTF-8:

Utf8.encodedLength()
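A minimal usage sketch (`myString` is a placeholder). Note it is UTF-8 only, returns an int (so it cannot represent sizes past Integer.MAX_VALUE), and rejects strings containing unpaired surrogates:

import com.google.common.base.Utf8;

int utf8Bytes = Utf8.encodedLength(myString); // no byte[] is ever allocated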

Caio Cunha

Here's an apparently working implementation:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TestUnicode {

    private final static int ENCODE_CHUNK = 100;

    public static long bytesRequiredToEncode(final String s,
            final Charset encoding) {
        long count = 0;
        for (int i = 0; i < s.length(); ) {
            int end = i + ENCODE_CHUNK;
            if (end >= s.length()) {
                end = s.length();
            } else if (Character.isHighSurrogate(s.charAt(end - 1))) {
                // don't split a surrogate pair across a chunk boundary
                end++;
            }
            count += encoding.encode(s.substring(i, end)).remaining();
            i = end;
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.appendCodePoint(11614);
            sb.appendCodePoint(1061122);
            sb.appendCodePoint(2065);
            sb.appendCodePoint(1064124);
        }
        Charset cs = StandardCharsets.UTF_8;

        System.out.println(bytesRequiredToEncode(new String(sb), cs));
        System.out.println(new String(sb).getBytes(cs).length);
    }
}

The output is:

1400
1400

In practice I'd increase ENCODE_CHUNK to 10M chars or so.

Probably slightly less efficient than brettw's answer, but simpler to implement.

elhefe
  • This isn’t so bad, considering that the `OutputStreamWriter` of the other solution will also perform an actual encoding operation into a buffer, before passing it to the `CountingOutputStream`. The only disadvantage is that your solution allocates new `ByteBuffer` instances. When you fix that by implementing the [standard encoding loop](https://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html#steps), you’ve got the fastest possible (generic) solution. See [this answer](https://stackoverflow.com/a/43588801/2711488) for a cheap calculation specifically for UTF-8. – Holger Jan 11 '19 at 15:43
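For reference, a sketch of the standard encoding loop Holger describes, counting into one small, reused ByteBuffer so nothing scales with the input (the method name and buffer size are illustrative):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

static long encodedLength(String s, Charset charset) throws CharacterCodingException {
    CharsetEncoder encoder = charset.newEncoder();
    CharBuffer in = CharBuffer.wrap(s);          // views the String; no copy
    ByteBuffer out = ByteBuffer.allocate(8192);  // reused scratch buffer
    long total = 0;
    while (true) {
        CoderResult r = encoder.encode(in, out, true);
        total += out.position();                 // count, then discard
        out.clear();
        if (r.isUnderflow()) break;              // all input consumed
        if (r.isError()) r.throwException();     // malformed/unmappable input
    }
    while (true) {                               // drain the encoder's tail
        CoderResult r = encoder.flush(out);
        total += out.position();
        out.clear();
        if (r.isUnderflow()) break;
    }
    return total;
}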

Ok, this is extremely gross. I admit that, but this stuff is hidden by the JVM, so we have to dig a little. And sweat a little.

First, we want the actual char[] that backs a String without making a copy. To do this we have to use reflection to get at the 'value' field:

char[] chars = null;
for (Field field : String.class.getDeclaredFields()) {
    if ("value".equals(field.getName())) {
        field.setAccessible(true);
        chars = (char[]) field.get(string); // <--- got it!
        break;
    }
}

Next you need to implement a subclass of java.nio.ByteBuffer. Something like:

class MyByteBuffer extends ByteBuffer {
    int length;
    // Your implementation here
}

Ignore all of the getters, implement all of the put methods like put(byte) and putChar(char) etc. Inside something like put(byte), increment length by 1, inside of put(byte[]) increment length by the array length. Get it? Everything that is put, you add the size of whatever it is to length. But you're not storing anything in your ByteBuffer, you're just counting and throwing away, so no space is taken. If you breakpoint the put methods, you can probably figure out which ones you actually need to implement. putFloat(float) is probably not used, for example.

Now for the grand finale, putting it all together:

MyByteBuffer bbuf = new MyByteBuffer();         // your "counting" buffer
CharBuffer cbuf = CharBuffer.wrap(chars);       // wrap your char array
Charset charset = Charset.forName("UTF-8");     // your charset goes here
CharsetEncoder encoder = charset.newEncoder();  // make a new encoder
encoder.encode(cbuf, bbuf, true);               // do it!
System.out.printf("Length: %d\n", bbuf.length); // pay me US$1,000,000
brettw
  • You can avoid the ugly reflection stuff, by simply calling [`CharBuffer.wrap(CharSequence)`](http://docs.oracle.com/javase/7/docs/api/java/nio/CharBuffer.html#wrap(java.lang.CharSequence)) with the `String` itself. It *will* use the `char[]` from the `String` without copying (at least in Oracle JDK 7 Update 21). – Joachim Sauer Nov 08 '13 at 09:15
  • Oh nice! I did not know that. – brettw Nov 08 '13 at 09:15
  • As @JoachimSauer said long ago, there is no need for this reflection hack, so why does this answer still start with it? Starting with Java 9, it will fail, as the internal array is not a `char[]` (leaving aside alternative JRE implementations where it failed even earlier). Besides that, it's strange to loop over `getDeclaredFields()` instead of just calling `getDeclaredField("value")`, but anyway. The main idea of your answer, creating a subclass of `ByteBuffer` in the application, is impossible. – Holger Jan 11 '19 at 15:24