I have written a Java program that writes a large String to a .txt file. Now I want to know how big the file will be before it is generated. I know the number of chars, but I don't know how to calculate the size of the file from that.

Freakey
  • What encoding are you using? (The Java APIs actually make this harder than one might hope...) – Jon Skeet Jul 10 '15 at 09:23
  • Are you talking about the actual file size or the actual space it will occupy on the hard disk? – crigore Jul 10 '15 at 09:26
  • @JonSkeet please post a correct answer, and also comment on how the actual file size might differ from only the content because of metadata. – Tim Biegeleisen Jul 10 '15 at 09:27
  • Is `generateContent().toString().getBytes().length` not OK? – Rekin Jul 10 '15 at 09:27
  • @TimBiegeleisen: That metadata aspect depends on what you mean by "actual file size". – Jon Skeet Jul 10 '15 at 09:28
  • http://stackoverflow.com/questions/5078314/isnt-the-size-of-character-in-java-2-bytes Maybe this will help? – kevcodez Jul 10 '15 at 09:28
  • @Rekin: Nope: a) it's inefficient; b) it always uses the default encoding. – Jon Skeet Jul 10 '15 at 09:28
  • I think the OP just wants to know how much disk space will be taken up. So he likely has in mind what number would appear from right clicking the file in Explorer and looking at file properties. I am assuming there is another metric as well which could be used. – Tim Biegeleisen Jul 10 '15 at 09:29
  • @JonSkeet: The encoding can be set up, so that's a non-issue. But I just noticed "before it's generated", and that makes it unusable, true. – Rekin Jul 10 '15 at 09:29
  • @TimBiegeleisen: Well right-clicking on explorer, you're likely to get two sizes - size on disk, and logical size. But as the OP hasn't actually given any indication of that, I'm not going to introduce the further complexity of blocks, metadata, compressed file systems etc. – Jon Skeet Jul 10 '15 at 09:44

2 Answers

Java doesn't make this terribly easy, as far as I can see. I believe you do have to actually encode everything, but you don't need to create a big byte array... you can use a CharsetEncoder to keep encoding into a ByteBuffer in order to get the length of each part it encodes. Here's some sample code which I believe to be correct...

import java.nio.*;
import java.nio.charset.*;
import java.util.*;

public class Test {
    public static void main(String[] args) {
        String ascii = createString('A', 2500);
        String u00e9 = createString('\u00e9', 2500); // e-acute
        String euro = createString('\u20ac', 2500); // Euro sign
        // 4 UTF-16 code units, 3 Unicode code points
        String surrogatePair = "X\ud800\udc00Y"; 

        System.out.println(getEncodedLength(ascii, StandardCharsets.UTF_8));
        System.out.println(getEncodedLength(ascii, StandardCharsets.UTF_16BE));

        System.out.println(getEncodedLength(u00e9, StandardCharsets.UTF_8));
        System.out.println(getEncodedLength(u00e9, StandardCharsets.UTF_16BE));

        System.out.println(getEncodedLength(euro, StandardCharsets.UTF_8));
        System.out.println(getEncodedLength(euro, StandardCharsets.UTF_16BE));

        System.out.println(getEncodedLength(surrogatePair, StandardCharsets.UTF_8));
        System.out.println(getEncodedLength(surrogatePair, StandardCharsets.UTF_16BE));
    }


    private static String createString(char c, int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, c);
        return new String(chars);
    }

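    // Encodes the text chunk by chunk into a small reusable buffer and sums the
    // chunk sizes, so the full byte array is never held in memory.
    // Note: malformed input (e.g. an unpaired surrogate) stops the encoder early
    // and is not reported here.
    public static int getEncodedLength(String text, Charset charset) {
        ByteBuffer byteBuffer = ByteBuffer.allocate(1024);
        CharBuffer charBuffer = CharBuffer.wrap(text);
        CharsetEncoder encoder = charset.newEncoder();

        int length = 0;
        // OVERFLOW means the byte buffer is full: count its bytes, then reuse it.
        while (encoder.encode(charBuffer, byteBuffer, false) == CoderResult.OVERFLOW) {
            length += byteBuffer.position();
            byteBuffer.clear();
        }

        // Signal end of input and count whatever is left in the buffer.
        encoder.encode(charBuffer, byteBuffer, true);
        length += byteBuffer.position();
        return length;
    }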
}

Output:

2500
5000
5000
5000
7500
5000
6
8
Jon Skeet

When you say "how big the file will be", I assume you mean the number of bytes stored in the file.

Assuming you're encoding with UTF-8, a pessimistic estimate is 3 times the number of chars in your string, because UTF-8 encodes some Unicode code points as 3 bytes. It also uses 4-byte sequences, but only for the code points that UTF-16 represents as surrogate pairs. A surrogate pair consists of two Java chars, so the byte-to-char ratio for them is just 2 (4 bytes for 2 chars).

If your file sticks to the ASCII subset of Unicode, the estimate is exact: the byte count equals the number of characters in the string.
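
As a rough sketch of these two bounds, assuming UTF-8 and counting Java chars (the class and method names are hypothetical, chosen only for illustration):

```java
class Utf8Bounds {
    /** Lower bound: every char contributes at least one byte (pure ASCII). */
    static long minUtf8Bytes(int charCount) {
        return charCount;
    }

    /** Upper bound: no char contributes more than three bytes. */
    static long maxUtf8Bytes(int charCount) {
        // 3L forces long arithmetic so large counts do not overflow int.
        return 3L * charCount;
    }

    public static void main(String[] args) {
        int charCount = 2500;
        // prints: 2500 to 7500 bytes
        System.out.println(minUtf8Bytes(charCount) + " to " + maxUtf8Bytes(charCount) + " bytes");
    }
}
```

For a 2500-char string this gives a range of 2500 to 7500 bytes; the real size lands somewhere in between depending on the characters used.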

To get the exact number of bytes for UTF-8 encoding you will actually have to scan the string char by char and add the size of each particular codeword. You can refer to the Wikipedia page on UTF-8 to find out these sizes.
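
A minimal sketch of such a scan, with hypothetical class and method names. It assumes that an unpaired surrogate is counted as 1 byte, which matches `String.getBytes(StandardCharsets.UTF_8)` replacing it with `?`; a strict encoder would report an error instead.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    /** Counts the bytes UTF-8 needs for the text without encoding it. */
    static long utf8Length(CharSequence text) {
        long bytes = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                bytes += 1;                 // ASCII
            } else if (c < 0x800) {
                bytes += 2;                 // U+0080..U+07FF
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                bytes += 4;                 // surrogate pair: one 4-byte sequence
                i++;                        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                bytes += 1;                 // assumption: unpaired surrogate becomes '?'
            } else {
                bytes += 3;                 // rest of the Basic Multilingual Plane
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String sample = "A\u00e9\u20acX\ud800\udc00Y";
        System.out.println(utf8Length(sample));                             // prints 12
        System.out.println(sample.getBytes(StandardCharsets.UTF_8).length); // prints 12
    }
}
```

The scan allocates nothing, so it should stay cheap even for very large strings. It only covers UTF-8; for other encodings the byte sizes per character differ.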

Marko Topolnik