I am reading a CSV file that has a header row, followed by rows of data for each of those columns.

When I read the file, I want to know the total size of the CSV file in MB. For each String line, which is more correct for computing the overall size of the file (I will sum the sizes of the lines): line.getBytes().length or line.length()*Character.BYTES?

Using line.getBytes().length gives me 841 while line.length()*Character.BYTES gives me 1682. I haven't studied this topic in much depth, so I am not sure which one is correct for the general size of a CSV file (the size that would show up in the Size column of a folder on a Mac).

*I do need the size of each line as I will need to determine what to do with it depending on the size.

stackerstack
  • It's a lot easier to get the size of a file than this: https://stackoverflow.com/questions/14478968/get-total-size-of-file-in-bytes – Robert Harvey Aug 29 '21 at 22:52
  • Java characters are 16-bit unsigned integral type. Not bytes. One character is two (or more) bytes in Java. Because Chinese and Japanese and Korean (and Hebrew and Arabic) are all also languages, and have unique characters. Unicode vs ASCII. – Elliott Frisch Aug 29 '21 at 22:52
  • @RobertHarvey yes I know that, I do need the size of each line so I can determine what to do with the line depending on the size. – stackerstack Aug 29 '21 at 22:54
  • @ElliottFrisch so i should be using ```line.length * Character.BYTES``` ? – stackerstack Aug 29 '21 at 22:57
  • https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#getBytes() - uses the platform-dependent default charset. So really it is the safer alternative, for example because ASCII characters will count as one byte if the default encoding is UTF-8. – Clashsoft Aug 29 '21 at 22:57
  • `line.getBytes().length` should be correct. It's entirely possible your entire line is ASCII. – Robert Harvey Aug 29 '21 at 22:59
  • What about `line.length()`? Exactly what do you want? You've told us two different values. But not what the input was. – Elliott Frisch Aug 29 '21 at 23:00

2 Answers

Reading a line drops the line-ending characters for you: typically CR+LF "\r\n" on Windows and LF "\n" on Linux. Those bytes count toward the file size but are not part of the line you get back.

A String is a sequence of char values, and a char is a 2-byte UTF-16 code unit: either a whole Unicode symbol (a so-called code point) or one half of one. That is why the "length" gets multiplied by 2.

String.getBytes() uses the default Charset. If the file is in UTF-8, one would write:

getBytes(StandardCharsets.UTF_8)

UTF-8 uses one byte for ASCII characters and multiple bytes for other characters such as é.
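To make the difference concrete, here is a minimal sketch comparing the two formulas from the question. The class name LineByteSizes, its helper methods and the sample lines are made up for illustration, and it assumes the file is encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

class LineByteSizes {
    // Bytes the line occupies in a UTF-8 encoded file (line terminator not included).
    static int utf8Size(String line) {
        return line.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes occupied by the line's UTF-16 code units (2 bytes per char).
    static int utf16Size(String line) {
        return line.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        String ascii = "id,name,price";         // hypothetical sample line, ASCII only
        String accented = "id,caf\u00e9,price"; // same length, but contains é

        System.out.println(utf8Size(ascii) + " vs " + utf16Size(ascii));
        System.out.println(utf8Size(accented) + " vs " + utf16Size(accented));
    }
}
```

For a line of pure ASCII the two numbers differ by exactly a factor of two, which matches the 841 versus 1682 in the question.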

So there are a couple of pitfalls. Best simply to ask the size of the file with Files.size(path).

Path path = Paths.get("... .csv");
long size = Files.size(path); // size in bytes; Files, Path and Paths are in java.nio.file
int intSize = size > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int)size; // clamp to the int range, ahum
Joop Eggen

If you want to know the size of a file in bytes, the best way is to call File.length().

For an InputStream that isn't a file, or for obscure situations where File.length() gives an incorrect answer, you (generally) need to read the file using an InputStream, and count the bytes that you read. (You could do this using a FilterInputStream in your input stack ... before it does the byte -> character decoding.)
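A rough sketch of that byte-counting idea follows. CountingInputStream and CountingDemo are hypothetical names rather than JDK classes, the sample data is invented, and UTF-8 is assumed for the decoding step. The counter sits below the reader, so it sees the raw bytes, line terminators included.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class CountingDemo {
    // Reads every line through a counting stream and returns the raw byte total.
    static long countBytes(InputStream raw) {
        CountingInputStream counting = new CountingInputStream(raw);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(counting, StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                // A real program would process each line here.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counting.getCount();
    }

    public static void main(String[] args) {
        // Hypothetical in-memory "file" with mixed line terminators.
        byte[] data = "a,b\r\nc,d\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(countBytes(new ByteArrayInputStream(data)));
    }
}

// Counts bytes as they pass through; mark/reset is not accounted for.
class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    long getCount() {
        return count;
    }
}
```

Because the reader buffers ahead, the running count is only reliable as a total once the stream has been read to the end; it does not tell you the byte position of the line you are currently looking at.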

The approach of measuring or estimating line lengths in characters then converting that to bytes is problematic:

  1. Depending on the file's charset and the actual characters, line.length() * Character.BYTES is probably wrong ... unless the encoding is UTF-16.

    (Character.BYTES is the size in bytes of the char type. That is unrelated to the number of bytes used to encode any given char or Unicode code point in an input or output stream.)

  2. Even if you can accurately determine how many bytes are used in each line string, you don't know how many bytes are used for the line terminators. Is it 1 or 2 bytes? (Or 4: e.g. "\r\n" in UTF-16)

    Note that the fact that you are reading a file on (say) Linux doesn't mean that the file will use Linux line terminators. And strictly speaking you cannot assume that the line terminators in a file are even consistent.

  3. Even if you can accurately determine how many bytes are used for a typical line terminator, you don't know ... and cannot tell using BufferedReader or a typical CSV reader API ... if the last line has a line terminator at the end.
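If per-line sizes are really needed, one way around these three problems is to measure lines on the raw bytes instead of on decoded strings. The sketch below is only an illustration: RawLineSizes and lineSizes are made-up names, it assumes an ASCII-compatible encoding such as UTF-8 (so a 0x0A byte is always a line feed, which is not true for UTF-16), it treats a lone "\r" as ordinary data, and it assumes the data fits in memory.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RawLineSizes {
    // Returns the size in bytes of each line, terminator included.
    // A final line without a terminator is reported with its actual size.
    static List<Integer> lineSizes(byte[] data) {
        List<Integer> sizes = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {          // covers both "\n" and "\r\n"
                sizes.add(i - start + 1);
                start = i + 1;
            }
        }
        if (start < data.length) {          // last line has no terminator
            sizes.add(data.length - start);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hypothetical sample: CRLF line, LF line, unterminated last line.
        byte[] data = "a,b\r\nc,d\ne,f".getBytes(StandardCharsets.UTF_8);
        List<Integer> sizes = lineSizes(data);
        long total = 0;
        for (int size : sizes) {
            total += size;
        }
        System.out.println(sizes + " total=" + total + " fileSize=" + data.length);
    }
}
```

With this approach the per-line sizes add up to the byte length of the input, because terminators and a missing final terminator are counted as they actually occur. For a real file the bytes could come from Files.readAllBytes(path).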

Stephen C