
My Java Spark program ingests a 3.7 GB file. When I launch the program and open the Spark UI at localhost:4040, the input size shown for the load stage is 7.3 GB, which is really confusing. Why does the Spark UI show an input size almost double the size of the file actually being ingested?


user836087

1 Answer


The input size:

  • Is estimated.
  • Is not the size of the file you load but the size of the loaded objects, which in general require more memory to store than their serialized form (pointers to the actual objects, overhead of the data structures used to hold the data).
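
As a rough illustration of the second point, the sketch below compares the number of bytes an ASCII record occupies when encoded as UTF-8 (as it typically would be in a file) with the bytes needed for the same text as 16-bit Java `char` values. The class name, method names, and sample record are hypothetical, and this only shows how one representation of the same data can be about twice the size of another; it is not a claim about how Spark computes the figure shown in the UI.

```java
import java.nio.charset.StandardCharsets;

public class SizeOverheadSketch {
    // Bytes the text occupies when encoded as UTF-8, as it typically would be in a file.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes the same text occupies as UTF-16 code units (2 bytes per char, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String line = "id,name,value"; // hypothetical ASCII record, not taken from the question
        System.out.println("UTF-8 bytes:  " + utf8Bytes(line));
        System.out.println("UTF-16 bytes: " + utf16Bytes(line));
    }
}
```

For pure ASCII text the second number is exactly twice the first, and that is before counting object headers and references. Whether this accounts for the 3.7 GB versus 7.3 GB difference in the question is an assumption that would need to be checked against the actual job.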
user10553610
  • I don't think this answers the question. It can't be off by a factor of two. I want to know what Spark is doing that makes the input size double in the Spark UI. Is this a Spark UI bug? – user836087 Oct 24 '18 at 18:32
  • @user836087 Without judging the answer - https://stackoverflow.com/a/18030595 - even for a simple string the overhead is not negligible. – 10465355 Oct 24 '18 at 23:25
  • @user836087 I agree with @user10553610. While you have a real question about what makes the sizes different, I see no reason to think it is being calculated incorrectly, as your question suggests. The numbers differ because they do not measure the same thing: the data is serialized, overhead is added, the figure is an estimate, and most importantly the input size is not the file size of the source but how much data is read in. – Robert Beatty Oct 25 '18 at 13:35
  • Remember that Spark is a distributed system; it may have to read the same blocks multiple times for various reasons. With that in mind, there is no reason to expect the sizes to match. If you really care why they differ, you could look at the code on GitHub, or find the committers and message them. I'm sure you will learn some interesting things about Spark, but I doubt the size is truly incorrect. – Robert Beatty Oct 25 '18 at 13:35
  • These are speculations. There's no reason for that UI to present a size greater than the file. It's either an error or something missing in the UI in terms of description about what exactly it is. – user836087 Oct 27 '18 at 17:16